On Mon, Jun 5, 2017 at 10:32 PM, Ben Goertzel <[email protected]> wrote:
> For the first stages of the language learning process, as I understand
> it, what we really need to do is just
>
> ***
> 1. Split input into sentences (which can be done with lots of
> sentence-splitters, including ours)

Except that they all suck, and when you look around, there aren't really
that many of them. The one we have is not great, but it's the simplest,
smallest, best I could find.

Ben, you have this habit of trivializing everything. Theoretically, yes,
sentence-splitting is one of those trivializable problems, but when you
actually have to do it, and make things work, it takes time and effort.
Over the last ten years, I've worked with 5 or 6 different sentence
splitters, and I've devoted maybe a month of my time to screwing with
them: downloading them, installing, testing, plugging them into the
infrastructure, fixing the inevitable bugs that manifest, then having to
re-run things after fixing the bugs. And then sitting there, with a blank
brain, staring at the screen, waiting to see if it crashes again. And when
it doesn't, you go pee, come back, check the logs, cross your fingers, and
check again in a few hours.

A *month* is a long time to devote to sentence splitting. But that's just
how it is. To you it all seems very easy, and this is the classic
programmer pitfall: imagining that something takes only one afternoon. But
it never, ever does. The actual practice is both time-consuming and
tediously boring. So I kind of resent it when you talk about how trivial
this is and how it can be done anywhere by anything.

> 2. For each sentence S:
>
> -- do some stemming (again that can be done with lots of stemmers,
> including our own), so that each word is associated with a stem

There is absolutely no stemming done. It's fundamentally incorrect to stem.

> -- identify each pair of words (W1, W2) so that W1 occurs before W2 in S

Only if you do clique counting. We argued about clique counting already.
It sucks.
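For concreteness, "identify each pair of words (W1, W2) so that W1 occurs
before W2" is what clique counting amounts to. A minimal sketch (in Python,
which is not what the pipeline uses; all names here are mine, for
illustration only):

```python
from itertools import combinations
from collections import Counter

def clique_pairs(sentence):
    """All ordered pairs (W1, W2) with W1 occurring before W2 in the
    sentence -- the full 'clique' of word pairs."""
    words = sentence.split()
    return list(combinations(words, 2))

# Accumulate pair counts over a (toy) corpus.
counts = Counter()
for sent in ["the cat sat on the mat"]:
    counts.update(clique_pairs(sent))

# A 6-word sentence already yields 6*5/2 = 15 pairs; the number of
# pairs grows quadratically with sentence length.
```

Note the quadratic blowup: most of those pairs link words that have no
grammatical relation to each other, which is part of why I object to it.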
Anyway, it's pointless, because the goal is to parse.

> -- create and update some Atoms based on the pair (W1, W2)

Yes ...

> ***
>
> The Atoms to be created are something like
>
> -- an EvaluationLink indicating a link-parser AnyLink between W1 and W2

Yes.

> -- updating a couple counts based on Linas's recent code

Yes.

> I'm in too much of a hurry to look up the format of these links right
> now but Ruiting probably remembers or I could look it up this
> afternoon...
>
> That would do it right?

Do what?

> The current approach replaces the "identify each pair of words" step
> with "do a random link parse", but it is not clear that doing a random
> link parse is actually better; my own feeling is that it probably
> isn't, but we debated this extensively before without resolution....
> To replicate the current pipeline better one would replace
>
> -- identify each pair of words (W1, W2) so that W1 occurs before W2 in S

Well, you've got to add some kind of length limit. How far apart do you
want the words to be?

> with
>
> -- do a random link parse, then identify each pair of words (W1, W2)
> that are linked in the random link parse

Well, the idea is that the parses get less and less random over time. So
doing length-limited clique counting is pointless. It's just make-work for
someone to write the code. Don't create make-work for no reason. Again:
see my tirade above about how something can be theoretically trivial, yet
excruciatingly time-consuming in practice.

> ...
>
> Am I missing something?
>
> The above could all be done in C++ perfectly well; it doesn't require
> Guile because it doesn't require any of the fancy stuff in the current
> NLP pipeline...

What's the point? Why bother? Why would you want to do this? What does it
accomplish? What does it solve?
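To make the contrast with clique counting concrete: a random link parse
links each word to only a few others, so each parse contributes O(n) pairs
rather than O(n^2). A rough sketch (again Python, again not the pipeline's
code; a real Link Grammar parse with ANY links is also planar, i.e. has no
crossing links, which this toy version ignores for brevity):

```python
import random
from collections import Counter

def random_parse_pairs(words, rng):
    """One 'random parse': a random tree over word positions, built by
    linking each word to a randomly chosen earlier word. Returns the
    (W1, W2) pairs that are linked, with W1 preceding W2."""
    return [(words[rng.randrange(i)], words[i]) for i in range(1, len(words))]

rng = random.Random(42)
counts = Counter()
words = "the cat sat on the mat".split()
for _ in range(1000):
    counts.update(random_parse_pairs(words, rng))

# Each parse contributes len(words) - 1 = 5 pairs, versus the 15 pairs
# per sentence that full clique counting would produce.
```

The point of the pipeline is that, as pair statistics accumulate, the
parses are steered by mutual information and become less and less random,
which is something length-limited clique counting cannot do.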
--linas

> -- Ben
>
> On Tue, Jun 6, 2017 at 11:07 AM, Ben Goertzel <[email protected]> wrote:
> > On Tue, Jun 6, 2017 at 3:10 AM, Linas Vepstas <[email protected]> wrote:
> >> Re: running LG in the same address space as the atomspace: this has
> >> already been done; the surreal code does this. In a day or 2 or 3 you
> >> could write the needed wrapper code to have LG live directly inside of
> >> opencog, generating the correct atoms, thus totally bypassing guile and
> >> garbage collection. And this would be a very easy way to get a 3x
> >> speedup, if that's really your end-goal. It's a lot easier than all the
> >> other crazy schemes discussed.
> >
> > yeah, we were discussing this yesterday... I think we may do something
> > like this... we will discuss again this afternoon...
> >
> > --
> > Ben Goertzel, PhD
> > http://goertzel.org
> >
> > "I am God! I am nothing, I'm play, I am freedom, I am life. I am the
> > boundary, I am the peak." -- Alexander Scriabin
>
> --
> Ben Goertzel, PhD
> http://goertzel.org
>
> "I am God! I am nothing, I'm play, I am freedom, I am life. I am the
> boundary, I am the peak." -- Alexander Scriabin
