On Mon, Jun 5, 2017 at 10:32 PM, Ben Goertzel <[email protected]> wrote:
> For the first stages of the language learning process, as I understand
> it, what we really need to do is just
>
> ***
> 1. Split input into sentences (which can be done with lots of
> sentence-splitters, including ours)

Except that they all suck, and when you look around, there aren't really
that many of them. The one we have is not great, but it's the simplest,
smallest, best I could find.

Ben, you have this habit of trivializing everything. Theoretically, yes,
sentence-splitting is one of those trivializable problems, but when you
actually have to do it, and make things work, it takes time and effort.
Over the last ten years, I've worked with 5 or 6 different sentence
splitters, and I've devoted maybe a month of my time to screwing with
them: downloading them, installing, testing, plugging them into the
infrastructure, fixing the inevitable bugs that manifest, then having to
re-run things after fixing the bugs. And then sitting there, with a blank
brain, staring at the screen, waiting to see if it crashes again. And when
it doesn't, you go pee, come back, check the logs, cross your fingers, and
check again in a few hours.

A *month* is a long time to devote to sentence splitting. But that's just
how it is. To you it all seems very easy, and this is the classic
programmer pitfall: imagining that something takes only one afternoon. But
it never, ever does. The actual practice is both time-consuming and
tediously boring. So I kind of resent it when you talk about how trivial
this is and how it can be done anywhere by anything.

> 2. For each sentence S:
>
> -- do some stemming (again that can be done with lots of stemmers,
> including our own), so that each word is associated with a stem

There is absolutely no stemming done. It's fundamentally incorrect to stem.

> -- identify each pair of words (W1, W2) so that W1 occurs before W2 in S

Only if you do clique counting. We argued about clique counting already.
It sucks.
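For concreteness, "identify each pair of words (W1, W2) so that W1 occurs
before W2" is what clique counting amounts to. A minimal sketch (in Python,
which is not what the pipeline uses; all names here are mine, for
illustration only):

```python
from itertools import combinations
from collections import Counter

def clique_pairs(sentence):
    """All ordered pairs (W1, W2) with W1 occurring before W2 in the
    sentence -- the full 'clique' of word pairs."""
    words = sentence.split()
    return list(combinations(words, 2))

# Accumulate pair counts over a (toy) corpus.
counts = Counter()
for sent in ["the cat sat on the mat"]:
    counts.update(clique_pairs(sent))

# A 6-word sentence already yields 6*5/2 = 15 pairs; the number of
# pairs grows quadratically with sentence length.
```

Note the quadratic blowup: most of those pairs link words that have no
grammatical relation to each other, which is part of why I object to it.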
Anyway, it's pointless, because the goal is to parse.

> -- create and update some Atoms based on the pair (W1, W2)

Yes ...

> ***
>
> The Atoms to be created are something like
>
> -- an EvaluationLink indicating a link-parser AnyLink between W1 and W2

Yes.

> -- updating a couple counts based on Linas's recent code

Yes.

> I'm in too much of a hurry to look up the format of these links right
> now but Ruiting probably remembers or I could look it up this
> afternoon...
>
> That would do it right?

Do what?

> The current approach replaces the "identify each pair of words" step
> with "do a random link parse", but it is not clear that doing a random
> link parse is actually better; my own feeling is that it probably
> isn't, but we debated this extensively before without resolution....
> To replicate the current pipeline better one would replace
>
> -- identify each pair of words (W1, W2) so that W1 occurs before W2 in S

Well, you've got to add some kind of length limit. How far apart do you
want the words to be?

> with
>
> -- do a random link parse, then identify each pair of words (W1, W2)
> that are linked in the random link parse

Well, the idea is that the parses get less and less random over time. So
doing length-limited clique counting is pointless. It's just make-work for
someone to write the code. Don't create make-work for no reason. Again:
see my tirade above about how something can be theoretically trivial, yet
excruciatingly time-consuming in practice.

> ...
>
> Am I missing something?
>
> The above could all be done in C++ perfectly well; it doesn't require
> Guile because it doesn't require any of the fancy stuff in the current
> NLP pipeline...

What's the point? Why bother? Why would you want to do this? What does it
accomplish? What does it solve?
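To make the contrast with clique counting concrete: a random link parse
links each word to only a few others, so each parse contributes O(n) pairs
rather than O(n^2). A rough sketch (again Python, again not the pipeline's
code; a real Link Grammar parse with ANY links is also planar, i.e. has no
crossing links, which this toy version ignores for brevity):

```python
import random
from collections import Counter

def random_parse_pairs(words, rng):
    """One 'random parse': a random tree over word positions, built by
    linking each word to a randomly chosen earlier word. Returns the
    (W1, W2) pairs that are linked, with W1 preceding W2."""
    return [(words[rng.randrange(i)], words[i]) for i in range(1, len(words))]

rng = random.Random(42)
counts = Counter()
words = "the cat sat on the mat".split()
for _ in range(1000):
    counts.update(random_parse_pairs(words, rng))

# Each parse contributes len(words) - 1 = 5 pairs, versus the 15 pairs
# per sentence that full clique counting would produce.
```

The point of the pipeline is that, as pair statistics accumulate, the
parses are steered by mutual information and become less and less random,
which is something length-limited clique counting cannot do.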
--linas

> -- Ben
>
> On Tue, Jun 6, 2017 at 11:07 AM, Ben Goertzel <[email protected]> wrote:
> > On Tue, Jun 6, 2017 at 3:10 AM, Linas Vepstas <[email protected]> wrote:
> >> Re: running LG in the same address space as the atomspace: this has
> >> already been done; the surreal code does this. In a day or 2 or 3 you
> >> could write the needed wrapper code to have LG live directly inside of
> >> opencog, generating the correct atoms, thus totally bypassing guile and
> >> garbage collection. And this would be a very easy way to get a 3x
> >> speedup, if that's really your end-goal. It's a lot easier than all the
> >> other crazy schemes discussed.
> >
> > yeah, we were discussing this yesterday... I think we may do something
> > like this... we will discuss again this afternoon...
> >
> > --
> > Ben Goertzel, PhD
> > http://goertzel.org
> >
> > "I am God! I am nothing, I'm play, I am freedom, I am life. I am the
> > boundary, I am the peak." -- Alexander Scriabin
>
> --
> Ben Goertzel, PhD
> http://goertzel.org
>
> "I am God! I am nothing, I'm play, I am freedom, I am life. I am the
> boundary, I am the peak." -- Alexander Scriabin
