I've got databases so large that they don't fit into RAM, and I've crashed my system several times because of this. I've had to regenerate them over and over, as I've discovered and patched various bugs and/or experienced data loss in several different ways: e.g., the Wikipedia vs. Gutenberg issue, and several thunderstorms that scrambled SQL because I hadn't configured it for safety :-( Stuff like that. And I've done this for multiple languages... So I find it hard to believe that you're stuck on Step 0. All of my issues are elsewhere.
I can send you a medium-sized dataset. It's got some issues, still, but it's usable. It's ready to go: it's got 8M distinct word pairs observed 270M times, and 7M distinct connector sets observed 260M times. It's small enough that it loads fairly fast, and you can load all of it, if you wish; it probably fits within 64GB RAM, maybe.

--linas

On Wed, Jun 7, 2017 at 1:43 AM, Ben Goertzel <[email protected]> wrote:
> On Wed, Jun 7, 2017 at 1:37 PM, Linas Vepstas <[email protected]>
> wrote:
> >
> > Well, you've got to add some kind of length limits. How far apart do you
> > want the words to be?
>
> That's a parameter one can set, it could be n=10 or n=20 ... a while
> ago we passed around some papers indicating the length of an average
> dependency link in English, Chinese and other languages...
>
> >> The above could all be done in C++ perfectly well; it doesn't require
> >> Guile because it doesn't require any of the fancy stuff in the current
> >> NLP pipeline...
> >
> > What's the point? Why bother? Why would you want to do this? What does
> > it accomplish? What does it solve?
>
> It would get us a way of doing Step 0 of the language learning pipeline
> that
>
> -- works smoothly without hitting weird hard-to-solve Guile bugs and such
>
> -- grabs possible dependencies in a way that seems less wasteful to me
> than generating a bunch of random parses (i.e. just building links between
> words that are reasonably near each other)
>
> Obviously this is not the interesting part of the language learning
> algorithm ... but we need to get this first part working reliably and
> reasonably rapidly to get to the other parts, right? So far Ruiting
> hasn't been able to get to the point of experimenting with clustering
> and disambiguation algorithms because the Step 0 code hits these Guile
> bugs. But there's no need to have complex code or Guile code for Step
> 0, all we need is simpler stuff...
>
> -- Ben
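[The "Step 0" pair-gathering Ben describes above — just building links between words that are reasonably near each other, with some window parameter like n=10 or n=20 — can be sketched in a few lines. This is only an illustrative sketch, not the actual OpenCog pipeline code; the function name and tokenization are made up for the example.]

```python
from collections import Counter

def count_pairs(tokens, n=10):
    """Tally ordered word pairs (left, right) whose positions differ by at most n.

    This is the naive window-based alternative to generating random parses:
    every pair of words within distance n gets a candidate link, and we
    just count how often each pair is observed.
    """
    pairs = Counter()
    for i, left in enumerate(tokens):
        # Link 'left' to every word up to n positions to its right.
        for j in range(i + 1, min(i + n + 1, len(tokens))):
            pairs[(left, tokens[j])] += 1
    return pairs

# Toy usage: a 6-word sentence with a window of n=2.
counts = count_pairs("the cat sat on the mat".split(), n=2)
```

With a real corpus this is the kind of counting that produces the millions of distinct word pairs mentioned in the dataset above; the length limit n is the parameter the papers on average dependency-link length would inform.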
