OK, yes, for part 1) I think the README file explains all but the newest steps in detail. I'll update it shortly to add the newest ones. If there's confusion, ask. I keep detailed notes mostly because I can't remember how to replicate any of this stuff myself.
It would really help if someone could find & prepare some clean text of adventure novels, young-adult lit, or any kind of narrative literature -- maybe from Project Gutenberg. I've discovered that Wikipedia has three major faults:

* It has very few action verbs.
* It has lots of lists, tables, names, dates, and weird punctuation.
* It has lots of articles about movies and rock bands with stupid names, all of which glom up the statistics with garbage.

That, and lots of geographical and product names and model numbers, which add lots of obscure or nonsense words and don't help with grammar at all. Again: lists of names, dates, football leagues, awards, recording contracts, run-times, publisher names. This is almost my new #1 priority, I think. Any help would help.

For part 2), it would be a database dump. To figure out what to do with it, going at least partly into part 1) would clarify what's in there. Some versions of some of these databases are getting huge -- 50 million or 100 million atoms, which take around a kilobyte of RAM each, so 50 or more GB to load into RAM. So typically, you don't want to actually load it all... except during MST parsing.

There are two hard parts to clustering. One is writing all the code to get the clusters working in the pipeline; I guess I'll have to do that. The other is dealing with words with multiple meanings: in "I saw the man with the saw", clustering really needs to distinguish saw-the-verb from saw-the-noun. Not yet clear about the details of this; I've a glimmer of the general idea...

--linas

On Wed, May 10, 2017 at 9:21 PM, Ben Goertzel <[email protected]> wrote:
> Very cool stuff !!
>
> So there are two things I'm thinking Ruiting can do in this regard, in
> the near term...
>
> 1) run this on a larger corpus and see what happens
>
> 2) try various clustering approaches on the "feature structures" for
> words implicit in the parse-trees you've gotten from this first phase
>
> For this we would need the following...
>
> For 1), we'd need some good instructions on how to replicate the
> experiments you've just run (potentially on an additional text corpus)
>
> For 2), we'd need an Atomspace (Scheme file, postgres dump, whatever)
> containing the first-pass parses you've obtained for the sentences in
> your test corpus
>
> Can you share these w/ Ruiting sometime soon?
>
> thanks!
>
> On Thu, May 11, 2017 at 6:34 AM, Linas Vepstas <[email protected]> wrote:
> > Attached PDF reports on a small, early snapshot of what the database
> > looks like. Basically, it looks promising. I'm moving on to the next
> > step, which is to reparse with the clusters. There's various parts of
> > the theory I don't understand, as well as a lot of code to write, to
> > build a pipeline from the atomspace back into the link-grammar parser.
> >
> > --linas
>
> --
> Ben Goertzel, PhD
> http://goertzel.org
>
> "I am God! I am nothing, I'm play, I am freedom, I am life. I am the
> boundary, I am the peak." -- Alexander Scriabin
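P.S. A quick back-of-envelope check of the RAM numbers above. This assumes the ~1 KB-per-atom figure from this thread; the actual per-atom footprint will vary with atom type and implementation, so treat it as an estimate only:

```python
# Rough RAM estimate for loading an AtomSpace dump entirely into memory,
# assuming roughly 1 KiB of RAM per atom (an approximation from this
# thread, not a measured constant).

def atomspace_ram_gib(num_atoms: int, bytes_per_atom: int = 1024) -> float:
    """Estimated memory, in GiB, to hold `num_atoms` resident in RAM."""
    return num_atoms * bytes_per_atom / 2**30

for n in (50_000_000, 100_000_000):
    print(f"{n:,} atoms -> ~{atomspace_ram_gib(n):.1f} GiB")
# 50 million atoms comes out to roughly 48 GiB, 100 million to roughly 95 GiB,
# which is why you don't want to load the whole thing except for MST parsing.
```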

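P.P.S. To make the "saw the verb vs. saw the noun" problem concrete, here's a toy sketch of the general idea: represent each occurrence of an ambiguous word by its neighboring words, then group occurrences whose contexts overlap. The sentences, window size, and overlap threshold are all made up for illustration; this is not the pipeline code, just a picture of why the two senses should land in different clusters:

```python
# Toy word-sense splitting: each occurrence of an ambiguous word becomes a
# bag of nearby words; occurrences with overlapping contexts are grouped.
from collections import Counter

SENTENCES = [
    "I saw the man",           # saw = verb
    "she saw the dog",         # saw = verb
    "he cut wood with a saw",  # saw = noun
    "a sharp saw cuts fast",   # saw = noun
]

def contexts(word, sentences, window=2):
    """Bag-of-words context for each occurrence of `word`."""
    out = []
    for s in sentences:
        toks = s.split()
        for i, t in enumerate(toks):
            if t == word:
                ctx = toks[max(0, i - window):i] + toks[i + 1:i + 1 + window]
                out.append(Counter(ctx))
    return out

def overlap(a, b):
    """Number of shared context words between two occurrences."""
    return sum((a & b).values())

def cluster(ctxs, threshold=1):
    """Greedy single-link grouping: join contexts sharing enough words."""
    clusters = []
    for c in ctxs:
        for cl in clusters:
            if any(overlap(c, m) >= threshold for m in cl):
                cl.append(c)
                break
        else:
            clusters.append([c])
    return clusters

groups = cluster(contexts("saw", SENTENCES))
print(len(groups))  # -> 2: the verb contexts and the noun contexts separate
```

In the real pipeline the "context" would presumably be something like the disjuncts attached to each word occurrence rather than raw neighbors, but the shape of the problem is the same.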