OK, yes, for part 1) I think the README file explains all but the newest steps in detail. I'll update it shortly to add the newest ones. If there's confusion, ask. I keep detailed notes mostly because I can't remember how to replicate any of this stuff myself.
It would really help if someone could find & prepare some clean text of adventure novels, young-adult lit, or any kind of narrative literature -- maybe from Project Gutenberg. I've discovered that Wikipedia has three major faults:

* It has very few action verbs.
* It has lots of lists, tables, names, dates, and weird punctuation.
* It has lots of articles about movies and rock bands with stupid names, all of which glom up the statistics with garbage.

That, and lots of geographical and product names and model numbers, which add lots of obscure or nonsense words and don't help with grammar at all. Again: lists of names, dates, football leagues, awards, recording contracts, run-times, publisher names. This is almost my new #1 priority, I think. Any help would help.

For part 2), it would be a database dump. To figure out what to do with it, going at least partly into part 1) would clarify what's in there. Some versions of some of these databases are getting huge -- 50 million or 100 million atoms, which take around a kilobyte of RAM each, so 50 or more GB to load into RAM. So typically, you don't want to actually load it all... except during MST parsing.

There are two hard parts to clustering. One is writing all the code to get the clusters working in the pipeline; I guess I'll have to do that. The other is dealing with words with multiple meanings: in "I saw the man with the saw", clustering really needs to distinguish saw-the-verb from saw-the-noun. Not yet clear about the details of this; I've a glimmer of the general idea...

--linas

On Wed, May 10, 2017 at 9:21 PM, Ben Goertzel <[email protected]> wrote:
> Very cool stuff !!
>
> So there are two things I'm thinking Ruiting can do in this regard, in
> the near term...
>
> 1) run this on a larger corpus and see what happens
>
> 2) try various clustering approaches on the "feature structures" for
> words implicit in the parse-trees you've gotten from this first phase
>
> For this we would need the following...
>
> For 1), we'd need some good instructions on how to replicate the
> experiments you've just run (potentially on an additional text corpus)
>
> For 2), we'd need an Atomspace (Scheme file, postgres dump, whatever)
> containing the first-pass parses you've obtained for the sentences in
> your test corpus
>
> Can you share these w/ Ruiting sometime soon?
>
> thanks!
>
> On Thu, May 11, 2017 at 6:34 AM, Linas Vepstas <[email protected]> wrote:
> > Attached PDF reports on a small, early snapshot of what the database
> > looks like. Basically, it looks promising. I'm moving on to the next
> > step, which is to reparse with the clusters. There's various parts of
> > the theory I don't understand, as well as a lot of code to write, to
> > build a pipeline from the atomspace back into the link-grammar parser.
> >
> > --linas
>
> --
> Ben Goertzel, PhD
> http://goertzel.org
>
> "I am God! I am nothing, I'm play, I am freedom, I am life. I am the
> boundary, I am the peak." -- Alexander Scriabin
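P.S. A quick back-of-envelope check of the RAM numbers above. This assumes the ~1 KB-per-atom figure from this thread; the actual per-atom footprint will vary with atom type and implementation, so treat it as an estimate only:

```python
# Rough RAM estimate for loading an AtomSpace dump entirely into memory,
# assuming roughly 1 KiB of RAM per atom (an approximation from this
# thread, not a measured constant).

def atomspace_ram_gib(num_atoms: int, bytes_per_atom: int = 1024) -> float:
    """Estimated memory, in GiB, to hold `num_atoms` resident in RAM."""
    return num_atoms * bytes_per_atom / 2**30

for n in (50_000_000, 100_000_000):
    print(f"{n:,} atoms -> ~{atomspace_ram_gib(n):.1f} GiB")
# 50 million atoms comes out to roughly 48 GiB, 100 million to roughly 95 GiB,
# which is why you don't want to load the whole thing except for MST parsing.
```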

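P.P.S. To make the "saw the verb vs. saw the noun" problem concrete, here's a toy sketch of the general idea: represent each occurrence of an ambiguous word by its neighboring words, then group occurrences whose contexts overlap. The sentences, window size, and overlap threshold are all made up for illustration; this is not the pipeline code, just a picture of why the two senses should land in different clusters:

```python
# Toy word-sense splitting: each occurrence of an ambiguous word becomes a
# bag of nearby words; occurrences with overlapping contexts are grouped.
from collections import Counter

SENTENCES = [
    "I saw the man",           # saw = verb
    "she saw the dog",         # saw = verb
    "he cut wood with a saw",  # saw = noun
    "a sharp saw cuts fast",   # saw = noun
]

def contexts(word, sentences, window=2):
    """Bag-of-words context for each occurrence of `word`."""
    out = []
    for s in sentences:
        toks = s.split()
        for i, t in enumerate(toks):
            if t == word:
                ctx = toks[max(0, i - window):i] + toks[i + 1:i + 1 + window]
                out.append(Counter(ctx))
    return out

def overlap(a, b):
    """Number of shared context words between two occurrences."""
    return sum((a & b).values())

def cluster(ctxs, threshold=1):
    """Greedy single-link grouping: join contexts sharing enough words."""
    clusters = []
    for c in ctxs:
        for cl in clusters:
            if any(overlap(c, m) >= threshold for m in cl):
                cl.append(c)
                break
        else:
            clusters.append([c])
    return clusters

groups = cluster(contexts("saw", SENTENCES))
print(len(groups))  # -> 2: the verb contexts and the noun contexts separate
```

In the real pipeline the "context" would presumably be something like the disjuncts attached to each word occurrence rather than raw neighbors, but the shape of the problem is the same.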