Very cool!

Ruiting should be ready to start playing w/ this data on Tuesday, I think...

On Sun, May 7, 2017 at 4:11 AM, Linas Vepstas <[email protected]> wrote:
> Ben, Ruiting,
>
> For your enjoyment: I have some very preliminary results on word similarity.
> They look pretty nice, even though based on a fairly small number of
> observations.
>
> If you've been watching TV instead of reading email, here's the story so
> far: Starting from a large text corpus, the mutual information (MI) of
> word-pairs is computed. This MI is used to perform a maximum spanning-tree
> (MST) parse of a different subset of the corpus. From each parse, a
> pseudo-disjunct is extracted for each word.  The pseudo-disjunct is like a
> real LG disjunct, except that each connector in the disjunct is the word at
> the far end of the link.
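
A rough sketch of the MST step described above, in plain Python rather than the actual OpenCog code. The function name `mst_parse` and the MI values are made up for illustration, and the real parser may apply extra constraints (for example, forbidding crossing links) that this sketch ignores. It runs Prim's algorithm over word positions, always adding the highest-MI link that reaches a new word.

```python
def mst_parse(words, mi):
    """Return links (i, j), i < j, forming a maximum spanning tree over positions.

    `mi` maps an ordered pair (left_word, right_word) to a score.
    Pairs missing from `mi` are treated as having score -infinity.
    """
    if not words:
        return []

    def score(i, j):
        lo, hi = min(i, j), max(i, j)
        return mi.get((words[lo], words[hi]), float("-inf"))

    in_tree = {0}  # start from the first word
    links = []
    while len(in_tree) < len(words):
        # Best link from a word already in the tree to a word outside it.
        _, i, j = max(
            (score(i, j), i, j)
            for i in in_tree
            for j in range(len(words))
            if j not in in_tree
        )
        links.append((min(i, j), max(i, j)))
        in_tree.add(j)
    return sorted(links)


# Toy MI scores, invented for this example only.
toy_mi = {("Ben", "ate"): 3.0, ("ate", "pizza"): 4.0, ("Ben", "pizza"): 0.5}
links = mst_parse(["Ben", "ate", "pizza"], toy_mi)
print(links)
```

With these toy scores the tree links "Ben" to "ate" and "ate" to "pizza", skipping the weak "Ben"/"pizza" pair.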
>
> So, for example, in an idealized world, the MST parse of the sentence "Ben
> ate pizza" would produce the parse Ben <--> ate <--> pizza and from this,
> we can extract the pseudo-disjunct (Ben- pizza+) on the word "ate".
> Similarly, the sentence "Ben puked pizza" should produce the disjunct (Ben-
> pizza+) on the word "puked".  Since these two disjuncts are the same, we can
> conclude that the two words "ate" and "puked" are very similar to each other.
> Considering all of the other disjuncts that arise in this example, we can
> conclude that these are the only two words that are similar.
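
The extraction in this example can be sketched in a few lines of plain Python. This is only an illustration, not the OpenCog implementation: the function name `pseudo_disjuncts`, the tuple representation, and the left-to-right ordering of connectors are all assumptions. Links are given as pairs of word positions, with "-" marking a link to the left and "+" a link to the right.

```python
def pseudo_disjuncts(words, links):
    """Return a list of (word, connectors) pairs, one per word position.

    `links` is a list of (i, j) position pairs with i < j.
    Each connector is the word at the far end of a link, plus a direction mark.
    """
    result = []
    for pos, word in enumerate(words):
        left = sorted(i for i, j in links if j == pos)
        right = sorted(j for i, j in links if i == pos)
        connectors = tuple(
            [words[i] + "-" for i in left] + [words[j] + "+" for j in right]
        )
        result.append((word, connectors))
    return result


# The idealized parses from the example: Ben <--> verb <--> pizza.
chain = [(0, 1), (1, 2)]
ate = dict(pseudo_disjuncts(["Ben", "ate", "pizza"], chain))
puked = dict(pseudo_disjuncts(["Ben", "puked", "pizza"], chain))

print(ate["ate"])      # ('Ben-', 'pizza+')
print(puked["puked"])  # ('Ben-', 'pizza+')
```

The two verbs end up with the identical disjunct, while "Ben" and "pizza" get disjuncts that name the verb and so differ between the two sentences.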
>
> Note that a given word may have very many pseudo-disjuncts attached to it.
> Each disjunct has a count of the number of times it has been observed.
> Thus, this set of disjuncts can be imagined to be a vector in a
> high-dimensional vector space, with each disjunct being a single basis
> element.  The similarity of two words can be taken to be the
> cosine-similarity between the disjunct-vectors (or pick another, different
> metric, as you please.)
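
For anyone who wants to play with the idea outside the AtomSpace, here is a minimal sketch of cosine similarity over disjunct-count vectors. It is plain Python and not the code behind cset-vec-cosine; the function name `cosine` and the counts are invented for illustration. Each word's vector is a dict from disjunct to observation count, so only disjuncts seen on both words contribute to the dot product.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse count vectors stored as dicts."""
    dot = sum(count * v.get(disjunct, 0) for disjunct, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a word with no observations is similar to nothing
    return dot / (norm_u * norm_v)


# Made-up counts: disjunct -> number of times observed on the word.
ate = {("Ben-", "pizza+"): 3, ("Ben-",): 1}
puked = {("Ben-", "pizza+"): 2}
jump = {("they-",): 5}

sim_shared = cosine(ate, puked)  # share one disjunct
sim_disjoint = cosine(ate, jump)  # share none
print(sim_shared, sim_disjoint)
```

Words with no disjuncts in common come out at exactly 0.0, which is the same pattern as the exact zeros in the results.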
>
> Below is a set of examples, for English, on a somewhat small
> dataset. Collected over a few days, it contains just under half-a-million
> observations of disjuncts, distributed across about 30K words. Thus, most
> words will have only a couple of disjuncts on them, which may have been seen
> only a couple of times. It's important, at this stage, to limit oneself to
> only the most popular words.
>
> We expect the determiners "the" and "a" to be similar, and they are:
> (cset-vec-cosine (Word "the") (Word "a")) = 0.1554007744026141
>
> Even more similar:
> (cset-vec-cosine (Word "the") (Word "this")) = 0.3083725359820755
>
> Not very similar at all:
> (cset-vec-cosine (Word "the") (Word "that")) = 0.01981486048876119
>
> Oh hey, "this" and "that" are similar. Notice the triangle with "the".
> (cset-vec-cosine (Word "this") (Word "that")) = 0.14342403062507977
>
> Some more results:
>  (cset-vec-cosine (Word "this") (Word "these")) = 0.23100101197144984
>  (cset-vec-cosine (Word "this") (Word "those")) = 0.1099725424243773
>  (cset-vec-cosine (Word "these") (Word "those")) = 0.13577971016706158
>
> We expect determiners, nouns and verbs to all be very different
> from one another. And they are:
>  (cset-vec-cosine (Word "the") (Word "ball")) = 2.3964597196461594e-4
>  (cset-vec-cosine (Word "the") (Word "jump")) = 0.0
>  (cset-vec-cosine (Word "ball") (Word "jump")) = 0.0
>
> We expect verbs to be similar, and they sort-of are.
>  (cset-vec-cosine (Word "run") (Word "jump")) = 0.05184758473652128
>  (cset-vec-cosine (Word "run") (Word "look")) = 0.05283524652572603
>
> Since this is a sampling from Wikipedia, there will be very few "action"
> verbs, unless the sample accidentally contains articles about sports. A
> "common sense" corpus, or a corpus that talks about what people do,
> could/should improve the above verbs.  These are very basic to human
> behavior, but are rare in most writing.
>
> I'm thinking that a corpus of children's lit, and young-adult-lit would be
> much better for these kinds of things.
>
> An adjective.
>  (cset-vec-cosine (Word "wide") (Word "narrow")) = 0.06262242910851494
>  (cset-vec-cosine (Word "wide") (Word "look")) = 0.0
>  (cset-vec-cosine (Word "wide") (Word "ball")) = 0.02449979787750126
>  (cset-vec-cosine (Word "wide") (Word "the")) = 0.04718158900583385
>
>  (cset-vec-cosine (Word "heavy") (Word "wide")) = 0.05752237416355278
>
> Here's a set of antonyms!
>  (cset-vec-cosine (Word "heavy") (Word "light")) = 0.16760038078849773
>
> A pronoun
>  (cset-vec-cosine (Word "ball") (Word "it")) = 0.009201177048960233
>  (cset-vec-cosine (Word "wide") (Word "it")) = 0.005522960959398417
>
>  (cset-vec-cosine (Word "the") (Word "it")) = 0.01926824360790382
>
> Wow!! In English, "it" is usually a male!
>  (cset-vec-cosine (Word "it") (Word "she")) = 0.1885493638629482
>  (cset-vec-cosine (Word "it") (Word "he")) = 0.4527656594627214
>  (cset-vec-cosine (Word "he") (Word "she")) = 0.1877589816088902
>
> I can post the database on Monday; let me know when you're ready to
> receive it.
>
> --linas
>
>



-- 
Ben Goertzel, PhD
http://goertzel.org

"I am God! I am nothing, I'm play, I am freedom, I am life. I am the
boundary, I am the peak." -- Alexander Scriabin
