Ben, Ruiting,

For your enjoyment: I have some very preliminary results on word similarity. They look pretty nice, even though they're based on a fairly small number of observations.
If you've been watching TV instead of reading email, here's the story so far: starting from a large text corpus, the mutual information (MI) of word-pairs is computed. This MI is used to perform a maximum spanning-tree (MST) parse of a different subset of the corpus. From each parse, a pseudo-disjunct is extracted for each word. The pseudo-disjunct is like a real LG disjunct, except that each connector in the disjunct is the word at the far end of the link. So, for example, in an idealized world, the MST parse of the sentence "Ben ate pizza" would produce the parse

   Ben <--> ate <--> pizza

and from this, we can extract the pseudo-disjunct (Ben- pizza+) on the word "ate". Similarly, the sentence "Ben puked pizza" should produce the disjunct (Ben- pizza+) on the word "puked". Since these two disjuncts are the same, we can conclude that the two words "ate" and "puked" are very similar to each other. Considering all of the other disjuncts that arise in this example, we can conclude that these are the only two words that are similar.

Note that a given word may have very many pseudo-disjuncts attached to it. Each disjunct carries a count of the number of times it has been observed. Thus, this set of disjuncts can be imagined to be a vector in a high-dimensional vector space, with each disjunct being a single basis element. The similarity of two words can be taken to be the cosine-similarity between the disjunct-vectors (or pick another, different metric, as you please.)

Below is a set of examples, for English, on a somewhat small dataset. Collected over a few days, it contains just under half-a-million observations of disjuncts, distributed across about 30K words. Thus, most words will have only a couple of disjuncts on them, which may have been seen only a couple of times. It's important, at this stage, to limit oneself to only the most popular words.

We expect the determiners "the" and "a" to be similar, and they are:

   (cset-vec-cosine (Word "the") (Word "a")) = 0.1554007744026141

Even more similar:

   (cset-vec-cosine (Word "the") (Word "this")) = 0.3083725359820755

Not very similar at all:

   (cset-vec-cosine (Word "the") (Word "that")) = 0.01981486048876119

Oh hey, "this" and "that" are similar. Notice the triangle with "the": it is similar to "this", and "this" is similar to "that", but "the" and "that" are not.

   (cset-vec-cosine (Word "this") (Word "that")) = 0.14342403062507977

Some more results:

   (cset-vec-cosine (Word "this") (Word "these")) = 0.23100101197144984
   (cset-vec-cosine (Word "this") (Word "those")) = 0.1099725424243773
   (cset-vec-cosine (Word "these") (Word "those")) = 0.13577971016706158

We expect determiners, nouns and verbs to all be very different from one another. And they are:

   (cset-vec-cosine (Word "the") (Word "ball")) = 2.3964597196461594e-4
   (cset-vec-cosine (Word "the") (Word "jump")) = 0.0
   (cset-vec-cosine (Word "ball") (Word "jump")) = 0.0

We expect verbs to be similar, and they sort-of are:

   (cset-vec-cosine (Word "run") (Word "jump")) = 0.05184758473652128
   (cset-vec-cosine (Word "run") (Word "look")) = 0.05283524652572603

Since this is a sampling from Wikipedia, there will be very few "action" verbs, unless the sample accidentally contains articles about sports. A "common sense" corpus, or a corpus that talks about what people do, could/should improve the scores for the above verbs. These verbs are very basic to human behavior, but are rare in most writing. I'm thinking that a corpus of children's lit and young-adult lit would be much better for these kinds of things.
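As an aside, in case you want to play with the numbers yourself: the cosine is just the ordinary inner-product formula applied to the disjunct counts. Below is a minimal stand-alone Guile sketch, under the assumption that each word's disjuncts have been exported as an alist of (disjunct . count) pairs. This is NOT the actual cset-vec-cosine code (that one pulls its counts straight out of the AtomSpace); it only illustrates the formula.

   (use-modules (srfi srfi-1))   ; for fold

   ;; Dot product of two sparse count-vectors, represented as alists
   ;; mapping a disjunct (here, just a string) to an observation count.
   ;; Disjuncts missing from vec-b are treated as having count zero.
   (define (dot-product vec-a vec-b)
     (fold
       (lambda (pr sum)
         (let ((other (assoc (car pr) vec-b)))
           (if other (+ sum (* (cdr pr) (cdr other))) sum)))
       0 vec-a))

   ;; Cosine similarity: dot product divided by the vector lengths.
   (define (cosine vec-a vec-b)
     (/ (dot-product vec-a vec-b)
        (sqrt (* (dot-product vec-a vec-a)
                 (dot-product vec-b vec-b)))))

   ;; Toy example: "ate" and "puked" share the (Ben- pizza+) disjunct.
   (define ate   '(("Ben- pizza+" . 3) ("he- quickly+" . 1)))
   (define puked '(("Ben- pizza+" . 2)))
   (cosine ate puked)   ; => 0.9486832980505138

The point of the sparse representation is that each word touches only a tiny handful of the basis elements, so the dot product only ever loops over the disjuncts a word actually has.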
(cset-vec-cosine (Word "wide") (Word "narrow")) = 0.06262242910851494 (cset-vec-cosine (Word "wide") (Word "look")) = 0.0 (cset-vec-cosine (Word "wide") (Word "ball")) = 0.02449979787750126 (cset-vec-cosine (Word "wide") (Word "the")) = 0.04718158900583385 (cset-vec-cosine (Word "heavy") (Word "wide")) = 0.05752237416355278 Here's a set of antonyms! (cset-vec-cosine (Word "heavy") (Word "light")) = 0.16760038078849773 A pronoun (cset-vec-cosine (Word "ball") (Word "it")) = 0.009201177048960233 (cset-vec-cosine (Word "wide") (Word "it")) = 0.005522960959398417 (cset-vec-cosine (Word "the") (Word "it")) = 0.01926824360790382 Wow!! In English, "it" is usually a male! (cset-vec-cosine (Word "it") (Word "she")) = 0.1885493638629482 (cset-vec-cosine (Word "it") (Word "he")) = 0.4527656594627214 (cset-vec-cosine (Word "he") (Word "she")) = 0.1877589816088902 I can post the database on mondy, let me know when you're ready to receive it. --linas -- You received this message because you are subscribed to the Google Groups "opencog" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To post to this group, send email to [email protected]. Visit this group at https://groups.google.com/group/opencog. To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/CAHrUA371BxbyuAGbJ7c7%3DJ8GVUR%2B9QNwxeUmWgHsP-BadfYD_A%40mail.gmail.com. For more options, visit https://groups.google.com/d/optout.
