[opencog-dev] Re: [Link Grammar] Re: Word similarity results; database almost ready

'Nil Geisweiller' via opencog Sun, 07 May 2017 01:21:33 -0700

There are a few distances defined in cogutils (if that helps)


https://github.com/opencog/cogutils/blob/master/opencog/util/jaccard_index.h#L35

https://github.com/opencog/cogutils/blob/master/opencog/util/numeric.h#L405

https://github.com/opencog/cogutils/blob/master/opencog/util/numeric.h#L472

https://github.com/opencog/cogutils/blob/master/opencog/util/numeric.h#L509

Nil

On 05/07/2017 10:27 AM, Ben Goertzel wrote:

Hmm.. it's hard to know what distance measure is best without playing
w. the data first

Jaccard and Tanimoto similarity (the latter not quite corresponding to
a metric) may be useful, I dunno...

https://en.wikipedia.org/wiki/Jaccard_index#Generalized_Jaccard_similarity_and_distance
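
For readers who want to try these measures on count vectors, here is a minimal sketch of the generalized (weighted) Jaccard similarity and the Tanimoto similarity described on the page linked above. The function names, the dict-of-counts representation, and the toy vectors are assumptions made for illustration; this is not the cogutils code. A distance can be taken as 1 minus either similarity.

```python
def generalized_jaccard(a, b):
    """Sum of per-key minima divided by sum of per-key maxima.

    a, b: dicts mapping a feature (e.g. a disjunct) to a non-negative count.
    """
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    # Two empty vectors are treated as having similarity 0.0 (a choice made here).
    return num / den if den else 0.0


def tanimoto(a, b):
    """dot(a, b) / (|a|^2 + |b|^2 - dot(a, b))."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    den = sum(v * v for v in a.values()) + sum(v * v for v in b.values()) - dot
    return dot / den if den else 0.0


# Hypothetical count vectors, for illustration only.
x = {"d1": 2, "d2": 1}
y = {"d1": 1, "d3": 1}

print(generalized_jaccard(x, y))  # minima sum to 1, maxima sum to 4
print(tanimoto(x, y))             # dot = 2, |x|^2 = 5, |y|^2 = 2
```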

Some clustering methods will be able to use these precomputed
distances; others (like NN-based methods) sorta compute distances in
the midst of doing their other stuff anyway...



On Sun, May 7, 2017 at 3:22 PM, Linas Vepstas <[email protected]> wrote:

OK. Close coordination will be needed. I'm planning on creating a database
with several different kinds of distance measures precomputed. This is
feasible because the database is currently small.

Any favorite distance measures you might recommend, besides the cosine
distance?

--linas

On Sat, May 6, 2017 at 10:25 PM, Ben Goertzel <[email protected]> wrote:


Very cool!

Ruiting should be ready to start playing w/ this data on Tuesday, I
think...

On Sun, May 7, 2017 at 4:11 AM, Linas Vepstas <[email protected]> wrote:

Ben, Ruiting,

For your enjoyment: I have some very preliminary results on word
similarity. They look pretty nice, even though based on a fairly small
number of observations.

If you've been watching TV instead of reading email, here's the story so
far: Starting from a large text corpus, the mutual information (MI) of
word-pairs is computed. This MI is used to perform a maximum
spanning-tree (MST) parse of a different subset of the corpus. From each
parse, a pseudo-disjunct is extracted for each word. The pseudo-disjunct
is like a real LG disjunct, except that each connector in the disjunct
is the word at the far end of the link.

So, for example, in an idealized world, the MST parse of the sentence
"Ben ate pizza" would produce the parse Ben <--> ate <--> pizza and from
this, we can extract the pseudo-disjunct (Ben- pizza+) on the word "ate".
Similarly, the sentence "Ben puked pizza" should produce the disjunct
(Ben- pizza+) on the word "puked". Since these two disjuncts are the
same, we can conclude that the two words "ate" and "puked" are very
similar to each other. Considering all of the other disjuncts that arise
in this example, we can conclude that these are the only two words that
are similar.
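
The extraction step in this example can be sketched as follows. This is only an illustration under stated assumptions: the parse is given as a list of undirected links between word positions, connectors are listed in sentence order, and the function name `pseudo_disjuncts` is hypothetical rather than part of the actual pipeline.

```python
from collections import defaultdict


def pseudo_disjuncts(words, links):
    """Return a list of (word, disjunct) pairs for one parsed sentence.

    words: list of tokens.
    links: list of (i, j) position pairs, one per link in the parse.
    Each connector is the word at the far end of a link, marked "-" if that
    word lies to the left and "+" if it lies to the right.
    """
    result = []
    for pos, word in enumerate(words):
        neighbors = sorted(j if i == pos else i
                           for i, j in links if pos in (i, j))
        connectors = [words[n] + ("-" if n < pos else "+") for n in neighbors]
        result.append((word, "(" + " ".join(connectors) + ")"))
    return result


# The idealized parses from the example: Ben <--> verb <--> pizza.
chain = [(0, 1), (1, 2)]
by_disjunct = defaultdict(set)
for sentence in (["Ben", "ate", "pizza"], ["Ben", "puked", "pizza"]):
    for word, disjunct in pseudo_disjuncts(sentence, chain):
        by_disjunct[disjunct].add(word)

# Keep only the disjuncts that are shared by more than one word.
shared = {dj: ws for dj, ws in by_disjunct.items() if len(ws) > 1}
print(shared)
```

In this toy run the only shared disjunct is (Ben- pizza+), carried by "ate" and "puked"; "Ben" and "pizza" each pick up distinct disjuncts such as (ate+) and (puked+).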

Note that a given word may have very many pseudo-disjuncts attached to
it. Each disjunct has a count of the number of times it has been
observed. Thus, this set of disjuncts can be imagined to be a vector in
a high-dimensional vector space, with each disjunct being a single basis
element. The similarity of two words can be taken to be the
cosine-similarity between the disjunct-vectors (or pick another,
different metric, as you please.)
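
As a rough illustration of the vector picture above, here is a minimal sketch of cosine similarity over sparse disjunct-count vectors. It is not the implementation behind `cset-vec-cosine`; the function name and the counts are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors.

    a, b: dicts mapping a disjunct string to its observation count.
    """
    dot = sum(count * b.get(dj, 0) for dj, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical observation counts, for illustration only.
ate = {"(Ben- pizza+)": 3, "(she- soup+)": 1}
puked = {"(Ben- pizza+)": 1}
jump = {"(he- high+)": 2}

print(cosine_similarity(ate, puked))  # 3 / sqrt(10)
print(cosine_similarity(ate, jump))   # no disjunct in common
```

With non-negative counts, the cosine is exactly zero whenever two words share no disjunct at all, which is one way a small dataset can yield results of 0.0.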

Below is a set of examples, for English, on a somewhat small dataset.
Collected over a few days, it contains just under half-a-million
observations of disjuncts, distributed across about 30K words. Thus,
most words will have only a couple of disjuncts on them, which may have
been seen only a couple of times. It's important, at this stage, to
limit oneself to only the most popular words.

We expect the determiners "the" and "a" to be similar, and they are:
(cset-vec-cosine (Word "the") (Word "a")) = 0.1554007744026141

Even more similar:
(cset-vec-cosine (Word "the") (Word "this")) = 0.3083725359820755

Not very similar at all:
(cset-vec-cosine (Word "the") (Word "that")) = 0.01981486048876119

Oh hey this and that are similar. Notice the triangle with "the".
(cset-vec-cosine (Word "this") (Word "that")) = 0.14342403062507977

Some more results
 (cset-vec-cosine (Word "this") (Word "these")) = 0.23100101197144984
 (cset-vec-cosine (Word "this") (Word "those")) = 0.1099725424243773
 (cset-vec-cosine (Word "these") (Word "those")) = 0.13577971016706158

We expect determiners, nouns and verbs to all be very different
from one another. And they are:
 (cset-vec-cosine (Word "the") (Word "ball")) = 2.3964597196461594e-4
 (cset-vec-cosine (Word "the") (Word "jump")) = 0.0
 (cset-vec-cosine (Word "ball") (Word "jump")) = 0.0

We expect verbs to be similar, and they sort-of are.
 (cset-vec-cosine (Word "run") (Word "jump")) = 0.05184758473652128
 (cset-vec-cosine (Word "run") (Word "look")) = 0.05283524652572603

Since this is a sampling from Wikipedia, there will be very few "action"
verbs, unless the sample accidentally contains articles about sports. A
"common sense" corpus, or a corpus that talks about what people do,
could/should improve the above verbs.  These are very basic to human
behavior, but are rare in most writing.

I'm thinking that a corpus of children's lit, and young-adult-lit would
be much better for these kinds of things.

An adjective.
 (cset-vec-cosine (Word "wide") (Word "narrow")) = 0.06262242910851494
 (cset-vec-cosine (Word "wide") (Word "look")) = 0.0
 (cset-vec-cosine (Word "wide") (Word "ball")) = 0.02449979787750126
 (cset-vec-cosine (Word "wide") (Word "the")) = 0.04718158900583385

 (cset-vec-cosine (Word "heavy") (Word "wide")) = 0.05752237416355278

Here's a set of antonyms!
 (cset-vec-cosine (Word "heavy") (Word "light")) = 0.16760038078849773

A pronoun
 (cset-vec-cosine (Word "ball") (Word "it")) = 0.009201177048960233
 (cset-vec-cosine (Word "wide") (Word "it")) = 0.005522960959398417

 (cset-vec-cosine (Word "the") (Word "it")) = 0.01926824360790382

Wow!! In English, "it" is usually a male!
 (cset-vec-cosine (Word "it") (Word "she")) = 0.1885493638629482
 (cset-vec-cosine (Word "it") (Word "he")) = 0.4527656594627214
 (cset-vec-cosine (Word "he") (Word "she")) = 0.1877589816088902

I can post the database on Monday, let me know when you're ready to
receive it.

--linas




--
Ben Goertzel, PhD
http://goertzel.org

"I am God! I am nothing, I'm play, I am freedom, I am life. I am the
boundary, I am the peak." -- Alexander Scriabin



--
You received this message because you are subscribed to the Google Groups
"link-grammar" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/link-grammar.
For more options, visit https://groups.google.com/d/optout.


--
You received this message because you are subscribed to the Google Groups 
"opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/opencog.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/opencog/c4100a00-4dc0-5391-1ebe-20969f4e7e8b%40gmail.com.
For more options, visit https://groups.google.com/d/optout.
