Hmm... it's hard to know which distance measure is best without playing
with the data first.
Jaccard and Tanimoto similarity (the latter not quite corresponding to a
metric) may be useful, I dunno...
https://en.wikipedia.org/wiki/Jaccard_index#Generalized_Jaccard_similarity_and_distance
(A toy sketch of the weighted Jaccard over the disjunct vectors is at the
bottom of this mail.)

Some clustering methods will be able to use these precomputed distances;
others (like NN-based methods) sorta compute distances in the midst of
doing their other stuff anyway...

On Sun, May 7, 2017 at 3:22 PM, Linas Vepstas <[email protected]> wrote:
> OK. Close coordination will be needed. I'm planning on creating a
> database with several different kinds of distance measures precomputed.
> This is possible because the database is currently small enough.
>
> Any favorite distance measures you might recommend, besides the cosine
> distance?
>
> --linas
>
> On Sat, May 6, 2017 at 10:25 PM, Ben Goertzel <[email protected]> wrote:
>>
>> Very cool!
>>
>> Ruiting should be ready to start playing w/ this data on Tuesday, I
>> think...
>>
>> On Sun, May 7, 2017 at 4:11 AM, Linas Vepstas <[email protected]>
>> wrote:
>> > Ben, Ruiting,
>> >
>> > For your enjoyment: I have some very preliminary results on word
>> > similarity. They look pretty nice, even though based on a fairly
>> > small number of observations.
>> >
>> > If you've been watching TV instead of reading email, here's the
>> > story so far: starting from a large text corpus, the mutual
>> > information (MI) of word-pairs is counted. This MI is used to
>> > perform a maximum spanning-tree (MST) parse of (a different subset
>> > of) the corpus. From each parse, a pseudo-disjunct is extracted for
>> > each word. The pseudo-disjunct is like a real LG disjunct, except
>> > that each connector in the disjunct is the word at the far end of
>> > the link.
>> >
>> > So, for example, in an idealized world, the MST parse of the
>> > sentence "Ben ate pizza" would produce the parse
>> > Ben <--> ate <--> pizza, and from this we can extract the
>> > pseudo-disjunct (Ben- pizza+) on the word "ate". Similarly, the
>> > sentence "Ben puked pizza" should produce the disjunct (Ben- pizza+)
>> > on the word "puked". Since these two disjuncts are the same, we can
>> > conclude that the two words "ate" and "puked" are very similar to
>> > each other. Considering all of the other disjuncts that arise in
>> > this example, we can conclude that these are the only two words
>> > that are similar.
>> >
>> > Note that a given word may have very many pseudo-disjuncts attached
>> > to it. Each disjunct has a count of the number of times it has been
>> > observed. Thus, this set of disjuncts can be imagined to be a vector
>> > in a high-dimensional vector space, with each disjunct being a
>> > single basis element. The similarity of two words can be taken to be
>> > the cosine similarity between the disjunct-vectors (or pick another,
>> > different metric, as you please.)
>> >
>> > Below are some examples, for English, on a somewhat small dataset.
>> > Collected over a few days, it contains just under half a million
>> > observations of disjuncts, distributed across about 30K words. Thus,
>> > most words will have only a couple of disjuncts on them, which may
>> > have been seen only a couple of times. It's important, at this
>> > stage, to limit oneself to only the most popular words.
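[To make the vector idea concrete, here's a minimal guile sketch of that
cosine computation. The alist representation, the invented counts, and
the names `dot', `norm', and `cosine' are mine for illustration -- this
is not the actual cset-vec-cosine implementation, just the idea behind
it:]

  (use-modules (srfi srfi-1))  ; fold

  ;; Toy disjunct-vectors: alists mapping a pseudo-disjunct (here just a
  ;; string) to its observation count. Counts invented for illustration.
  (define ate-vec   '(("Ben- pizza+" . 3) ("Ben- quickly+" . 1)))
  (define puked-vec '(("Ben- pizza+" . 2)))

  ;; Dot product, summed over the disjuncts the two vectors share.
  (define (dot va vb)
    (fold (lambda (pr sum)
            (let ((m (assoc (car pr) vb)))
              (if m (+ sum (* (cdr pr) (cdr m))) sum)))
          0 va))

  ;; Euclidean length of a disjunct-vector.
  (define (norm v) (sqrt (dot v v)))

  ;; Cosine similarity: dot product divided by the product of lengths.
  (define (cosine va vb)
    (/ (dot va vb) (* (norm va) (norm vb))))

  (cosine ate-vec puked-vec)
  ;; => ~0.949 -- high, since both words share the (Ben- pizza+) disjunct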
>> > We expect the determiners "the" and "a" to be similar, and they are:
>> > (cset-vec-cosine (Word "the") (Word "a")) = 0.1554007744026141
>> >
>> > Even more similar:
>> > (cset-vec-cosine (Word "the") (Word "this")) = 0.3083725359820755
>> >
>> > Not very similar at all:
>> > (cset-vec-cosine (Word "the") (Word "that")) = 0.01981486048876119
>> >
>> > Oh hey, "this" and "that" are similar. Notice the triangle with "the":
>> > (cset-vec-cosine (Word "this") (Word "that")) = 0.14342403062507977
>> >
>> > Some more results:
>> > (cset-vec-cosine (Word "this") (Word "these")) = 0.23100101197144984
>> > (cset-vec-cosine (Word "this") (Word "those")) = 0.1099725424243773
>> > (cset-vec-cosine (Word "these") (Word "those")) = 0.13577971016706158
>> >
>> > We expect determiners, nouns, and verbs to all be very different
>> > from one another. And they are:
>> > (cset-vec-cosine (Word "the") (Word "ball")) = 2.3964597196461594e-4
>> > (cset-vec-cosine (Word "the") (Word "jump")) = 0.0
>> > (cset-vec-cosine (Word "ball") (Word "jump")) = 0.0
>> >
>> > We expect verbs to be similar, and they sort of are:
>> > (cset-vec-cosine (Word "run") (Word "jump")) = 0.05184758473652128
>> > (cset-vec-cosine (Word "run") (Word "look")) = 0.05283524652572603
>> >
>> > Since this is a sampling from Wikipedia, there will be very few
>> > "action" verbs, unless the sample accidentally contains articles
>> > about sports. A "common sense" corpus, or a corpus that talks about
>> > what people do, could/should improve the above verb scores. These
>> > verbs are very basic to human behavior, but are rare in most writing.
>> >
>> > I'm thinking that a corpus of children's lit and young-adult lit
>> > would be much better for these kinds of things.
>> >
>> > An adjective:
>> > (cset-vec-cosine (Word "wide") (Word "narrow")) = 0.06262242910851494
>> > (cset-vec-cosine (Word "wide") (Word "look")) = 0.0
>> > (cset-vec-cosine (Word "wide") (Word "ball")) = 0.02449979787750126
>> > (cset-vec-cosine (Word "wide") (Word "the")) = 0.04718158900583385
>> >
>> > (cset-vec-cosine (Word "heavy") (Word "wide")) = 0.05752237416355278
>> >
>> > Here's a pair of antonyms:
>> > (cset-vec-cosine (Word "heavy") (Word "light")) = 0.16760038078849773
>> >
>> > A pronoun:
>> > (cset-vec-cosine (Word "ball") (Word "it")) = 0.009201177048960233
>> > (cset-vec-cosine (Word "wide") (Word "it")) = 0.005522960959398417
>> >
>> > (cset-vec-cosine (Word "the") (Word "it")) = 0.01926824360790382
>> >
>> > Wow!! In English, "it" is usually a male!
>> > (cset-vec-cosine (Word "it") (Word "she")) = 0.1885493638629482
>> > (cset-vec-cosine (Word "it") (Word "he")) = 0.4527656594627214
>> > (cset-vec-cosine (Word "he") (Word "she")) = 0.1877589816088902
>> >
>> > I can post the database on Monday; let me know when you're ready to
>> > receive it.
>> >
>> > --linas
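PS -- re the weighted Jaccard I mentioned at the top: over the same toy
alist vectors from the sketch above, the generalized form on that wiki
page (sum of coordinate-wise minima over sum of coordinate-wise maxima)
would look something like this. The helper names `count' and
`weighted-jaccard' are again just mine for illustration:

  (use-modules (srfi srfi-1))  ; delete-duplicates

  ;; Look up a disjunct's count, defaulting to zero when absent.
  (define (count v key)
    (let ((m (assoc key v))) (if m (cdr m) 0)))

  ;; Generalized (weighted) Jaccard similarity:
  ;;   sum_k min(a_k, b_k) / sum_k max(a_k, b_k)
  (define (weighted-jaccard va vb)
    (let ((keys (delete-duplicates
                  (append (map car va) (map car vb)))))
      (/ (apply + (map (lambda (k) (min (count va k) (count vb k))) keys))
         (apply + (map (lambda (k) (max (count va k) (count vb k))) keys)))))

  ;; For ate-vec and puked-vec: min(3,2) + min(1,0) = 2 over
  ;; max(3,2) + max(1,0) = 4.
  (weighted-jaccard ate-vec puked-vec)  ; => 1/2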
--
Ben Goertzel, PhD
http://goertzel.org

"I am God! I am nothing, I'm play, I am freedom, I am life.
I am the boundary, I am the peak." -- Alexander Scriabin
