Hi Nil, this is very off-topic, but it illustrates the problem of coding in C++: it's hard and sometimes impossible to untangle algorithms from data structures. This point is made clearly by the Bondi language, an experimental language that tries to cleanly separate the two.

Historically, Lisp/Scheme were much better at separating algorithms from data, which is why they were popular in early AI attempts, and in early web shopping carts. The whole point of adding templates to C++ was to at least partly solve this problem, but C++ templates remain hard to use in any but the very simplest situations. Basically, C++ templates are like a badly-broken, hard-to-use version of Lisp (but with types, so I guess more like Haskell/Caml). Another common solution for OO programming in Python and C++ is the "visitor pattern", which you don't see in Lisp or Scheme, because everything is a visitor there. Visitors show up in chapter 2 of SICP; they are so basic that they are not even given a special name.

In my case, the different counts for different atoms are stored in different places, and they're never vectors; they're usually just random sets of gorp that earlier layers generated. My quick-hack, non-generic-programming approach is here:

Banach lp-distance:
https://github.com/opencog/opencog/blob/master/opencog/nlp/learn/pseudo-csets.scm#L217-L234

vector product:
https://github.com/opencog/opencog/blob/master/opencog/nlp/learn/pseudo-csets.scm#L386-L414

Both are more complicated than they need to be, because the indicated data item might not exist; e.g. only one in a trillion possible disjuncts will ever exist. So these "vectors" are actually unordered sets, and they are extremely sparse.
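To make the sparsity handling concrete, here is a minimal sketch of the lp-distance, pretending that the counts have already been pulled out of the atomspace into association lists of (basis . count) pairs, with a missing entry meaning zero. The alist representation and all the names here are made up for illustration; the actual code linked above walks the atoms directly.

  (use-modules (srfi srfi-1))

  ; Sketch only: lp-distance between two sparse vectors, each given as
  ; an alist of (basis . count) pairs.  A missing entry is an implicit
  ; zero -- that is exactly the sparsity that the real code must handle.
  (define (sparse-lp-dist p va vb)
    ; The union of basis elements appearing in either vector.
    (define basis (delete-duplicates (append (map car va) (map car vb))))
    ; Look up a count, defaulting to zero when the entry doesn't exist.
    (define (get v key) (or (and=> (assoc key v) cdr) 0))
    (expt
      (fold (lambda (key sum)
              (+ sum (expt (abs (- (get va key) (get vb key))) p)))
            0 basis)
      (/ 1 p)))

  ; Example: the euclidean (p=2) distance between two "vectors" that
  ; overlap only on the basis element b:
  ;   (sparse-lp-dist 2 '((a . 1) (b . 2)) '((b . 3) (c . 4)))
  ;   => 4.242640687119285   i.e. (sqrt (+ 1 1 16))

A generic-programming version would let the lookup function (here, "get") be swapped out per data source, which is precisely what is hard to express in C++ templates without a fight.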
--linas

On Sun, May 7, 2017 at 3:21 AM, 'Nil Geisweiller' via opencog
<[email protected]> wrote:

> There are a few distances defined in cogutils (if that helps)
>
> https://github.com/opencog/cogutils/blob/master/opencog/util/jaccard_index.h#L35
>
> https://github.com/opencog/cogutils/blob/master/opencog/util/numeric.h#L405
>
> https://github.com/opencog/cogutils/blob/master/opencog/util/numeric.h#L472
>
> https://github.com/opencog/cogutils/blob/master/opencog/util/numeric.h#L509
>
> Nil
>
>
> On 05/07/2017 10:27 AM, Ben Goertzel wrote:
>
>> Hmm.. it's hard to know what distance measure is best without playing
>> w. the data first
>>
>> Jaccard and Tanimoto similarity (the latter not quite corresponding to
>> a metric) may be useful, I dunno...
>>
>> https://en.wikipedia.org/wiki/Jaccard_index#Generalized_Jaccard_similarity_and_distance
>>
>> Some clustering methods will be able to use these precomputed
>> distances; others (like NN-based methods) sorta compute distances in
>> the midst of doing their other stuff anyway...
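For what it's worth, the generalized Jaccard similarity that Ben links to above is easy to state for these sparse count vectors: the sum of the element-wise minima divided by the sum of the element-wise maxima. Here's a rough sketch, reusing the made-up alist convention from my sketch above; this is not the cogutils implementation.

  (use-modules (srfi srfi-1))

  ; Sketch: generalized Jaccard similarity over sparse count vectors,
  ;   J(a,b) = (sum_i min(a_i, b_i)) / (sum_i max(a_i, b_i))
  (define (sparse-jaccard va vb)
    (define basis (delete-duplicates (append (map car va) (map car vb))))
    (define (get v key) (or (and=> (assoc key v) cdr) 0))
    (/ (fold (lambda (k s) (+ s (min (get va k) (get vb k)))) 0 basis)
       (fold (lambda (k s) (+ s (max (get va k) (get vb k)))) 0 basis)))

  ; (sparse-jaccard '((a . 1) (b . 2)) '((b . 3) (c . 4)))
  ;   => 1/4
  ; Only the shared basis element b contributes to the numerator, so
  ; for very sparse vectors the numerator is usually tiny.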
>> On Sun, May 7, 2017 at 3:22 PM, Linas Vepstas <[email protected]>
>> wrote:
>>
>>> OK. Close coordination will be needed. I'm planning on creating a
>>> database with several different kinds of distance measures
>>> precomputed. This is feasible because the database is currently
>>> small enough.
>>>
>>> Any favorite distance measures you might recommend, besides the
>>> cosine distance?
>>>
>>> --linas
>>>
>>> On Sat, May 6, 2017 at 10:25 PM, Ben Goertzel <[email protected]> wrote:
>>>
>>>> Very cool!
>>>>
>>>> Ruiting should be ready to start playing w/ this data on Tuesday, I
>>>> think...
>>>>
>>>> On Sun, May 7, 2017 at 4:11 AM, Linas Vepstas <[email protected]>
>>>> wrote:
>>>>
>>>>> Ben, Ruiting,
>>>>>
>>>>> For your enjoyment: I have some very preliminary results on word
>>>>> similarity. They look pretty nice, even though based on a fairly
>>>>> small number of observations.
>>>>>
>>>>> If you've been watching TV instead of reading email, here's the
>>>>> story so far: Starting from a large text corpus, the mutual
>>>>> information (MI) of word-pairs is counted. This MI is used to
>>>>> perform a maximum spanning-tree (MST) parse of (a different subset
>>>>> of) the corpus. From each parse, a pseudo-disjunct is extracted for
>>>>> each word. The pseudo-disjunct is like a real LG disjunct, except
>>>>> that each connector in the disjunct is the word at the far end of
>>>>> the link.
>>>>>
>>>>> So, for example, in an idealized world, the MST parse of the
>>>>> sentence "Ben ate pizza" would produce the parse
>>>>> Ben <--> ate <--> pizza, and from this we can extract the
>>>>> pseudo-disjunct (Ben- pizza+) on the word "ate". Similarly, the
>>>>> sentence "Ben puked pizza" should produce the disjunct (Ben- pizza+)
>>>>> on the word "puked". Since these two disjuncts are the same, we can
>>>>> conclude that the two words "ate" and "puked" are very similar to
>>>>> each other. Considering all of the other disjuncts that arise in
>>>>> this example, we can conclude that these are the only two words
>>>>> that are similar.
>>>>>
>>>>> Note that a given word may have very many pseudo-disjuncts attached
>>>>> to it. Each disjunct has a count of the number of times it has been
>>>>> observed. Thus, this set of disjuncts can be imagined to be a
>>>>> vector in a high-dimensional vector space, with each disjunct being
>>>>> a single basis element. The similarity of two words can then be
>>>>> taken to be the cosine-similarity between the disjunct-vectors (or
>>>>> pick another metric, as you please).
>>>>>
>>>>> Below are a set of examples, for English, on a somewhat small
>>>>> dataset. Collected over a few days, it contains just under
>>>>> half-a-million observations of disjuncts, distributed across about
>>>>> 30K words. Thus, most words will have only a couple of disjuncts on
>>>>> them, which may have been seen only a couple of times. It's
>>>>> important, at this stage, to limit oneself to only the most popular
>>>>> words.
>>>>>
>>>>> We expect the determiners "the" and "a" to be similar, and they are:
>>>>> (cset-vec-cosine (Word "the") (Word "a")) = 0.1554007744026141
>>>>>
>>>>> Even more similar:
>>>>> (cset-vec-cosine (Word "the") (Word "this")) = 0.3083725359820755
>>>>>
>>>>> Not very similar at all:
>>>>> (cset-vec-cosine (Word "the") (Word "that")) = 0.01981486048876119
>>>>>
>>>>> Oh hey, "this" and "that" are similar. Notice the triangle with
>>>>> "the":
>>>>> (cset-vec-cosine (Word "this") (Word "that")) = 0.14342403062507977
>>>>>
>>>>> Some more results:
>>>>> (cset-vec-cosine (Word "this") (Word "these")) = 0.23100101197144984
>>>>> (cset-vec-cosine (Word "this") (Word "those")) = 0.1099725424243773
>>>>> (cset-vec-cosine (Word "these") (Word "those")) = 0.13577971016706158
>>>>>
>>>>> We expect determiners, nouns and verbs to all be very different
>>>>> from one another. And they are:
>>>>> (cset-vec-cosine (Word "the") (Word "ball")) = 2.3964597196461594e-4
>>>>> (cset-vec-cosine (Word "the") (Word "jump")) = 0.0
>>>>> (cset-vec-cosine (Word "ball") (Word "jump")) = 0.0
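To spell out the cosine-similarity behind these numbers: it is the ordinary dot-product divided by the product of the norms, applied to the sparse disjunct-count vectors. Below is a toy sketch in the same made-up alist style as before; the real thing is cset-vec-cosine in the pseudo-csets.scm file linked at the top of this message.

  (use-modules (srfi srfi-1))

  ; Sketch: cosine similarity between two sparse count vectors,
  ;   cos(a,b) = (a . b) / (|a| |b|)
  (define (sparse-dot va vb)
    ; Iterate over one vector; entries missing from the other are zero.
    (fold (lambda (pr sum)
            (let ((other (assoc (car pr) vb)))
              (if other (+ sum (* (cdr pr) (cdr other))) sum)))
          0 va))

  (define (sparse-cosine va vb)
    (/ (sparse-dot va vb)
       (sqrt (* (sparse-dot va va) (sparse-dot vb vb)))))

  ; A toy version of the "ate"/"puked" example: two words whose
  ; disjunct-count vectors overlap on the basis element (Ben- pizza+).
  ;   (sparse-cosine '(((Ben- pizza+) . 3) ((she- pizza+) . 1))
  ;                  '(((Ben- pizza+) . 2)))
  ;   => 0.9486832980505138   quite similar, as hoped.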
>>>>> (cset-vec-cosine (Word "run") (Word "jump")) = 0.05184758473652128 >>>>> (cset-vec-cosine (Word "run") (Word "look")) = 0.05283524652572603 >>>>> >>>>> Since this is a sampling from wikipedia, there will be very few >>>>> "action" >>>>> verbs, unless the sample accidentally contains articles about sports. A >>>>> "common sense" corpus, or a corpus that talks about what people do, >>>>> could/should improve the above verbs. These are very basic to human >>>>> behavior, but are rare in most writing. >>>>> >>>>> I'm thinking that a corpus of children's lit, and young-adult-lit would >>>>> be >>>>> much better for these kinds of things. >>>>> >>>>> An adjective. >>>>> (cset-vec-cosine (Word "wide") (Word "narrow")) = 0.06262242910851494 >>>>> (cset-vec-cosine (Word "wide") (Word "look")) = 0.0 >>>>> (cset-vec-cosine (Word "wide") (Word "ball")) = 0.02449979787750126 >>>>> (cset-vec-cosine (Word "wide") (Word "the")) = 0.04718158900583385 >>>>> >>>>> (cset-vec-cosine (Word "heavy") (Word "wide")) = 0.05752237416355278 >>>>> >>>>> Here's a set of antonyms! >>>>> (cset-vec-cosine (Word "heavy") (Word "light")) = 0.16760038078849773 >>>>> >>>>> A pronoun >>>>> (cset-vec-cosine (Word "ball") (Word "it")) = 0.009201177048960233 >>>>> (cset-vec-cosine (Word "wide") (Word "it")) = 0.005522960959398417 >>>>> >>>>> (cset-vec-cosine (Word "the") (Word "it")) = 0.01926824360790382 >>>>> >>>>> Wow!! In English, "it" is usually a male! >>>>> (cset-vec-cosine (Word "it") (Word "she")) = 0.1885493638629482 >>>>> (cset-vec-cosine (Word "it") (Word "he")) = 0.4527656594627214 >>>>> (cset-vec-cosine (Word "he") (Word "she")) = 0.1877589816088902 >>>>> >>>>> I can post the database on mondy, let me know when you're ready to >>>>> receive it. >>>>> >>>>> --linas >>>>> >>>>> >>>>> >>>> >>>> >>>> -- >>>> Ben Goertzel, PhD >>>> http://goertzel.org >>>> >>>> "I am God! I am nothing, I'm play, I am freedom, I am life. I am the >>>> boundary, I am the peak." -- Alexander Scriabin >>>> >>> >>> >>> -- >>> You received this message because you are subscribed to the Google Groups >>> "link-grammar" group. >>> To unsubscribe from this group and stop receiving emails from it, send an >>> email to [email protected]. >>> To post to this group, send email to [email protected]. >>> Visit this group at https://groups.google.com/group/link-grammar. >>> For more options, visit https://groups.google.com/d/optout. >>> >> >> >> >> > -- > You received this message because you are subscribed to the Google Groups > "opencog" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to [email protected]. > To post to this group, send email to [email protected]. > Visit this group at https://groups.google.com/group/opencog. > To view this discussion on the web visit https://groups.google.com/d/ms > gid/opencog/c4100a00-4dc0-5391-1ebe-20969f4e7e8b%40gmail.com. > > For more options, visit https://groups.google.com/d/optout. > -- You received this message because you are subscribed to the Google Groups "opencog" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To post to this group, send email to [email protected]. Visit this group at https://groups.google.com/group/opencog. To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/CAHrUA378zHSM0TcFmaOJ7K0K%2BugR5V56iMrFKrd6Yub3rr3pdw%40mail.gmail.com. For more options, visit https://groups.google.com/d/optout.
