Very cool! Ruiting should be ready to start playing w/ this data on Tuesday, I think...
On Sun, May 7, 2017 at 4:11 AM, Linas Vepstas <[email protected]> wrote:
> Ben, Ruiting,
>
> For your enjoyment: I have some very preliminary results on word similarity.
> They look pretty nice, even though based on a fairly small number of
> observations.
>
> If you've been watching TV instead of reading email, here's the story so
> far: Starting from a large text corpus, the mutual information (MI) of
> word-pairs is counted. This MI is used to perform a maximum spanning-tree
> (MST) parse of (a different subset of) the corpus. From each parse, a
> pseudo-disjunct is extracted for each word. The pseudo-disjunct is like a
> real LG disjunct, except that each connector in the disjunct is the word at
> the far end of the link.
>
> So, for example, in an idealized world, the MST parse of the sentence "Ben
> ate pizza" would produce the parse Ben <--> ate <--> pizza, and from this
> we can extract the pseudo-disjunct (Ben- pizza+) on the word "ate".
> Similarly, the sentence "Ben puked pizza" should produce the disjunct (Ben-
> pizza+) on the word "puked". Since these two disjuncts are the same, we can
> conclude that the two words "ate" and "puked" are very similar to each other.
> Considering all of the other disjuncts that arise in this example, we can
> conclude that these are the only two words that are similar.
>
> Note that a given word may have very many pseudo-disjuncts attached to it.
> Each disjunct has a count of the number of times it has been observed.
> Thus, this set of disjuncts can be imagined to be a vector in a
> high-dimensional vector space, with each disjunct being a single basis
> element. The similarity of two words can be taken to be the
> cosine-similarity between the disjunct-vectors (or pick another, different
> metric, as you please).
>
> Below are a set of examples, for English, on a somewhat small
> dataset. Collected over a few days, it contains just under half a million
> observations of disjuncts, distributed across about 30K words.
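[The pipeline described above — MST parse over word-pair MI, then pseudo-disjunct extraction — can be sketched roughly as follows. This is a toy Python illustration, not the actual OpenCog code; the MI values and function names are invented for the example.]

```python
# Toy sketch: greedy (Prim-style) maximum-spanning-tree parse over
# word-pair MI, then pseudo-disjunct extraction. Hypothetical code;
# the real pipeline lives in the OpenCog language-learning system.
from itertools import combinations

def mst_parse(words, mi):
    """Repeatedly add the highest-MI link joining a new word to the tree."""
    edges = sorted(combinations(range(len(words)), 2),
                   key=lambda e: mi.get((words[e[0]], words[e[1]]), float("-inf")),
                   reverse=True)
    connected, links = {0}, []
    while len(connected) < len(words):
        for i, j in edges:
            if (i in connected) != (j in connected):  # edge crosses the frontier
                links.append((i, j))
                connected |= {i, j}
                break
    return links

def pseudo_disjuncts(words, links):
    """For each word, list its link partners: 'w-' for a partner to the
    left, 'w+' for a partner to the right (the connector is the far word)."""
    out = {}
    for k, w in enumerate(words):
        conns = [words[i] + '-' for i, j in links if j == k] \
              + [words[j] + '+' for i, j in links if i == k]
        out[w] = '(' + ' '.join(conns) + ')'
    return out

words = ["Ben", "ate", "pizza"]
mi = {("Ben", "ate"): 3.1, ("ate", "pizza"): 2.7, ("Ben", "pizza"): 0.2}
links = mst_parse(words, mi)
print(pseudo_disjuncts(words, links)["ate"])  # → (Ben- pizza+)
```

[With the invented MI values above, the tree links Ben<-->ate<-->pizza, and the word "ate" gets the pseudo-disjunct (Ben- pizza+), as in the email.]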
> Thus, most
> words will have only a couple of disjuncts on them, which may have been seen
> only a couple of times. It's important, at this stage, to limit oneself to
> only the most popular words.
>
> We expect the determiners "the" and "a" to be similar, and they are:
> (cset-vec-cosine (Word "the") (Word "a")) = 0.1554007744026141
>
> Even more similar:
> (cset-vec-cosine (Word "the") (Word "this")) = 0.3083725359820755
>
> Not very similar at all:
> (cset-vec-cosine (Word "the") (Word "that")) = 0.01981486048876119
>
> Oh hey, "this" and "that" are similar. Notice the triangle with "the".
> (cset-vec-cosine (Word "this") (Word "that")) = 0.14342403062507977
>
> Some more results:
> (cset-vec-cosine (Word "this") (Word "these")) = 0.23100101197144984
> (cset-vec-cosine (Word "this") (Word "those")) = 0.1099725424243773
> (cset-vec-cosine (Word "these") (Word "those")) = 0.13577971016706158
>
> We expect determiners, nouns and verbs to all be very different
> from one another. And they are:
> (cset-vec-cosine (Word "the") (Word "ball")) = 2.3964597196461594e-4
> (cset-vec-cosine (Word "the") (Word "jump")) = 0.0
> (cset-vec-cosine (Word "ball") (Word "jump")) = 0.0
>
> We expect verbs to be similar, and they sort of are:
> (cset-vec-cosine (Word "run") (Word "jump")) = 0.05184758473652128
> (cset-vec-cosine (Word "run") (Word "look")) = 0.05283524652572603
>
> Since this is a sampling from Wikipedia, there will be very few "action"
> verbs, unless the sample accidentally contains articles about sports. A
> "common sense" corpus, or a corpus that talks about what people do,
> could/should improve the above verbs. These are very basic to human
> behavior, but are rare in most writing.
>
> I'm thinking that a corpus of children's lit and young-adult lit would be
> much better for these kinds of things.
>
> An adjective:
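[The `cset-vec-cosine` numbers quoted above are cosine similarities between sparse disjunct-count vectors. The following Python sketch shows the computation on made-up counts; the disjuncts and counts are invented for illustration and are not from the actual dataset.]

```python
# Sketch of what a call like (cset-vec-cosine (Word "the") (Word "a"))
# computes: cosine similarity between two sparse vectors that map each
# observed pseudo-disjunct to its count. Counts here are invented.
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two sparse count vectors (dicts)."""
    dot = sum(n * v.get(d, 0) for d, n in u.items())
    norm = lambda w: sqrt(sum(n * n for n in w.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

# Each word's vector: pseudo-disjunct -> observation count.
the = {"(ball+)": 12, "(dog+)": 7, "(ate- ball+)": 3}
a   = {"(ball+)": 5,  "(cat+)": 4}

print(cosine(the, a))        # nonzero: the two words share a disjunct
print(cosine(the, the))      # → 1.0 (a word is identical to itself)
```

[Words sharing no disjuncts get a cosine of exactly 0.0, which is why the "the"/"jump" and "ball"/"jump" pairs above come out as zero on a small dataset.]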
> (cset-vec-cosine (Word "wide") (Word "narrow")) = 0.06262242910851494
> (cset-vec-cosine (Word "wide") (Word "look")) = 0.0
> (cset-vec-cosine (Word "wide") (Word "ball")) = 0.02449979787750126
> (cset-vec-cosine (Word "wide") (Word "the")) = 0.04718158900583385
>
> (cset-vec-cosine (Word "heavy") (Word "wide")) = 0.05752237416355278
>
> Here's a set of antonyms!
> (cset-vec-cosine (Word "heavy") (Word "light")) = 0.16760038078849773
>
> A pronoun:
> (cset-vec-cosine (Word "ball") (Word "it")) = 0.009201177048960233
> (cset-vec-cosine (Word "wide") (Word "it")) = 0.005522960959398417
>
> (cset-vec-cosine (Word "the") (Word "it")) = 0.01926824360790382
>
> Wow!! In English, "it" is usually a male!
> (cset-vec-cosine (Word "it") (Word "she")) = 0.1885493638629482
> (cset-vec-cosine (Word "it") (Word "he")) = 0.4527656594627214
> (cset-vec-cosine (Word "he") (Word "she")) = 0.1877589816088902
>
> I can post the database on Monday; let me know when you're ready to
> receive it.
>
> --linas

-- 
Ben Goertzel, PhD
http://goertzel.org

"I am God! I am nothing, I'm play, I am freedom, I am life. I am the
boundary, I am the peak." -- Alexander Scriabin

To view this discussion on the web visit
https://groups.google.com/d/msgid/opencog/CACYTDBfJTSJUACyqKWTz7R0TsNGhgFuL59RUChZYj%2BBfnjCW5w%40mail.gmail.com.
