Ben, Ruiting,

For your enjoyment: I have some very preliminary results on word similarity. They look pretty nice, even though they're based on a fairly small number of observations.
If you've been watching TV instead of reading email, here's the story so far: starting from a large text corpus, the mutual information (MI) of word-pairs is computed. This MI is used to perform a maximum spanning-tree (MST) parse of a different subset of the corpus. From each parse, a pseudo-disjunct is extracted for each word. The pseudo-disjunct is like a real LG disjunct, except that each connector in the disjunct is the word at the far end of the link. So, for example, in an idealized world, the MST parse of the sentence "Ben ate pizza" would produce the parse

   Ben <--> ate <--> pizza

and from this, we can extract the pseudo-disjunct (Ben- pizza+) on the word "ate". Similarly, the sentence "Ben puked pizza" should produce the disjunct (Ben- pizza+) on the word "puked". Since these two disjuncts are the same, we can conclude that the two words "ate" and "puked" are very similar to each other. Considering all of the other disjuncts that arise in this example, we can conclude that these are the only two words that are similar.

Note that a given word may have very many pseudo-disjuncts attached to it. Each disjunct carries a count of the number of times it has been observed. Thus, this set of disjuncts can be imagined to be a vector in a high-dimensional vector space, with each disjunct being a single basis element. The similarity of two words can be taken to be the cosine-similarity between the disjunct-vectors (or pick another, different metric, as you please.)

Below is a set of examples, for English, on a somewhat small dataset. Collected over a few days, it contains just under half-a-million observations of disjuncts, distributed across about 30K words. Thus, most words will have only a couple of disjuncts on them, which may have been seen only a couple of times. It's important, at this stage, to limit oneself to only the most popular words.

We expect the determiners "the" and "a" to be similar, and they are:

   (cset-vec-cosine (Word "the") (Word "a")) = 0.1554007744026141

Even more similar:

   (cset-vec-cosine (Word "the") (Word "this")) = 0.3083725359820755

Not very similar at all:

   (cset-vec-cosine (Word "the") (Word "that")) = 0.01981486048876119

Oh hey, "this" and "that" are similar. Notice the triangle with "the": it is similar to "this", and "this" is similar to "that", but "the" and "that" are not.

   (cset-vec-cosine (Word "this") (Word "that")) = 0.14342403062507977

Some more results:

   (cset-vec-cosine (Word "this") (Word "these")) = 0.23100101197144984
   (cset-vec-cosine (Word "this") (Word "those")) = 0.1099725424243773
   (cset-vec-cosine (Word "these") (Word "those")) = 0.13577971016706158

We expect determiners, nouns and verbs to all be very different from one another. And they are:

   (cset-vec-cosine (Word "the") (Word "ball")) = 2.3964597196461594e-4
   (cset-vec-cosine (Word "the") (Word "jump")) = 0.0
   (cset-vec-cosine (Word "ball") (Word "jump")) = 0.0

We expect verbs to be similar, and they sort-of are:

   (cset-vec-cosine (Word "run") (Word "jump")) = 0.05184758473652128
   (cset-vec-cosine (Word "run") (Word "look")) = 0.05283524652572603

Since this is a sampling from Wikipedia, there will be very few "action" verbs, unless the sample accidentally contains articles about sports. A "common sense" corpus, or a corpus that talks about what people do, could/should improve the scores for the above verbs. These verbs are very basic to human behavior, but are rare in most writing. I'm thinking that a corpus of children's lit and young-adult lit would be much better for these kinds of things.
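As an aside, in case you want to play with the numbers yourself: the cosine is just the ordinary inner-product formula applied to the disjunct counts. Below is a minimal stand-alone Guile sketch, under the assumption that each word's disjuncts have been exported as an alist of (disjunct . count) pairs. This is NOT the actual cset-vec-cosine code (that one pulls its counts straight out of the AtomSpace); it only illustrates the formula.

   (use-modules (srfi srfi-1))   ; for fold

   ;; Dot product of two sparse count-vectors, represented as alists
   ;; mapping a disjunct (here, just a string) to an observation count.
   ;; Disjuncts missing from vec-b are treated as having count zero.
   (define (dot-product vec-a vec-b)
     (fold
       (lambda (pr sum)
         (let ((other (assoc (car pr) vec-b)))
           (if other (+ sum (* (cdr pr) (cdr other))) sum)))
       0 vec-a))

   ;; Cosine similarity: dot product divided by the vector lengths.
   (define (cosine vec-a vec-b)
     (/ (dot-product vec-a vec-b)
        (sqrt (* (dot-product vec-a vec-a)
                 (dot-product vec-b vec-b)))))

   ;; Toy example: "ate" and "puked" share the (Ben- pizza+) disjunct.
   (define ate   '(("Ben- pizza+" . 3) ("he- quickly+" . 1)))
   (define puked '(("Ben- pizza+" . 2)))
   (cosine ate puked)   ; => 0.9486832980505138

The point of the sparse representation is that each word touches only a tiny handful of the basis elements, so the dot product only ever loops over the disjuncts a word actually has.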
(cset-vec-cosine (Word "wide") (Word "narrow")) = 0.06262242910851494 (cset-vec-cosine (Word "wide") (Word "look")) = 0.0 (cset-vec-cosine (Word "wide") (Word "ball")) = 0.02449979787750126 (cset-vec-cosine (Word "wide") (Word "the")) = 0.04718158900583385 (cset-vec-cosine (Word "heavy") (Word "wide")) = 0.05752237416355278 Here's a set of antonyms! (cset-vec-cosine (Word "heavy") (Word "light")) = 0.16760038078849773 A pronoun (cset-vec-cosine (Word "ball") (Word "it")) = 0.009201177048960233 (cset-vec-cosine (Word "wide") (Word "it")) = 0.005522960959398417 (cset-vec-cosine (Word "the") (Word "it")) = 0.01926824360790382 Wow!! In English, "it" is usually a male! (cset-vec-cosine (Word "it") (Word "she")) = 0.1885493638629482 (cset-vec-cosine (Word "it") (Word "he")) = 0.4527656594627214 (cset-vec-cosine (Word "he") (Word "she")) = 0.1877589816088902 I can post the database on mondy, let me know when you're ready to receive it. --linas -- You received this message because you are subscribed to the Google Groups "opencog" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To post to this group, send email to [email protected]. Visit this group at https://groups.google.com/group/opencog. To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/CAHrUA371BxbyuAGbJ7c7%3DJ8GVUR%2B9QNwxeUmWgHsP-BadfYD_A%40mail.gmail.com. For more options, visit https://groups.google.com/d/optout.
