OK. Close coordination will be needed. I'm planning on creating a database with several different kinds of distance measures precomputed. This is feasible because the database is currently small enough.
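To make the plan concrete, here is a minimal Python sketch of the pipeline described in the quoted message below: pseudo-disjuncts are pulled from parses, counted per word, and a table of pairwise similarities is precomputed. It is not the actual OpenCog code. The parse representation (a token list plus a set of index pairs), the connector ordering, and the names `pseudo_disjuncts`, `count_disjuncts`, `cosine` and `precompute` are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def pseudo_disjuncts(words, links):
    """Yield (word, disjunct) pairs for one parsed sentence.

    words: list of tokens.
    links: set of (i, j) index pairs with i < j (hypothetical parse format).
    Each connector is the word at the far end of a link, marked '-' if
    that word is to the left and '+' if it is to the right.
    """
    for k, word in enumerate(words):
        left = [words[i] + "-" for i, j in sorted(links) if j == k]
        right = [words[j] + "+" for i, j in sorted(links) if i == k]
        if left or right:
            yield word, " ".join(left + right)


def count_disjuncts(parses):
    """Map each word to a Counter of its observed pseudo-disjuncts."""
    vectors = {}
    for words, links in parses:
        for word, disjunct in pseudo_disjuncts(words, links):
            vectors.setdefault(word, Counter())[disjunct] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors (Counters)."""
    dot = sum(count * v[d] for d, count in u.items() if d in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def precompute(vectors, measure=cosine):
    """Evaluate `measure` once for every unordered pair of words."""
    return {
        (a, b): measure(vectors[a], vectors[b])
        for a, b in combinations(sorted(vectors), 2)
    }


# The idealized example from the quoted message.
parses = [
    (["Ben", "ate", "pizza"], {(0, 1), (1, 2)}),
    (["Ben", "puked", "pizza"], {(0, 1), (1, 2)}),
]
vectors = count_disjuncts(parses)
table = precompute(vectors)
print(table[("ate", "puked")])
```

Passing a different function as `measure` would fill a second table with another distance, which is roughly what I mean by precomputing several kinds. The all-pairs loop is quadratic in the number of words, so this only works while the vocabulary stays small.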
Any favorite distance measures you might recommend, besides the cosine distance?

--linas

On Sat, May 6, 2017 at 10:25 PM, Ben Goertzel <[email protected]> wrote:

> Very cool!
>
> Ruiting should be ready to start playing w/ this data on Tuesday, I
> think...
>
> On Sun, May 7, 2017 at 4:11 AM, Linas Vepstas <[email protected]> wrote:
> > Ben, Ruiting,
> >
> > For your enjoyment: I have some very preliminary results on word
> > similarity. They look pretty nice, even though based on a fairly small
> > number of observations.
> >
> > If you've been watching TV instead of reading email, here's the story
> > so far: Starting from a large text corpus, the mutual information (MI)
> > of word-pairs is computed. This MI is used to perform a maximum
> > spanning-tree (MST) parse (of a different subset of) the corpus. From
> > each parse, a pseudo-disjunct is extracted for each word. The
> > pseudo-disjunct is like a real LG disjunct, except that each connector
> > in the disjunct is the word at the far end of the link.
> >
> > So, for example, in an idealized world, the MST parse of the sentence
> > "Ben ate pizza" would produce the parse Ben <--> ate <--> pizza and
> > from this, we can extract the pseudo-disjunct (Ben- pizza+) on the word
> > "ate". Similarly, the sentence "Ben puked pizza" should produce the
> > disjunct (Ben- pizza+) on the word "puked". Since these two disjuncts
> > are the same, we can conclude that the two words "ate" and "puked" are
> > very similar to each other. Considering all of the other disjuncts that
> > arise in this example, we can conclude that these are the only two
> > words that are similar.
> >
> > Note that a given word may have very many pseudo-disjuncts attached to
> > it. Each disjunct has a count of the number of times it has been
> > observed. Thus, this set of disjuncts can be imagined to be a vector in
> > a high-dimensional vector space, with each disjunct being a single
> > basis element. The similarity of two words can be taken to be the
> > cosine-similarity between the disjunct-vectors (or pick another,
> > different metric, as you please.)
> >
> > Below are a set of examples, for English, on a somewhat small dataset.
> > Collected over a few days, it contains just under half-a-million
> > observations of disjuncts, distributed across about 30K words. Thus,
> > most words will have only a couple of disjuncts on them, which may have
> > been seen only a couple of times. It's important, at this stage, to
> > limit oneself to only the most popular words.
> >
> > We expect the determiners "the" and "a" to be similar, and they are:
> > (cset-vec-cosine (Word "the") (Word "a")) = 0.1554007744026141
> >
> > Even more similar:
> > (cset-vec-cosine (Word "the") (Word "this")) = 0.3083725359820755
> >
> > Not very similar at all:
> > (cset-vec-cosine (Word "the") (Word "that")) = 0.01981486048876119
> >
> > Oh hey this and that are similar. Notice the triangle with "the".
> > (cset-vec-cosine (Word "this") (Word "that")) = 0.14342403062507977
> >
> > Some more results
> > (cset-vec-cosine (Word "this") (Word "these")) = 0.23100101197144984
> > (cset-vec-cosine (Word "this") (Word "those")) = 0.1099725424243773
> > (cset-vec-cosine (Word "these") (Word "those")) = 0.13577971016706158
> >
> > We expect determiners, nouns and verbs to all be very different from
> > one another. And they are:
> > (cset-vec-cosine (Word "the") (Word "ball")) = 2.3964597196461594e-4
> > (cset-vec-cosine (Word "the") (Word "jump")) = 0.0
> > (cset-vec-cosine (Word "ball") (Word "jump")) = 0.0
> >
> > We expect verbs to be similar, and they sort-of are.
> > (cset-vec-cosine (Word "run") (Word "jump")) = 0.05184758473652128
> > (cset-vec-cosine (Word "run") (Word "look")) = 0.05283524652572603
> >
> > Since this is a sampling from Wikipedia, there will be very few
> > "action" verbs, unless the sample accidentally contains articles about
> > sports. A "common sense" corpus, or a corpus that talks about what
> > people do, could/should improve the above verbs. These are very basic
> > to human behavior, but are rare in most writing.
> >
> > I'm thinking that a corpus of children's lit, and young-adult-lit would
> > be much better for these kinds of things.
> >
> > An adjective.
> > (cset-vec-cosine (Word "wide") (Word "narrow")) = 0.06262242910851494
> > (cset-vec-cosine (Word "wide") (Word "look")) = 0.0
> > (cset-vec-cosine (Word "wide") (Word "ball")) = 0.02449979787750126
> > (cset-vec-cosine (Word "wide") (Word "the")) = 0.04718158900583385
> >
> > (cset-vec-cosine (Word "heavy") (Word "wide")) = 0.05752237416355278
> >
> > Here's a set of antonyms!
> > (cset-vec-cosine (Word "heavy") (Word "light")) = 0.16760038078849773
> >
> > A pronoun
> > (cset-vec-cosine (Word "ball") (Word "it")) = 0.009201177048960233
> > (cset-vec-cosine (Word "wide") (Word "it")) = 0.005522960959398417
> >
> > (cset-vec-cosine (Word "the") (Word "it")) = 0.01926824360790382
> >
> > Wow!! In English, "it" is usually a male!
> > (cset-vec-cosine (Word "it") (Word "she")) = 0.1885493638629482
> > (cset-vec-cosine (Word "it") (Word "he")) = 0.4527656594627214
> > (cset-vec-cosine (Word "he") (Word "she")) = 0.1877589816088902
> >
> > I can post the database on Monday, let me know when you're ready to
> > receive it.
> >
> > --linas
>
> --
> Ben Goertzel, PhD
> http://goertzel.org
>
> "I am God! I am nothing, I'm play, I am freedom, I am life. I am the
> boundary, I am the peak." -- Alexander Scriabin
