Hmm... it's hard to know which distance measure is best without playing
with the data first.
Jaccard and Tanimoto similarity (the latter not quite corresponding to a
metric) may be useful, I dunno...
https://en.wikipedia.org/wiki/Jaccard_index#Generalized_Jaccard_similarity_and_distance
(A toy sketch of the weighted Jaccard over the disjunct vectors is at the
bottom of this mail.)

Some clustering methods will be able to use these precomputed distances;
others (like NN-based methods) sorta compute distances in the midst of
doing their other stuff anyway...

On Sun, May 7, 2017 at 3:22 PM, Linas Vepstas <[email protected]> wrote:
> OK. Close coordination will be needed. I'm planning on creating a
> database with several different kinds of distance measures precomputed.
> This is possible because the database is currently small enough.
>
> Any favorite distance measures you might recommend, besides the cosine
> distance?
>
> --linas
>
> On Sat, May 6, 2017 at 10:25 PM, Ben Goertzel <[email protected]> wrote:
>>
>> Very cool!
>>
>> Ruiting should be ready to start playing w/ this data on Tuesday, I
>> think...
>>
>> On Sun, May 7, 2017 at 4:11 AM, Linas Vepstas <[email protected]>
>> wrote:
>> > Ben, Ruiting,
>> >
>> > For your enjoyment: I have some very preliminary results on word
>> > similarity. They look pretty nice, even though based on a fairly
>> > small number of observations.
>> >
>> > If you've been watching TV instead of reading email, here's the
>> > story so far: starting from a large text corpus, the mutual
>> > information (MI) of word-pairs is counted. This MI is used to
>> > perform a maximum spanning-tree (MST) parse of (a different subset
>> > of) the corpus. From each parse, a pseudo-disjunct is extracted for
>> > each word. The pseudo-disjunct is like a real LG disjunct, except
>> > that each connector in the disjunct is the word at the far end of
>> > the link.
>> >
>> > So, for example, in an idealized world, the MST parse of the
>> > sentence "Ben ate pizza" would produce the parse
>> > Ben <--> ate <--> pizza, and from this we can extract the
>> > pseudo-disjunct (Ben- pizza+) on the word "ate". Similarly, the
>> > sentence "Ben puked pizza" should produce the disjunct (Ben- pizza+)
>> > on the word "puked". Since these two disjuncts are the same, we can
>> > conclude that the two words "ate" and "puked" are very similar to
>> > each other. Considering all of the other disjuncts that arise in
>> > this example, we can conclude that these are the only two words
>> > that are similar.
>> >
>> > Note that a given word may have very many pseudo-disjuncts attached
>> > to it. Each disjunct has a count of the number of times it has been
>> > observed. Thus, this set of disjuncts can be imagined to be a vector
>> > in a high-dimensional vector space, with each disjunct being a
>> > single basis element. The similarity of two words can be taken to be
>> > the cosine similarity between the disjunct-vectors (or pick another,
>> > different metric, as you please.)
>> >
>> > Below are some examples, for English, on a somewhat small dataset.
>> > Collected over a few days, it contains just under half a million
>> > observations of disjuncts, distributed across about 30K words. Thus,
>> > most words will have only a couple of disjuncts on them, which may
>> > have been seen only a couple of times. It's important, at this
>> > stage, to limit oneself to only the most popular words.
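[To make the vector idea concrete, here's a minimal guile sketch of that
cosine computation. The alist representation, the invented counts, and
the names `dot', `norm', and `cosine' are mine for illustration -- this
is not the actual cset-vec-cosine implementation, just the idea behind
it:]

  (use-modules (srfi srfi-1))  ; fold

  ;; Toy disjunct-vectors: alists mapping a pseudo-disjunct (here just a
  ;; string) to its observation count. Counts invented for illustration.
  (define ate-vec   '(("Ben- pizza+" . 3) ("Ben- quickly+" . 1)))
  (define puked-vec '(("Ben- pizza+" . 2)))

  ;; Dot product, summed over the disjuncts the two vectors share.
  (define (dot va vb)
    (fold (lambda (pr sum)
            (let ((m (assoc (car pr) vb)))
              (if m (+ sum (* (cdr pr) (cdr m))) sum)))
          0 va))

  ;; Euclidean length of a disjunct-vector.
  (define (norm v) (sqrt (dot v v)))

  ;; Cosine similarity: dot product divided by the product of lengths.
  (define (cosine va vb)
    (/ (dot va vb) (* (norm va) (norm vb))))

  (cosine ate-vec puked-vec)
  ;; => ~0.949 -- high, since both words share the (Ben- pizza+) disjunct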
>> > We expect the determiners "the" and "a" to be similar, and they are:
>> > (cset-vec-cosine (Word "the") (Word "a")) = 0.1554007744026141
>> >
>> > Even more similar:
>> > (cset-vec-cosine (Word "the") (Word "this")) = 0.3083725359820755
>> >
>> > Not very similar at all:
>> > (cset-vec-cosine (Word "the") (Word "that")) = 0.01981486048876119
>> >
>> > Oh hey, "this" and "that" are similar. Notice the triangle with "the":
>> > (cset-vec-cosine (Word "this") (Word "that")) = 0.14342403062507977
>> >
>> > Some more results:
>> > (cset-vec-cosine (Word "this") (Word "these")) = 0.23100101197144984
>> > (cset-vec-cosine (Word "this") (Word "those")) = 0.1099725424243773
>> > (cset-vec-cosine (Word "these") (Word "those")) = 0.13577971016706158
>> >
>> > We expect determiners, nouns, and verbs to all be very different
>> > from one another. And they are:
>> > (cset-vec-cosine (Word "the") (Word "ball")) = 2.3964597196461594e-4
>> > (cset-vec-cosine (Word "the") (Word "jump")) = 0.0
>> > (cset-vec-cosine (Word "ball") (Word "jump")) = 0.0
>> >
>> > We expect verbs to be similar, and they sort of are:
>> > (cset-vec-cosine (Word "run") (Word "jump")) = 0.05184758473652128
>> > (cset-vec-cosine (Word "run") (Word "look")) = 0.05283524652572603
>> >
>> > Since this is a sampling from Wikipedia, there will be very few
>> > "action" verbs, unless the sample accidentally contains articles
>> > about sports. A "common sense" corpus, or a corpus that talks about
>> > what people do, could/should improve the above verb scores. These
>> > verbs are very basic to human behavior, but are rare in most writing.
>> >
>> > I'm thinking that a corpus of children's lit and young-adult lit
>> > would be much better for these kinds of things.
>> >
>> > An adjective:
>> > (cset-vec-cosine (Word "wide") (Word "narrow")) = 0.06262242910851494
>> > (cset-vec-cosine (Word "wide") (Word "look")) = 0.0
>> > (cset-vec-cosine (Word "wide") (Word "ball")) = 0.02449979787750126
>> > (cset-vec-cosine (Word "wide") (Word "the")) = 0.04718158900583385
>> >
>> > (cset-vec-cosine (Word "heavy") (Word "wide")) = 0.05752237416355278
>> >
>> > Here's a pair of antonyms:
>> > (cset-vec-cosine (Word "heavy") (Word "light")) = 0.16760038078849773
>> >
>> > A pronoun:
>> > (cset-vec-cosine (Word "ball") (Word "it")) = 0.009201177048960233
>> > (cset-vec-cosine (Word "wide") (Word "it")) = 0.005522960959398417
>> >
>> > (cset-vec-cosine (Word "the") (Word "it")) = 0.01926824360790382
>> >
>> > Wow!! In English, "it" is usually a male!
>> > (cset-vec-cosine (Word "it") (Word "she")) = 0.1885493638629482
>> > (cset-vec-cosine (Word "it") (Word "he")) = 0.4527656594627214
>> > (cset-vec-cosine (Word "he") (Word "she")) = 0.1877589816088902
>> >
>> > I can post the database on Monday; let me know when you're ready to
>> > receive it.
>> >
>> > --linas
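PS -- re the weighted Jaccard I mentioned at the top: over the same toy
alist vectors from the sketch above, the generalized form on that wiki
page (sum of coordinate-wise minima over sum of coordinate-wise maxima)
would look something like this. The helper names `count' and
`weighted-jaccard' are again just mine for illustration:

  (use-modules (srfi srfi-1))  ; delete-duplicates

  ;; Look up a disjunct's count, defaulting to zero when absent.
  (define (count v key)
    (let ((m (assoc key v))) (if m (cdr m) 0)))

  ;; Generalized (weighted) Jaccard similarity:
  ;;   sum_k min(a_k, b_k) / sum_k max(a_k, b_k)
  (define (weighted-jaccard va vb)
    (let ((keys (delete-duplicates
                  (append (map car va) (map car vb)))))
      (/ (apply + (map (lambda (k) (min (count va k) (count vb k))) keys))
         (apply + (map (lambda (k) (max (count va k) (count vb k))) keys)))))

  ;; For ate-vec and puked-vec: min(3,2) + min(1,0) = 2 over
  ;; max(3,2) + max(1,0) = 4.
  (weighted-jaccard ate-vec puked-vec)  ; => 1/2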
--
Ben Goertzel, PhD
http://goertzel.org

"I am God! I am nothing, I'm play, I am freedom, I am life.
I am the boundary, I am the peak." -- Alexander Scriabin
