Hi Ben,

On Mon, Jun 19, 2017 at 9:01 AM, Ben Goertzel <[email protected]> wrote:
> Hi Linas,
>
> I have read the report now...
>
> Looking at the cosine similarity results, it seems clear the corpus
> you're using is way too small for the purpose (there's no good reason
> "He" and "There" should have such high cosine similarity..., cf. table
> on page 6)

Well, yes, exactly, but I think you missed the point: the reason these are both capitalized is that they both start sentences, and there are simply not that many sentences that start with "He" and "There". The similarity for "he" and "there" is much lower. Assuming sentences have 20 words in them, the capitalized-word corpus is 20x smaller than the non-capitalized corpus. That's why I moved on to the non-capitalized results. And that's my point: the cosine similarity was judged higher only because there are 20x fewer observations of capitalized words! We don't need or want a measure that reports high similarity whenever there are fewer observations!

> Also, cosine similarity is known to be fluky for this sort of
> application. One will get much less fluky pairwise similarities using
> a modern dimension reduction technique like word2vec (but using it on
> feature vectors produced from the MST parses, rather than just from
> word sequences).... However, word2vec does not handle word sense
> disambiguation, which is why I've suggested Adagram (but again,
> modified to use feature vectors produced from the MST parses...)
>
> Basically what I am thinking to explore is
>
> -- Adagram on MST parse based feature vectors, to produce
> reduced-dimension vectors for word-senses
>
> -- Cluster these reduced-dimension vectors to form word-categories
> (not sure what clustering algorithm to use here, could be EM I guess,
> or agglomerative as you've suggested... but the point is clustering is
> easier on these dimension-reduced vectors because the similarity
> degrees are less fluky...)

OK, so we are mis-communicating, misunderstanding each other.
I think the cosine data, for the NON-CAPITALIZED words, is good enough to do clustering on. I was trying to use a variant of PCA for CLUSTERING, and NOT for similarity! I've already got similarity: the PCA was being applied to the cosine similarity. It would be nice to have a better similarity measure than cosine, and maybe Adagram can provide this. But that is not where the action is. I am ready to cluster NOW; I've been ready for weeks, for a month, and I am searching for a high-performance, accurate clustering algorithm that is less ad hoc than k-means or agglomerative, or whatever. Thus, the cryptic note about "hidden multivariate logistic regression" was about doing exactly that, for clustering! In short, clustering is where we're at; better similarity scores would be nice, but are very much of secondary importance.

> -- Tag the corpus using these word categories and do the MI analysis
> and MST parsing again ...

Well, once you've tagged, it's not an MST parse, it's an LG parse.

> I also think we might get better MST parses if we used asymmetric
> relative entropy instead of symmetric mutual information. If you're
> not motivated to experiment with this, maybe we will try it ourselves
> in HK...

Yes, I want to try that, but got distracted by other things. It might be nice to get "better MST parses", but right now, we don't have any evidence that they're bad. They seem to be of rather reasonable quality, to me, and this is attested by the fact that the cosine similarity for the NON-CAPITALIZED words seems pretty good! So again, this is not where the action is. What we need is accurate, high-performance, non-ad-hoc clustering. I guess I'm ready to accept agglomerative clustering, if there's nothing else that's simpler or better. Once clustering is done, I want to move on to morphology, so that I can do e.g. French, or any of the Romance or Slavic languages. It's in morphology where things like asymmetric relative entropy should really start kicking butt.
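For what it's worth, plain average-linkage agglomerative clustering over cosine distances is simple enough to sketch in a few lines. This is only a toy sketch on made-up count vectors (the words, vectors, and the 0.2 threshold are all invented for illustration), not the real pipeline:

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity, used as the merge criterion."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def agglomerate(vectors, threshold):
    """Average-linkage agglomerative clustering: repeatedly merge the
    two closest clusters until no pair is closer than `threshold`."""
    clusters = [[i] for i in range(len(vectors))]

    def avg_dist(c1, c2):
        return sum(cosine_distance(vectors[i], vectors[j])
                   for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        d, ia, ib = min((avg_dist(a, b), ia, ib)
                        for ia, a in enumerate(clusters)
                        for ib, b in enumerate(clusters) if ia < ib)
        if d > threshold:
            break
        clusters[ia].extend(clusters[ib])
        del clusters[ib]
    return clusters

# Made-up vectors: "he"/"she" share contexts, as do "there"/"here".
words = ["he", "she", "there", "here"]
vecs = [[4, 3, 0, 0], [5, 2, 0, 0], [0, 0, 4, 3], [0, 0, 3, 5]]
for cluster in agglomerate(vecs, threshold=0.2):
    print([words[i] for i in cluster])  # → ['he', 'she'], then ['there', 'here']
```

The ad-hockery lives in the linkage rule and the stopping threshold, which is exactly the part I'd like to replace with something more principled.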
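To make the asymmetry concrete: relative entropy D(P||Q) is not symmetric in its arguments, unlike the mutual information we use now. A minimal sketch, with made-up distributions over four contexts (it assumes q is nonzero wherever p is):

```python
import math

def kl_divergence(p, q):
    """Relative entropy D(P||Q) in bits; asymmetric in its arguments.
    Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Invented distributions over four contexts for a word pair:
p = [0.70, 0.10, 0.10, 0.10]   # sharply peaked
q = [0.25, 0.25, 0.25, 0.25]   # uniform

print(round(kl_divergence(p, q), 3))  # → 0.643
print(round(kl_divergence(q, p), 3))  # → 0.620
```

So, unlike MI, the score depends on direction; for an MST parse, that would let the head-to-dependent direction score differently from dependent-to-head.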
I mean, it would be nice for English, but it seems like a lower priority. These other fires are a lot more urgent.

--linas

> -- Ben
>
> On Mon, Jun 19, 2017 at 3:30 PM, Linas Vepstas <[email protected]> wrote:
> > Hi Ben,
> >
> > Here's this week's update on results from the natural language datasets.
> > In short, the datasets seem to be of high quality, based on a sampling
> > of the cosine similarity between words. Looks really nice.
> >
> > Naive PCA stinks as a classifier; I'm looking for something nicer,
> > perhaps based on first principles, and a bit less ad hoc.
> >
> > Since you had the guts to use the words "algebraic topology" in a recent
> > email, I call your bluff and raise: this report includes a brief sketch
> > pointing out that every language, natural or otherwise, has an
> > associated cohomology theory. The path from here to there goes by means
> > of sheaves. Which is semi-obvious, because every book on algebraic
> > topology, or at least differential topology, explains the steps.
> >
> > The part that's new, to me, was the sudden realization that the
> > "disjuncts" and "connector sets" of Link Grammar are in fact just the
> > sheaves (germs, stalks) of a graph. The Link Grammar dictionary, say,
> > for the English language, is a sheaf with a probability distribution
> > on it.
> >
> > BTW, this clarifies why Link Grammar looks so damned modal-logic-ish. I
> > noticed this long ago, and always thought it was mysterious and weird
> > and interesting. Well, it turns out that, for some folks, this is old
> > news: apparently, when the language is first-order logic, then the
> > sheafification of first-order logic gives you Kripke-Joyal semantics;
> > this was spotted in 1965.
> > So I'm guessing that this is generic: take any language, any formal
> > language, or a natural language, look at it from the point of view of
> > sheaves, and then observe that the gluing axioms mean that modal logic
> > describes how the sections glue together. I think that's pretty cool.
> >
> > So, can you find a grad student to work out the details? The thesis
> > title would be "The Cohomology of the English Language". It would fill
> > in all the details in the above paragraphs.
> >
> > --linas
> >
> > --
> > You received this message because you are subscribed to the Google
> > Groups "link-grammar" group.
> > To unsubscribe from this group and stop receiving emails from it, send
> > an email to [email protected].
> > To post to this group, send email to [email protected].
> > Visit this group at https://groups.google.com/group/link-grammar.
> > For more options, visit https://groups.google.com/d/optout.
>
> --
> Ben Goertzel, PhD
> http://goertzel.org
>
> "I am God! I am nothing, I'm play, I am freedom, I am life. I am the
> boundary, I am the peak." -- Alexander Scriabin

--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/opencog.
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/CAHrUA3538w%3D4j79S-D78Bdr5M5Gu%2B%3D%2BNFuoYEmseiAWbgsq_WQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
