Hi Linas, I have read the report now...
Looking at the cosine similarity results, it seems clear the corpus you're using is way too small for the purpose (there's no good reason "He" and "There" should have such high cosine similarity; cf. the table on page 6). Also, cosine similarity is known to be fluky for this sort of application. One can get much less fluky pairwise similarities using a modern dimension-reduction technique like word2vec (but applied to feature vectors produced from the MST parses, rather than just to word sequences). However, word2vec does not handle word sense disambiguation, which is why I've suggested Adagram (but again, modified to use feature vectors produced from the MST parses).

Basically, what I'm thinking of exploring is:

-- Run Adagram on MST-parse-based feature vectors, to produce reduced-dimension vectors for word senses.

-- Cluster these reduced-dimension vectors to form word categories. (I'm not sure what clustering algorithm to use here -- could be EM, I guess, or agglomerative as you've suggested -- but the point is that clustering is easier on these dimension-reduced vectors, because the similarity degrees are less fluky.)

-- Tag the corpus using these word categories, then do the MI analysis and MST parsing again.

I also think we might get better MST parses if we used asymmetric relative entropy instead of symmetric mutual information. If you're not motivated to experiment with this, maybe we will try it ourselves in HK...

-- Ben

On Mon, Jun 19, 2017 at 3:30 PM, Linas Vepstas <[email protected]> wrote:
> Hi Ben,
>
> Here's this week's update on results from the natural language datasets.
> In short, the datasets seem to be of high quality, based on a sampling of
> the cosine similarity between words. Looks really nice.
>
> Naive PCA stinks as a classifier; I'm looking for something nicer, perhaps
> based on first principles, and a bit less ad hoc.
>
> Since you had the guts to use the words "algebraic topology" in a recent
> email, I call your bluff and raise: this report includes a brief sketch
> pointing out that every language, natural or otherwise, has an associated
> cohomology theory. The path from here to there goes by means of sheaves,
> which is semi-obvious, because every book on algebraic topology, or at
> least differential topology, explains the steps.
>
> The part that's new to me was the sudden realization that the "disjuncts"
> and "connector sets" of Link Grammar are in fact just the sheaves (germs,
> stalks) of a graph. The Link Grammar dictionary, say, for the English
> language, is a sheaf with a probability distribution on it.
>
> BTW, this clarifies why Link Grammar looks so damned modal-logic-ish. I
> noticed this long ago, and always thought it was mysterious and weird and
> interesting. Well, it turns out that, for some folks, this is old news:
> apparently, when the language is first-order logic, the sheafification of
> first-order logic gives you Kripke-Joyal semantics; this was spotted in
> 1965. So I'm guessing that this is generic: take any language, any formal
> language, or a natural language, look at it from the point of view of
> sheaves, and then observe that the gluing axioms mean that modal logic
> describes how the sections glue together. I think that's pretty cool.
>
> So, can you find a grad student to work out the details? The thesis title
> would be "The Cohomology of the English Language". It would fill in all
> the details in the above paragraphs.
>
> --linas
>
> --
> You received this message because you are subscribed to the Google Groups
> "link-grammar" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/link-grammar.
> For more options, visit https://groups.google.com/d/optout.
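For concreteness, here is a minimal sketch of the agglomerative option for the clustering step (pure Python; the words, vectors, and similarity threshold are all invented for illustration -- real input would be the reduced-dimension vectors coming out of Adagram):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def agglomerate(vectors, threshold):
    """Greedy average-linkage agglomerative clustering.

    `vectors` maps word -> reduced-dimension vector; the two most similar
    clusters are merged repeatedly, stopping once the best average pairwise
    similarity falls below `threshold`.
    """
    clusters = [[w] for w in vectors]

    def avg_sim(c1, c2):
        sims = [cosine(vectors[a], vectors[b]) for a in c1 for b in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        i, j = max(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: avg_sim(clusters[ij[0]], clusters[ij[1]]),
        )
        if avg_sim(clusters[i], clusters[j]) < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy reduced-dimension vectors (made up for illustration):
vecs = {
    "he": [1.0, 0.1], "she": [0.9, 0.2],
    "run": [0.1, 1.0], "walk": [0.2, 0.9],
}
cats = agglomerate(vecs, threshold=0.8)
# With these toy vectors, the pronouns and the verbs end up in
# separate categories.
```

The point of doing this on the dimension-reduced vectors, rather than on raw counts, is precisely that the pairwise similarities feeding the linkage are less fluky, so a simple threshold like this has a chance of working.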
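And to make the relative-entropy suggestion concrete: symmetric MI is exactly the average, over left-hand words, of the directed relative entropy D( P(right|w) || P(right) ), so scoring individual links by the per-word directed term gives an asymmetric measure built from the same ingredients. A toy sketch (all counts invented):

```python
import math
from collections import Counter

# Toy (left-word, right-word) co-occurrence pairs -- counts invented
# purely for illustration.
pairs = [("he", "ran"), ("he", "ran"), ("he", "walked"),
         ("she", "ran"), ("she", "walked"), ("she", "walked"),
         ("it", "rained")]

joint = Counter(pairs)
left = Counter(l for l, _ in pairs)
right = Counter(r for _, r in pairs)
n = len(pairs)

# Symmetric mutual information between the two word slots.
mi = sum((c / n) * math.log((c / n) / ((left[l] / n) * (right[r] / n)))
         for (l, r), c in joint.items())

# Directed relative entropy D( P(right|l) || P(right) ) for one left word.
def kl_given(l):
    return sum((joint[(l, r)] / left[l]) *
               math.log((joint[(l, r)] / left[l]) / (right[r] / n))
               for r in right if joint[(l, r)] > 0)

# MI is exactly the left-marginal average of these directed divergences;
# the per-word KL term itself is the asymmetric quantity.
avg_kl = sum((left[l] / n) * kl_given(l) for l in left)
assert abs(mi - avg_kl) < 1e-9
```

So the experiment is just to swap the per-link symmetric score for the directed one when building the MST, and see whether the parses improve.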
--
Ben Goertzel, PhD
http://goertzel.org

"I am God! I am nothing, I'm play, I am freedom, I am life. I am the
boundary, I am the peak." -- Alexander Scriabin
