Hi Anton,

I've cc'ed the link-grammar mailing list, because below I describe some concepts for word-sense disambiguation. I'm also cc'ing the opencog mailing list and Ivan Vodisek because, after studying Hilbert systems, I think he's ready to think about how knowledge extraction can be done generically, and not just for language.

-- Linas

On Mon, Mar 25, 2019 at 1:39 AM Anton Kolonin @ Gmail <[email protected]> wrote:

> Hi Linas,
>
> >I'd call it "interesting", but maybe not "golden"
>
> These are randomly selected sentences from "Gutenberg Children" corpus:
>
> http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/lower_LGEng_token/
>
> "Gutenberg Children silver standard" is LG-English parses:
>
> http://langlearn.singularitynet.io/data/parses/English/Gutenberg-Children-Books/test/GCB-LG-English-clean.ull
>
> "Gutenberg Children gold standard" is a subset of "silver standard" with
> semi-random selection of sentences skipping direct speech and doing manual
> verification of the links.
>
> So as long as we are training on "Gutenberg Children" corpus, having the
> test on the same "Gutenberg Children" seems reasonable, right?

Yes. You still need to verify that each word in the "golden" corpus occurs at least N=10 or 20 times in the training corpus. The dependency of accuracy on N is not generally known, but it is very clear that if a word occurs only N=3 times in the training corpus, then whatever is learned about it will be of very low quality.

> But thanks, we may have put more effort in removal of ancient
> constructions and words even if these are present in the corpus.

If you consistently train on 19th-century literature, and then evaluate 19th-century literature comprehension, that's fine. Just don't expect it to work for 21st-century blog posts. The strongest effect will be the number-of-observations (N) effect.

> >Anyway -- you only indicate pair-wise word-links. Is the omission of
> disjuncts intentional?
>
> If you have all links in the sentence, you can construct all of the
> disjuncts with no ambiguity, correct?

No, but only because you did not indicate the link-type. The whole point of a clustering step is to obtain a link-type; if you discard it, you will never get better-than-MST results. The link-type is critical for obtaining the word-classes.
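
To make the dependence on link-types concrete, here is a minimal Python sketch, not taken from link-grammar or the language-learning code. It builds one disjunct per word from a list of links. The function name `disjuncts_from_links`, the `(left, right, label)` tuple layout, and the labels `D`, `S`, `MV` and `ANY` are all assumptions chosen for illustration, and the connector ordering is simplified.

```python
from collections import defaultdict

def disjuncts_from_links(words, links):
    """Build one disjunct per word from (left_index, right_index, label) links."""
    connectors = defaultdict(list)
    for left, right, label in links:
        # A link gives its left word a "+" connector and its right word a "-" one.
        connectors[left].append((right, label + "+"))
        connectors[right].append((left, label + "-"))
    # Order each word's connectors by the position of the word at the far end.
    return [
        (word, " & ".join(conn for _, conn in sorted(connectors[i])))
        for i, word in enumerate(words)
    ]

words = ["the", "lama", "snuffed", "blandly"]

# Hypothetical typed links (labels are illustrative, not checked against a dictionary).
typed = disjuncts_from_links(words, [(0, 1, "D"), (1, 2, "S"), (2, 3, "MV")])

# The same links with the type erased, as in a word-pair-only corpus.
untyped = disjuncts_from_links(words, [(0, 1, "ANY"), (1, 2, "ANY"), (2, 3, "ANY")])

for (word, t), (_, u) in zip(typed, untyped):
    print(f"{word}: typed [{t}]  untyped [{u}]")
```

In this toy case the disjuncts can be assembled mechanically either way, but once the type is erased the noun "lama" and the verb "snuffed" receive the same disjunct, so nothing is left to separate them into different word-classes.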

The whole point of learning is to learn the word-classes; you've learned very little if you know only word-pairs. Consider this example:

    I saw wood
    I saw some wood

A solution that would be "almost perfect" (or "golden") would be this:

    saw: {performer-of-actions}- & {sculptable-mass}+;
    saw: {observer}- & {viewable-thing}+;

These disambiguate the two different senses of the word "saw". It's impossible to have word-sense disambiguation without actually having these disjuncts. The word-pairs alone are not sufficient to report the link-type connecting the words. Clustering gives the other dictionary entries:

    I: {performer-of-actions}+ or {observer}+;
    wood: {sculptable-mass}- or ({quantity-determiner}- & {viewable-thing}-);
    some: {quantity-determiner}+;

Thus, the pronoun "I" also belongs to two different word-sense categories: performers and observers. Compare to:

"The chainsaw saws wood" -- a "chainsaw" can be a "performer of actions" but cannot be an "observer".

"The dog saw some wood" -- dogs can be observers. They can perform some actions, like run and jump, but they cannot saw, hammer, cut, stab.

The link-type is absolutely crucial to understanding a word. The language-learning project is all about learning the link-types. Without correct link-type assignments, you cannot have correct parses.

... which is 100% of the problem with MST. The problem with MST is not so much that "it's not accurate" -- sure, it is not terribly accurate. But even if MST or some MST-replacement was 100% accurate, it would still be "wrong", because it fails to indicate the link-type. If you want to understand a sentence, you MUST know the link-types! Otherwise, you just have "green ideas sleep furiously", which parses, but only because the link types have been erased, or made stupid. Here's a stupid grammar:

    ideas: {adjective}- & {verb}+;
    green: {adjective}+;

which allows "green ideas" to parse.
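
The connector matching behind that statement can be sketched in a few lines of Python. This is only a toy illustration of the rule that a `+` connector links to a `-` connector carrying the same label; it is not the link-grammar parser, and the names `COARSE` and `can_link` are hypothetical.

```python
# Toy dictionary for the coarse grammar above: word -> list of connectors.
# A trailing "+" points right, a trailing "-" points left.
COARSE = {
    "green": ["adjective+"],
    "ideas": ["adjective-", "verb+"],
}

def can_link(left_word, right_word, dictionary):
    """Return the labels on which left_word can link to a right_word on its right."""
    right_pointing = {c[:-1] for c in dictionary[left_word] if c.endswith("+")}
    left_pointing = {c[:-1] for c in dictionary[right_word] if c.endswith("-")}
    # A link is possible only where both words carry the same label.
    return right_pointing & left_pointing

print(can_link("green", "ideas", COARSE))
```

With labels this coarse the check succeeds for any adjective next to any noun. The same function returns an empty set as soon as the labels on the two words differ.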

But of course, this is wrong; it should have been:

    ideas: {noospheric-modifier}- & {concept-manipulating-verb}+;
    green: {physical-object-modifier}+;

and now it is clear that "green ideas" cannot parse, because the link-types clash.

* If you cluster down to 5 or 6 clusters (adjective, verb, noun ...) you will get very low quality grammars.
* If you cluster to 200 or 300 clusters, you get sort-of-OK grammars. This is what deep-learning/neural-nets do, and it is why the deep-learning systems seem to give nice results: 200 or 300 features is enough to start having adequate functional distinctions (e.g. the famous "king - male + female = queen" example, or the "paris - france + germany = berlin" example).
* If you cluster to 3K to 8K clusters, you start having a quite decent model of language.
* Note that wordnet has 117K "synsets".

Note that in the above example:

    wood: {sculptable-mass}- or ({quantity-determiner}- & {viewable-thing}-);

the things in the curly-braces are effectively "synsets".

The next set of goal-posts is to have disjuncts, of maybe low-medium quality, and use these to extract ontologies, e.g.

    {sculptable-mass} is-a {mass} is-a {physical-thing} is-a {thing}

You can try to do this by clustering, but there are probably better ways of discovering ontology.

> >Also -- no hint of any word-classes or part-of-speech tagging? This is
> surely important to evaluate as well, or is this to be done in some other
> way? i.e. to evaluate if "Pivi" was correctly clustered with other given
> names? Or that lama/llama was clustered with other four-legged animals?
>
> We don't have that in MST-Parsing, right? We need this corpus to assess
> the quality of the MST-Parsing so we don't need part-of-speech information
> for that.

But we know that MST parsing is shit. Stop wasting time on MST or trying to "improve" it. We already know that it is close to a high-entropy path to structure; trying to squeeze a few more percent of entropy is not worth the effort, not at this time.
Focus on finding a high-entropy structure extraction algorithm; don't waste time on MST. You should be focusing on extracting disjuncts, word-classes, word-senses, and trying to improve the quality of those. If you obtain a high-entropy path to these structures, the quality of your parses will automatically improve. Focus on the entropy numbers. Try to maximize them.

> The clustering is able to do that anyway - see the graphs in the end of the
> last year report:
>
> https://docs.google.com/document/d/1gxl-hIqPQCYPb9NNkyA3sBYUyfwvJFvT1hZ5ZpXsaPc/edit#heading=h.twoiv52o0tou
>
> >Also -- I can't tell -- is it free of loops, or are loops allowed?
> Allowing loops tends to provide stronger, more accurate parses. Loops act
> as constraints.
>
> The loops and crossing links are not allowed in the MST-Parser now. If we
> allow them in the test corpus, how could it make assessment of MST-Parses
> better?
>
> Note, that we ARE working with MST-Parses now - according to Ben's
> directions.

Not to say bad things about Ben, but I'm certain he has not actually thought about this problem very much. He is very very busy doing other things; he is not thinking about this stuff. I have repeatedly tried to explain the issues to him, and it's quite clear that he is far away from understanding them, from working at the level that I would like to have you and your team work at.

I'm trying to have you make small, quantified baby-steps, to verify the accuracy of your methods and data. What I'm seeing is that you are attempting to make giant steps, without verification, and then getting low-quality results, without understanding the root causes for them. You can't dig yourself out of a ditch, and digging harder and more furiously won't raise the accuracy of the parse results.
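
As one example of a small, quantified verification step, the minimum-count check on the "golden" corpus requested earlier in this reply could be sketched in Python as below. This is a minimal sketch under stated assumptions: sentences are already lower-cased and whitespace-tokenized, the function name `rare_gold_words` is hypothetical, and the tiny in-memory lists stand in for the real corpus files.

```python
from collections import Counter

def rare_gold_words(training_sentences, gold_sentences, min_count=10):
    """Map each gold-corpus word seen fewer than min_count times in training to its count."""
    # Assumption: whitespace tokenization is good enough for a pre-tokenized corpus.
    counts = Counter(w for s in training_sentences for w in s.lower().split())
    rare = {}
    for sentence in gold_sentences:
        for word in sentence.lower().split():
            if counts[word] < min_count:
                rare[word] = counts[word]
    return rare

# Toy stand-ins for the training and gold corpora.
training = ["the dog saw the wood"] * 10 + ["the old beast was whinnying"]
gold = ["the beast saw wood"]

print(rare_gold_words(training, gold, min_count=10))
```

Any gold sentence containing a word reported here would be a candidate for removal from the test set, or at least for separate scoring, since low scores on it say more about missing training data than about the learning algorithm.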

--linas

> We have your MST-Parser-less idea on the map but we are NOT trying it now:
>
> https://github.com/singnet/language-learning/issues/170
>
> We may try it after we explore the account for costs
>
> https://github.com/singnet/language-learning/issues/183
>
> Thanks,
>
> -Anton
>
> On 24.03.2019 9:24, Linas Vepstas wrote:
>
> Also, BTW, link-grammar cannot parse "I just stood there, my hand on the
> knob, trembling like a leaf." correctly. It is one of a class of sentences
> it does not know about. Which is maybe OK, because ideally, the learned
> grammar will be able to do this. But today, LG cannot.
>
> --linas
>
> On Sat, Mar 23, 2019 at 9:12 PM Linas Vepstas <[email protected]>
> wrote:
>
>> Anton,
>>
>> It's certainly an unusual corpus, and it might give you rather low
>> scores. I'd call it "interesting", but maybe not "golden". Although I
>> suppose it depends on your training corpus. Here are some problems that
>> pop out:
>>
>> First sentence --
>> "the old beast was whinnying on his shoulder" -- the word "whinnying" is
>> a fairly rare English verb -- you could read half-a-million wikipedia
>> articles, and not see it once. You could read lots of 19th-century or
>> early-20th century cowboy/adventure novels (like what you'd find on
>> Project Gutenberg) and maybe see it some fair amount. Even then -- to
>> "whinny on a shoulder" seems bizarre. I guess he's hugging the horse? How
>> often does that happen, in any cowboy novel? "to whinny on something" is an
>> extremely rare construction. It will work only if you've correctly
>> categorized "whinny" as a verb that can take a preposition. Are your
>> clustering algos that good, yet, to correctly cluster rare words into
>> appropriate verb categories?
>>
>> Second sentence .. "Jims" is a very uncommon name. Frankly, I've never
>> heard of it as a name before. Your training data is going to be extremely
>> slim on this. And lack of training data means poor statistics, which means
>> low scores.
>> Unless -- again, your clustering code is good enough to place
>> "Jims" in a "proper name" cluster...
>>
>> "the lama snuffed blandly" -- "snuffed" is a very uncommon, almost
>> archaic verb. These days, everyone spells llama with two l's, not one.
>> Unless you're talking about Buddhist monks, it's a typo.
>>
>> "you understand?" is .. awkward. Common in speech, uncommon in writing.
>> Unlikely that you'll have enough training data for this.
>>
>> "Willard" is an uncommon name. Does your training corpus have a
>> sufficient number of mentions of Willard? Do you have clustering working
>> well enough to stick "Willard" into a cluster with other names?
>>
>> "it is so with Sammy Jay" is clearly archaic English.
>>
>> "he hasn't any relations here" is clearly archaic, an olde-fashioned
>> construction.
>>
>> "Pivi said not one word" - again, a clearly old-fashioned construction.
>> Does the training set contain enough examples of "Pivi" to recognize it as
>> a name? Are names clustering correctly?
>>
>> Any sentence with an inversion is going to sound old-fashioned. All of
>> the sentences in that corpus sound old-fashioned. Which maybe is OK if you
>> are training on 19th century Gutenberg texts .. but it's certainly not
>> modern English. Even when I was a child, and I read those old
>> crumbly-yellow paper adventure books, part of the fun was that no one
>> actually talked that way -- not at school, not at home, not on TV. It was
>> clearly from a different time and place -- an adventure.
>>
>> Anyway -- you only indicate pair-wise word-links. Is the omission of
>> disjuncts intentional? Also -- no hint of any word-classes or
>> part-of-speech tagging? This is surely important to evaluate as well, or is
>> this to be done in some other way? i.e. to evaluate if "Pivi" was
>> correctly clustered with other given names? Or that lama/llama was
>> clustered with other four-legged animals?
>>
>> Also -- I can't tell -- is it free of loops, or are loops allowed?
>> Allowing loops tends to provide stronger, more accurate parses. Loops act
>> as constraints.
>>
>> -- Linas
>>
>> On Thu, Mar 21, 2019 at 11:09 PM Anton Kolonin @ Gmail <
>> [email protected]> wrote:
>>
>>> Hi Linas, Andes and whoever understands LG and English well enough both.
>>>
>>> Attached are first 100 sentences for GC "gold standard" - manually
>>> checked based on LG parses.
>>>
>>> We are expecting more to come in the next two weeks.
>>>
>>> To enable that, please have cursory review of the corpus and let us know
>>> if there are corrections still needed so your corrections will be used as a
>>> reference to fix the rest and keep going further.
>>>
>>> Thank you,
>>>
>>> -Anton
>>
>> --
>> cassette tapes - analog TV - film cameras - you
>
> --
> -Anton Kolonin
> skype: akolonin
> cell:
> [email protected]
> https://aigents.com
> https://www.youtube.com/aigents
> https://www.facebook.com/aigents
> https://medium.com/@aigents
> https://steemit.com/@aigents
> https://golos.blog/@aigents
> https://vk.com/aigents
