Linas Vepstas wrote:
> ... knowledge extraction can be done generically, and not just on language.

If link grammar were Turing complete, this might be possible right away. But somehow, I suspect... Isn't this why OpenCog has a "unified rule engine" (URE) instead of link grammar at its core? And with URE, things get much more complicated. I'm sorry, but that is still a Gordian knot to me, given my modest knowledge.

On the other hand, if someone really smart provided automatic grammar extraction by means of an unrestricted grammar <https://en.wikipedia.org/wiki/Unrestricted_grammar>, I believe that would be it.

Thank you,
Ivan V.

On Thu, Mar 28, 2019 at 07:58, Anton Kolonin @ Gmail <[email protected]> wrote:

> Ben, Linas,
>
> >But we know that MST parsing is shit. Stop wasting time on MST or trying
> >to "improve" it.
>
> I think that sounds like a kind of support for the concept of "dumb
> explosive parsing" that was advocated 1+ year ago:
>
> https://docs.google.com/document/d/14MpKLH5_5eVI39PRZuWLZHa1aUS73pJZNZzgigCWwWg/edit#heading=h.aqo9bumb3doy
>
> I also agree with Linas's other reasoning in this thread. I would consider
> giving it a try starting next month if we don't have a breakthrough with
> DNN-MI-milking-based-MST-Parsing by that time.
>
> > can be done generically, and not just on language
>
> I think everyone in bio-informatics dreams of extracting secrets of the "dark
> side of the genome" with something like that ;-)
>
> Cheers,
>
> -Anton
>
> On 28.03.2019 1:24, Linas Vepstas wrote:
>
> Hi Anton,
>
> I've cc'ed the link-grammar mailing list, because I describe below some
> concepts for word-sense disambiguation. I'm also cc'ing the opencog mailing
> list and Ivan Vodisek, because after studying Hilbert systems, I think he's
> ready to think about how knowledge extraction can be done generically, and
> not just on language.
>
> -- Linas
>
> On Mon, Mar 25, 2019 at 1:39 AM Anton Kolonin @ Gmail <[email protected]>
> wrote:
>
>> Hi Linas,
>>
>> >I'd call it "interesting", but maybe not "golden"
>>
>> These are randomly selected sentences from the "Gutenberg Children" corpus:
>>
>> http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/lower_LGEng_token/
>>
>> The "Gutenberg Children silver standard" is the LG-English parses:
>>
>> http://langlearn.singularitynet.io/data/parses/English/Gutenberg-Children-Books/test/GCB-LG-English-clean.ull
>>
>> The "Gutenberg Children gold standard" is a subset of the "silver standard",
>> with a semi-random selection of sentences, skipping direct speech and with
>> manual verification of the links.
>>
>> So as long as we are training on the "Gutenberg Children" corpus, having the
>> test on the same "Gutenberg Children" seems reasonable, right?
>>
>
> Yes. You still need to verify that each word in the "golden" corpus occurs
> at least N=10 or 20 times in the training corpus. The dependency of
> accuracy on N is not generally known, but it is very clear that if a word
> occurs only N=3 times in the training corpus, then whatever is learned
> about it will be very low quality.
>
>> But thanks, we may have put more effort into the removal of ancient
>> constructions and words even if these are present in the corpus.
>>
> If you consistently train on 19th-century literature, and then evaluate
> 19th-century literature comprehension, that's fine. Just don't expect it
> to work for 21st-century blog posts.
>
> The strongest effect will be the N = number-of-observations effect.
>
>> >Anyway -- you only indicate pair-wise word-links. Is the omission of
>> >disjuncts intentional?
>>
>> If you have all links in the sentence, you can construct all of the
>> disjuncts with no ambiguity, correct?
>>
> No, but only because you did not indicate the link-type.
> The whole point
> of a clustering step is to obtain a link-type; if you discard it, you will
> never get better-than-MST results. The link-type is critical for obtaining
> the word-classes. The whole point of learning is to learn the
> word-classes; you've learned very little if you know only word-pairs.
>
> Consider this example:
>
> I saw wood
> I saw some wood
>
> A solution that would be "almost perfect" (or "golden") would be this:
>
> saw: {performer-of-actions}- & {sculptable-mass}+;
> saw: {observer}- & {viewable-thing}+;
>
> These disambiguate the two different senses of the word "saw". It's
> impossible to have word-sense disambiguation without actually having these
> disjuncts. The word-pairs alone are not sufficient to report the link-type
> connecting the words. Clustering gives the other dictionary entries:
>
> I: {performer-of-actions}+ or {observer}+;
> wood: {sculptable-mass}- or ({quantity-determiner}- & {viewable-thing}-);
> some: {quantity-determiner}+;
>
> Thus, the pronoun "I" also belongs to two different word-sense categories:
> performers and observers. Compare to:
>
> "The chainsaw saws wood" -- a "chainsaw" can be a "performer of actions"
> but cannot be an "observer".
> "The dog saw some wood" -- dogs can be observers. They can perform some
> actions, like run or jump, but they cannot saw, hammer, cut, stab.
>
> The link-type is absolutely crucial to understanding a word. The
> language-learning project is all about learning the link-types. Without
> correct link-type assignments, you cannot have correct parses.
>
> ... which is 100% of the problem with MST. The problem with MST is not so
> much that "it's not accurate" - sure, it is not terribly accurate. But even
> if MST or some MST-replacement were 100% accurate, it would still be "wrong"
> because it fails to indicate the link-type. If you want to understand a
> sentence, you MUST know the link-types!
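
The point about link-types can be checked on the toy dictionary quoted above. What follows is only a minimal Python sketch, not Link Grammar's actual parser: the names DICT, match_links and parses are hypothetical, the curly-brace names are treated as plain link-type labels, and planarity, connectivity and connector ordering are deliberately not checked.

```python
from itertools import product

# Toy dictionary transcribed from the example in the thread. Each word maps
# to a list of disjuncts; each disjunct is a tuple of (link_type, direction)
# connectors. "+" links to a word on the right, "-" to a word on the left.
DICT = {
    "I": [(("performer-of-actions", "+"),), (("observer", "+"),)],
    "saw": [
        (("performer-of-actions", "-"), ("sculptable-mass", "+")),
        (("observer", "-"), ("viewable-thing", "+")),
    ],
    "wood": [
        (("sculptable-mass", "-"),),
        (("quantity-determiner", "-"), ("viewable-thing", "-")),
    ],
    "some": [(("quantity-determiner", "+"),)],
}


def match_links(choice):
    """Pair every '+' connector with a same-typed '-' connector to its right.

    `choice` holds one disjunct per word. Returns a list of
    (left_index, right_index, link_type) triples, or None if some connector
    stays unmatched. Simplification: planarity and order are not checked.
    """
    plus = [(i, t) for i, dj in enumerate(choice) for t, d in dj if d == "+"]
    minus = [(i, t) for i, dj in enumerate(choice) for t, d in dj if d == "-"]
    if len(plus) != len(minus):
        return None

    def pair(k, free):
        # Try to match the k-th '+' connector against any unused '-' one.
        if k == len(plus):
            return []
        i, t = plus[k]
        for m in sorted(free):
            j, u = minus[m]
            if u == t and j > i:
                rest = pair(k + 1, free - {m})
                if rest is not None:
                    return [(i, j, t)] + rest
        return None

    return pair(0, frozenset(range(len(minus))))


def parses(sentence):
    """Yield every typed-link parse that the toy dictionary allows."""
    words = sentence.split()
    for choice in product(*(DICT[w] for w in words)):
        links = match_links(choice)
        if links is not None:
            yield [(words[i], words[j], t) for i, j, t in links]


if __name__ == "__main__":
    for s in ("I saw wood", "I saw some wood"):
        print(s, "->", list(parses(s)))
```

Under these assumptions each sentence gets exactly one parse: "I saw wood" selects the performer-of-actions / sculptable-mass disjunct of "saw", while "I saw some wood" selects the observer / viewable-thing one. The untyped pairs I-saw and saw-wood occur in both sentences, so word-pairs alone cannot tell the two senses apart.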
>
> Otherwise, you just have "green ideas sleep furiously", which parses, but
> only because the link types have been erased, or made stupid. Here's a
> stupid grammar:
>
> ideas: {adjective}- & {verb}+;
> green: {adjective}+;
>
> which allows "green ideas" to parse. But of course, this is wrong; it
> should have been:
>
> ideas: {noospheric-modifier}- & {concept-manipulating-verb}+;
> green: {physical-object-modifier}+;
>
> and now it is clear that "green ideas" cannot parse, because the
> link-types clash.
>
> * If you cluster down to 5 or 6 clusters (adjective, verb, noun ...) you
> will get very low quality grammars.
>
> * If you cluster to 200 or 300 clusters, you get sort-of-OK grammars. This
> is what deep-learning/neural-nets do: this is why the deep-learning systems
> seem to give nice results: 200 or 300 features is enough to start having
> adequate functional distinctions (e.g. the famous
> "king - male + female = queen" example, or the
> "paris - france + germany = berlin" example)
>
> * If you cluster to 3K to 8K clusters, you start having a quite decent
> model of language
>
> * Note that WordNet has 117K "synsets".
>
> Note that in the above example:
> wood: {sculptable-mass}- or ({quantity-determiner}- & {viewable-thing}-);
>
> the things in the curly-braces are effectively "synsets".
>
> The next set of goal-posts is to have disjuncts, of maybe low-medium
> quality, and use these to extract ontologies, e.g.
> {sculptable-mass} is-a {mass} is-a {physical-thing} is-a {thing}
>
> You can try to do this by clustering, but there are probably better ways of
> discovering ontology.
>
>> >Also -- no hint of any word-classes or part-of-speech tagging? This is
>> >surely important to evaluate as well, or is this to be done in some other
>> >way? i.e. to evaluate if "Pivi" was correctly clustered with other given
>> >names? Or that lama/llama was clustered with other four-legged animals?
>>
>> We don't have that in MST-Parsing, right?
>> We need this corpus to assess
>> the quality of the MST-Parsing so we don't need part-of-speech information
>> for that.
>>
> But we know that MST parsing is shit. Stop wasting time on MST or trying
> to "improve" it. We already know that it is close to a high-entropy path to
> structure; trying to squeeze a few more percent of entropy is not worth the
> effort, not at this time. Focus on finding a high-entropy structure
> extraction algorithm, don't waste time on MST.
>
> You should be focusing on extracting disjuncts, word-classes, word-senses,
> and trying to improve the quality of those. If you obtain a high-entropy
> path to these structures, the quality of your parses will automatically
> improve. Focus on the entropy numbers. Try to maximize that.
>
>> The clustering is able to do that anyway - see the graphs at the end of
>> last year's report:
>>
>> https://docs.google.com/document/d/1gxl-hIqPQCYPb9NNkyA3sBYUyfwvJFvT1hZ5ZpXsaPc/edit#heading=h.twoiv52o0tou
>>
>> >Also -- I can't tell -- is it free of loops, or are loops allowed?
>> >Allowing loops tends to provide stronger, more accurate parses. Loops act
>> >as constraints.
>>
>> Loops and crossing links are not allowed in the MST-Parser now. If we
>> allow them in the test corpus, how could that make the assessment of
>> MST-Parses better?
>>
>> Note that we ARE working with MST-Parses now, according to Ben's
>> directions.
>>
>
> Not to say bad things about Ben, but I'm certain he has not actually
> thought about this problem very much. He is very very busy doing other
> things; he is not thinking about this stuff. I have repeatedly tried to
> explain the issues to him, and it's quite clear that he is far away from
> understanding them, from working at the level that I would like to have you
> and your team work at.
>
> I'm trying to have you make small, quantified baby-steps, to verify the
> accuracy of your methods and data.
> What I'm seeing is that you are
> attempting to make giant-steps, without verification, and then getting
> low-quality results, without understanding the root causes for them. You
> can't dig yourself out of a ditch, and digging harder and more furiously
> won't raise the accuracy of the parse results.
>
> --linas
>
>> We have your MST-Parser-less idea on the map but we are NOT trying it now:
>>
>> https://github.com/singnet/language-learning/issues/170
>>
>> We may try it after we explore accounting for costs:
>>
>> https://github.com/singnet/language-learning/issues/183
>>
>> Thanks,
>>
>> -Anton
>>
>> On 24.03.2019 9:24, Linas Vepstas wrote:
>>
>> Also, BTW, link-grammar cannot parse "I just stood there, my hand on the
>> knob, trembling like a leaf." correctly. It is one of a class of sentences
>> it does not know about. Which is maybe OK, because ideally, the learned
>> grammar will be able to do this. But today, LG cannot.
>>
>> --linas
>>
>> On Sat, Mar 23, 2019 at 9:12 PM Linas Vepstas <[email protected]>
>> wrote:
>>
>>> Anton,
>>>
>>> It's certainly an unusual corpus, and it might give you rather low
>>> scores. I'd call it "interesting", but maybe not "golden". Although I
>>> suppose it depends on your training corpus. Here are some problems that
>>> pop out:
>>>
>>> First sentence --
>>> "the old beast was whinnying on his shoulder" -- the word "whinnying" is
>>> a fairly rare English verb -- you could read half-a-million Wikipedia
>>> articles, and not see it once. You could read lots of 19th-century or
>>> early-20th-century cowboy/adventure novels (like what you'd find on
>>> Project Gutenberg) and maybe see it some fair amount. Even then -- to
>>> "whinny on a shoulder" seems bizarre. I guess he's hugging the horse? How
>>> often does that happen, in any cowboy novel? "to whinny on something" is an
>>> extremely rare construction. It will work only if you've correctly
>>> categorized "whinny" as a verb that can take a preposition.
>>> Are your
>>> clustering algos that good, yet, to correctly cluster rare words into
>>> appropriate verb categories?
>>>
>>> Second sentence -- "Jims" is a very uncommon name. Frankly, I've never
>>> heard of it as a name before. Your training data is going to be extremely
>>> slim on this. And lack of training data means poor statistics, which means
>>> low scores. Unless -- again, your clustering code is good enough to place
>>> "Jims" in a "proper name" cluster...
>>>
>>> "the lama snuffed blandly" -- "snuffed" is a very uncommon, almost
>>> archaic verb. These days, everyone spells llama with two l's, not one.
>>> Unless you're talking about Buddhist monks, it's a typo.
>>>
>>> "you understand?" is ... awkward. Common in speech, uncommon in writing.
>>> Unlikely that you'll have enough training data for this.
>>>
>>> "Willard" is an uncommon name. Does your training corpus have a
>>> sufficient number of mentions of Willard? Do you have clustering working
>>> well enough to stick "Willard" into a cluster with other names?
>>>
>>> "it is so with Sammy Jay" is clearly archaic English.
>>>
>>> "he hasn't any relations here" is clearly archaic, an old-fashioned
>>> construction.
>>>
>>> "Pivi said not one word" - again, a clearly old-fashioned construction.
>>> Does the training set contain enough examples of "Pivi" to recognize it as
>>> a name? Are names clustering correctly?
>>>
>>> Any sentence with an inversion is going to sound old-fashioned. All of
>>> the sentences in that corpus sound old-fashioned. Which maybe is OK if you
>>> are training on 19th-century Gutenberg texts ... but it's certainly not
>>> modern English. Even when I was a child, and I read those old
>>> crumbly-yellow paper adventure books, part of the fun was that no one
>>> actually talked that way -- not at school, not at home, not on TV. It was
>>> clearly from a different time and place -- an adventure.
>>>
>>> Anyway -- you only indicate pair-wise word-links.
>>> Is the omission of
>>> disjuncts intentional? Also -- no hint of any word-classes or
>>> part-of-speech tagging? This is surely important to evaluate as well, or is
>>> this to be done in some other way? i.e. to evaluate if "Pivi" was
>>> correctly clustered with other given names? Or that lama/llama was
>>> clustered with other four-legged animals?
>>>
>>> Also -- I can't tell -- is it free of loops, or are loops allowed?
>>> Allowing loops tends to provide stronger, more accurate parses. Loops act
>>> as constraints.
>>>
>>> -- Linas
>>>
>>> On Thu, Mar 21, 2019 at 11:09 PM Anton Kolonin @ Gmail <
>>> [email protected]> wrote:
>>>
>>>> Hi Linas, Andes, and whoever understands both LG and English well enough.
>>>>
>>>> Attached are the first 100 sentences for the GC "gold standard" - manually
>>>> checked based on LG parses.
>>>>
>>>> We are expecting more to come in the next two weeks.
>>>>
>>>> To enable that, please have a cursory review of the corpus and let us
>>>> know if there are corrections still needed, so your corrections will be used
>>>> as a reference to fix the rest and keep going further.
>>>>
>>>> Thank you,
>>>>
>>>> -Anton
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "lang-learn" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/lang-learn/bde76364-a578-4ab8-8ac5-2f49f794072b%40gmail.com
>>>> <https://groups.google.com/d/msgid/lang-learn/bde76364-a578-4ab8-8ac5-2f49f794072b%40gmail.com?utm_medium=email&utm_source=footer>.
>>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>> --
>>> cassette tapes - analog TV - film cameras - you
>>
>> --
>> -Anton Kolonin
>> skype: akolonin
>> cell:
>> [email protected]
>> https://aigents.com
>> https://www.youtube.com/aigents
>> https://www.facebook.com/aigents
>> https://medium.com/@aigents
>> https://steemit.com/@aigents
>> https://golos.blog/@aigents
>> https://vk.com/aigents

--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/opencog.
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/CAB5%3Dj6UsBGYrZtSa93Hd0PFhnmK9S7m3sCKJRAuGGABCVhhy_A%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
