I suspect these cannot be common garden-variety milk machines, and the milk machines need to embody some understanding of the grammatical/semantic structures they are working to extract...
-- Ben

On Mon, Apr 1, 2019 at 12:51 PM Anton Kolonin @ Gmail <[email protected]> wrote:
>
> Hi Linas, I like this thread more and more :-)
>
> >But somehow, I suspect... Isn't this why OpenCog has "unified rule engine" (URE) instead of link grammar at its core,
>
> Linas, the approach to the "extraction of phrasemes" goal was discussed in exactly these terms, MST->GL->URE, at the Hong Kong discussion last fall:
> https://docs.google.com/document/d/13YyqtGud0GAbVaFcc94kAd2LhGf7jTr5XDYgiuC294c/edit
>
> That is:
>
> 1) Do MST-parsing to get word links proto-disjuncts
>
> 2) Do Grammar Learning to cluster and conclude word categories and rules with disjuncts
>
> 3) Do a URE-kind-of-thing to build the rules into "phrasemes" or "sections" or "patterns".
>
> However, your current discourse and our current results just show that "no one is able to do reasonable MST-parsing", so the above is just a waste of time, correct?
>
> As we speak, Ben, Alexely, Sergey and Asuares are trying to use DNN/BERT magic to do step 1. To my mind, that may be possible only if the DNN/BERT magic does the trick with steps 2 and 3 done under the hood. In that case, we don't need to do 2 and 3 after we have the DNN/BERT-based model, because we can simply "milk" the grammar rules out of the DNN/BERT mycelium. And, by the way, we don't need the ULL either, because we just need DNN/BERT and rows of different sorts of milk machines around it.
>
> So, instead of solving the problem of constructing the pipeline for learning grammar from raw text, we need to solve the problem of milking the grammar out of a DNN/BERT model trained on these texts, right?
>
> However, either way, we need to understand the algorithmic machinery of how the links assemble into disjuncts and disjuncts assemble into sections, through the universe-scale combinatorial explosion.
> And I agree that clustering and categorizing words and links (and then disjuncts and sections, right?) is part of the process - explicitly in the ULL pipeline or implicitly deep in the DNN/BERT darkness.
>
> Cheers,
>
> -Anton
>
>
> 01.04.2019 9:17, Linas Vepstas:
>
> On Thu, Mar 28, 2019 at 10:22 AM Ivan V. <[email protected]> wrote:
>>
>> Linas Vepstas wrote:
>>
>> >... knowledge extraction can be done generically, and not just on language.
>>
>> If link grammar would be Turing complete, this might be possible right away.
>
> In my experience, thinking about Turing completeness is unproductive and a distraction.
>
>> But somehow, I suspect... Isn't this why OpenCog has "unified rule engine" (URE) instead of link grammar at its core,
>
> No. It has the rule-engine because back then, I did not understand sheaves. I'm starting to think that the rule engine is a strategic mistake. The original idea is that rule-application is the main conceptual abstraction of term-rewriting. One rewrites, or proves theorems, by applying sequences of rules. It turns out that discovering the right sequence is hard. Finding correct long sequences is hard - a combinatorial explosion.
>
> The openpsi system addresses some of these issues. Unfortunately, its current implementation is a tangle of rule-selection mechanisms and theories of human psychology. It's probably better than the URE, but is currently not as powerful.
>
> I'm trying to place a theory of sheaves as a replacement for URE, and as the natural generalization of openpsi, but I've successfully self-sabotaged myself in these efforts.
>
>>
>> and with URE things get much more complicated. I'm sorry, but that is still a Gordian knot to me, considering all of my modest knowledge.
>
> We all have modest knowledge. That is the nature of the human condition.
>
>>
>> On the other hand, if someone really smart would provide automatic grammar extraction by means of unrestricted grammar, I believe that would be it.
>
> Yes, that is the goal of the language-learning project. However, as noted in my last email (on the link-grammar list), it is not enough to just learn a semi-Thue system, declare victory, and go home. The example I gave there:
>
> "I think that you should give that car a second look"
> "you should really give that song a second listen"
> "maybe you should give Sue a second chance".
>
> Learning to parse these "set phrases" or phrasemes is equivalent to learning a semi-Thue system; however, it's not enough to realize that all three are forms of advice-giving, having "conserved" or "fixed" regions "x YOU SHOULD y GIVE z SECOND w", where z is very highly variable, having millions of variations, and w only has a few dozen allowed variations. Note that the words "fixed", "conserved", "variable" are words used in genetics and proteomics and antibody structure. It's the same idea.
>
> The goal of learning lexical functions (LFs) is to learn that all three are advice-giving forms, and also to learn what is, and what can be, plugged in for x,y,z,w. So, although a super-whiz-bang grammar learner capable of learning context-sensitive languages should be able to learn "x YOU SHOULD y GIVE z SECOND w", it still will not know the *meaning* of this phrase. To know the *meaning*, you have to know the acceptable ranges (as fuzzy-sets) of x,y,z,w.
>
> To conclude, thinking about Turing-completeness is a waste of time, because Turing completeness only tells you that "x YOU SHOULD y GIVE z SECOND w" is recursively enumerable; it does not tell you what it actually means.
>
> Put another way: having a universal Turing machine is not the same as knowing how some particular program works.
> Automagically learning a context-sensitive grammar is not enough to know what that grammar is "saying/doing".
>
> -- Linas
>
>>
>> Thank you,
>> Ivan V.
>>
>>
>> Thu, Mar 28, 2019 at 07:58, Anton Kolonin @ Gmail <[email protected]> wrote:
>>>
>>> Ben, Linas,
>>>
>>> >But we know that MST parsing is shit. Stop wasting time on MST or trying to "improve" it.
>>>
>>> I think that sounds like a kind of support for the concept of "dumb explosive parsing" that was being advocated 1+ year ago:
>>>
>>> https://docs.google.com/document/d/14MpKLH5_5eVI39PRZuWLZHa1aUS73pJZNZzgigCWwWg/edit#heading=h.aqo9bumb3doy
>>>
>>> I also agree with Linas's other reasoning in this thread. I would consider giving it a try starting next month if we don't have a breakthrough with DNN-MI-milking-based-MST-Parsing by that time.
>>>
>>> > can be done generically, and not just on language
>>>
>>> I think everyone in bio-informatics dreams of extracting the secrets of the "dark side of the genome" with something like that ;-)
>>>
>>> Cheers,
>>>
>>> -Anton
>>>
>>>
>>> 28.03.2019 1:24, Linas Vepstas wrote:
>>>
>>> Hi Anton,
>>>
>>> I've cc'ed the link-grammar mailing list, because I describe below some concepts for word-sense disambiguation. I'm also cc'ing the opencog mailing list and ivan vodisek, because after studying hilbert systems, I think he's ready to think about how knowledge extraction can be done generically, and not just on language.
>>>
>>> -- Linas
>>>
>>> On Mon, Mar 25, 2019 at 1:39 AM Anton Kolonin @ Gmail <[email protected]> wrote:
>>>>
>>>> Hi Linas,
>>>>
>>>> >I'd call it "interesting", but maybe not "golden"
>>>>
>>>> These are randomly selected sentences from the "Gutenberg Children" corpus:
>>>>
>>>> http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/lower_LGEng_token/
>>>>
>>>> The "Gutenberg Children silver standard" is the LG-English parses:
>>>>
>>>> http://langlearn.singularitynet.io/data/parses/English/Gutenberg-Children-Books/test/GCB-LG-English-clean.ull
>>>>
>>>> The "Gutenberg Children gold standard" is a subset of the "silver standard", with semi-random selection of sentences, skipping direct speech, and manual verification of the links.
>>>>
>>>> So as long as we are training on the "Gutenberg Children" corpus, having the test on the same "Gutenberg Children" seems reasonable, right?
>>>
>>> Yes. You still need to verify that each word in the "golden" corpus occurs at least N=10 or 20 times in the training corpus. The dependency of accuracy on N is not generally known, but it is very clear that if a word occurs only N=3 times in the training corpus, then whatever is learned about it will be very low quality.
>>>
>>>>
>>>> But thanks, we may have put more effort in removal of ancient constructions and words even if these are present in the corpus.
>>>
>>> If you consistently train on 19th century literature, and then evaluate 19th-century literature comprehension, that's fine. Just don't expect it to work for 21st century blog posts.
>>>
>>> The strongest effect will be the N=number of observations effect.
>>>
>>>>
>>>> >Anyway -- you only indicate pair-wise word-links. Is the omission of disjuncts intentional?
>>>>
>>>> If you have all links in the sentence, you can construct all of the disjuncts with no ambiguity, correct?
>>>
>>> No, but only because you did not indicate the link-type.
>>> The whole point of a clustering step is to obtain a link-type; if you discard it, you will never get better-than-MST results. The link-type is critical for obtaining the word-classes. The whole point of learning is to learn the word-classes; you've learned very little if you know only word-pairs.
>>>
>>> Consider this example:
>>>
>>> I saw wood
>>> I saw some wood
>>>
>>> A solution that would be "almost perfect" (or "golden") would be this:
>>>
>>> saw: {performer-of-actions}- & {sculptable-mass}+;
>>> saw: {observer}- & {viewable-thing}+;
>>>
>>> These disambiguate the two different senses of the word "saw". It's impossible to have word-sense disambiguation without actually having these disjuncts. The word-pairs alone are not sufficient to report the link-type connecting the words. Clustering gives the other dictionary entries:
>>>
>>> I: {performer-of-actions}+ or {observer}+;
>>> wood: {sculptable-mass}- or ({quantity-determiner}- & {viewable-thing}-);
>>> some: {quantity-determiner}+;
>>>
>>> Thus, the pronoun "I" also belongs to two different word-sense categories: performers and observers. Compare to:
>>>
>>> "The chainsaw saws wood" -- a "chainsaw" can be a "performer of actions" but cannot be an "observer".
>>> "The dog saw some wood" -- dogs can be observers. They can perform some actions, like run, jump, but they cannot saw, hammer, cut, stab.
>>>
>>> The link-type is absolutely crucial to understanding a word. The language-learning project is all about learning the link-types. Without correct link-type assignments, you cannot have correct parses.
>>>
>>> ... which is 100% of the problem with MST. The problem with MST is not so much that "it's not accurate" - sure, it is not terribly accurate. But even if MST or some MST-replacement was 100% accurate, it would still be "wrong" because it fails to indicate the link-type. If you want to understand a sentence, you MUST know the link-types!
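The point about link-types can be made concrete with a small sketch. It is only an illustration, not how Link Grammar itself is implemented: the dictionary is transcribed from the "I saw wood" example above, the names `DICTIONARY`, `match` and `parses` are hypothetical, and the matcher is a simplification that ignores connector ordering, link crossing and connectivity, all of which a real Link Grammar parse enforces. It only checks that every `+` connector finds a `-` connector of the same type on a later word.

```python
from itertools import product

# Toy dictionary transcribed from the example in the thread. Each disjunct is
# a pair (left_connectors, right_connectors); an "or" in the dictionary entry
# becomes a separate disjunct for the same word.
DICTIONARY = {
    "I": [((), ("performer-of-actions",)), ((), ("observer",))],
    "saw": [
        (("performer-of-actions",), ("sculptable-mass",)),
        (("observer",), ("viewable-thing",)),
    ],
    "wood": [
        (("sculptable-mass",), ()),
        (("quantity-determiner", "viewable-thing"), ()),
    ],
    "some": [((), ("quantity-determiner",))],
}


def match(plus, minus):
    """Pair every '+' connector with a same-typed '-' connector on a later
    word. Returns a list of (left_index, right_index, type), or None."""
    if not plus:
        # Success only if no '-' connector is left dangling.
        return [] if not minus else None
    (i, link_type), rest = plus[0], plus[1:]
    for k, (j, other) in enumerate(minus):
        if other == link_type and i < j:
            tail = match(rest, minus[:k] + minus[k + 1:])
            if tail is not None:
                return [(i, j, link_type)] + tail
    return None


def parses(words):
    """Try every choice of one disjunct per word; keep the typed link sets
    for the choices in which all connectors are satisfied."""
    results = []
    for choice in product(*(DICTIONARY[w] for w in words)):
        plus = [(i, c) for i, (_, right) in enumerate(choice) for c in right]
        minus = [(i, c) for i, (left, _) in enumerate(choice) for c in left]
        links = match(plus, minus)
        if links is not None:
            results.append([(words[i], words[j], t) for i, j, t in links])
    return results


if __name__ == "__main__":
    for sentence in (["I", "saw", "wood"], ["I", "saw", "some", "wood"]):
        print(" ".join(sentence), "->", parses(sentence))
```

With this toy dictionary, "I saw wood" is accepted only with the sawing sense (a `sculptable-mass` link to "wood"), while "I saw some wood" is accepted only with the seeing sense (a `viewable-thing` link). The untyped word pairs alone would not distinguish the two.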
>>>
>>> Otherwise, you just have "green ideas sleep furiously", which parses, but only because the link types have been erased, or made stupid. Here's a stupid grammar:
>>>
>>> ideas: {adjective}- & {verb}+;
>>> green: {adjective}+;
>>>
>>> which allows "green ideas" to parse. But of course, this is wrong; it should have been:
>>>
>>> ideas: {noospheric-modifier}- & {concept-manipulating-verb}+;
>>> green: {physical-object-modifier}+;
>>>
>>> and now it is clear that "green ideas" cannot parse, because the link-types clash.
>>>
>>> * If you cluster down to 5 or 6 clusters (adjective, verb, noun ...) you will get very low quality grammars.
>>>
>>> * If you cluster to 200 or 300 clusters, you get sort-of-OK grammars. This is what deep-learning/neural-nets do: this is why the deep-learning systems seem to give nice results: 200 or 300 features is enough to start having adequate functional distinctions (e.g. the famous "king - male + female = queen" example, or "paris - france + germany = berlin" example)
>>>
>>> * If you cluster to 3K to 8K clusters, you start having a quite decent model of language
>>>
>>> * Note that wordnet has 117K "synsets".
>>>
>>> Note that in the above example:
>>> wood: {sculptable-mass}- or ({quantity-determiner}- & {viewable-thing}-);
>>>
>>> the things in the curly-braces are effectively "synsets".
>>>
>>> The next set of goal-posts is to have disjuncts, of maybe low-medium quality, and use these to extract ontologies. e.g.
>>> {sculptable-mass} is-a {mass} is-a {physical-thing} is-a {thing}
>>>
>>> You can try to do this by clustering but there are probably better ways of discovering ontology.
>>>
>>>>
>>>> >Also -- no hint of any word-classes or part-of-speech tagging? This is surely important to evaluate as well, or is this to be done in some other way? i.e. to evaluate if "Pivi" was correctly clustered with other given names?
>>>> >Or that lama/llama was clustered with other four-legged animals?
>>>>
>>>> We don't have that in MST-Parsing, right? We need this corpus to assess the quality of the MST-Parsing so we don't need part-of-speech information for that.
>>>
>>> But we know that MST parsing is shit. Stop wasting time on MST or trying to "improve" it. We already know that it is close to a high-entropy path to structure; trying to squeeze a few more percent of entropy is not worth the effort, not at this time. Focus on finding a high-entropy structure extraction algorithm, don't waste time on MST.
>>>
>>> You should be focusing on extracting disjuncts, word-classes, word-senses, and trying to improve the quality of those. If you obtain a high-entropy path to these structures, the quality of your parses will automatically improve. Focus on the entropy numbers. Try to maximize that.
>>>
>>>> The clustering is able to do that anyway - see the graphs at the end of last year's report:
>>>>
>>>> https://docs.google.com/document/d/1gxl-hIqPQCYPb9NNkyA3sBYUyfwvJFvT1hZ5ZpXsaPc/edit#heading=h.twoiv52o0tou
>>>>
>>>> >Also -- I can't tell -- is it free of loops, or are loops allowed? Allowing loops tends to provide stronger, more accurate parses. Loops act as constraints.
>>>>
>>>> The loops and crossing links are not allowed in the MST-Parser now. If we allow them in the test corpus, how could it make assessment of MST-Parses better?
>>>>
>>>> Note that we ARE working with MST-Parses now - according to Ben's directions.
>>>
>>> Not to say bad things about Ben, but I'm certain he has not actually thought about this problem very much. He is very very busy doing other things; he is not thinking about this stuff. I have repeatedly tried to explain the issues to him, and it's quite clear that he is far away from understanding them, from working at the level that I would like to have you and your team work at.
>>>
>>> I'm trying to have you make small, quantified baby-steps, to verify the accuracy of your methods and data. What I'm seeing is that you are attempting to make giant-steps, without verification, and then getting low-quality results, without understanding the root causes for them. You can't dig yourself out of a ditch, and digging harder and more furiously won't raise the accuracy of the parse results.
>>>
>>> --linas
>>>
>>>> We have your MST-Parser-less idea on the map but we are NOT trying it now:
>>>>
>>>> https://github.com/singnet/language-learning/issues/170
>>>>
>>>> We may try it after we explore the account for costs
>>>>
>>>> https://github.com/singnet/language-learning/issues/183
>>>>
>>>> Thanks,
>>>>
>>>> -Anton
>>>>
>>>> 24.03.2019 9:24, Linas Vepstas wrote:
>>>>
>>>> Also, BTW, link-grammar cannot parse "I just stood there, my hand on the knob, trembling like a leaf." correctly. It is one of a class of sentences it does not know about. Which is maybe OK, because ideally, the learned grammar will be able to do this. But today, LG cannot.
>>>>
>>>> --linas
>>>>
>>>> On Sat, Mar 23, 2019 at 9:12 PM Linas Vepstas <[email protected]> wrote:
>>>>>
>>>>> Anton,
>>>>>
>>>>> It's certainly an unusual corpus, and it might give you rather low scores. I'd call it "interesting", but maybe not "golden". Although I suppose it depends on your training corpus. Here are some problems that pop out:
>>>>>
>>>>> First sentence --
>>>>> "the old beast was whinnying on his shoulder" -- the word "whinnying" is a fairly rare English verb -- you could read half-a-million wikipedia articles, and not see it once. You could read lots of 19th-century or early-20th century cowboy/adventure novels (like what you'd find on Project Gutenberg) and maybe see it some fair amount. Even then -- to "whinny on a shoulder" seems bizarre.. I guess he's hugging the horse?
>>>>> How often does that happen, in any cowboy novel? "to whinny on something" is an extremely rare construction. It will work only if you've correctly categorized "whinny" as a verb that can take a preposition. Are your clustering algos that good, yet, to correctly cluster rare words into appropriate verb categories?
>>>>>
>>>>> Second sentence .. "Jims" is a very uncommon name. Frankly, I've never heard of it as a name before. Your training data is going to be extremely slim on this. And lack of training data means poor statistics, which means low scores. Unless -- again, your clustering code is good enough to place "Jims" in a "proper name" cluster...
>>>>>
>>>>> "the lama snuffed blandly" -- "snuffed" is a very uncommon, almost archaic verb. These days, everyone spells llama with two l's, not one. Unless you're talking about Buddhist monks, it's a typo.
>>>>>
>>>>> "you understand?" is .. awkward. Common in speech, uncommon in writing. Unlikely that you'll have enough training data for this.
>>>>>
>>>>> "Willard" is an uncommon name. Does your training corpus have a sufficient number of mentions of Willard? Do you have clustering working well enough to stick "Willard" into a cluster with other names?
>>>>>
>>>>> "it is so with Sammy Jay" is clearly archaic English.
>>>>>
>>>>> "he hasn't any relations here" is clearly archaic, an olde-fashioned construction.
>>>>>
>>>>> "Pivi said not one word" - again, a clearly old-fashioned construction. Does the training set contain enough examples of "Pivi" to recognize it as a name? Are names clustering correctly?
>>>>>
>>>>> Any sentence with an inversion is going to sound old-fashioned. All of the sentences in that corpus sound old-fashioned. Which maybe is OK if you are training on 19th century Gutenberg texts .. but it's certainly not modern English.
>>>>> Even when I was a child, and I read those old crumbly-yellow paper adventure books, part of the fun was that no one actually talked that way -- not at school, not at home, not on TV. It was clearly from a different time and place -- an adventure.
>>>>>
>>>>> Anyway -- you only indicate pair-wise word-links. Is the omission of disjuncts intentional? Also -- no hint of any word-classes or part-of-speech tagging? This is surely important to evaluate as well, or is this to be done in some other way? i.e. to evaluate if "Pivi" was correctly clustered with other given names? Or that lama/llama was clustered with other four-legged animals?
>>>>>
>>>>> Also -- I can't tell -- is it free of loops, or are loops allowed? Allowing loops tends to provide stronger, more accurate parses. Loops act as constraints.
>>>>>
>>>>> -- Linas
>>>>>
>>>>> On Thu, Mar 21, 2019 at 11:09 PM Anton Kolonin @ Gmail <[email protected]> wrote:
>>>>>>
>>>>>> Hi Linas, Andes and whoever understands both LG and English well enough.
>>>>>>
>>>>>> Attached are the first 100 sentences for the GC "gold standard" - manually checked based on LG parses.
>>>>>>
>>>>>> We are expecting more to come in the next two weeks.
>>>>>>
>>>>>> To enable that, please do a cursory review of the corpus and let us know if there are corrections still needed, so your corrections can be used as a reference to fix the rest and keep going further.
>>>>>>
>>>>>> Thank you,
>>>>>>
>>>>>> -Anton
>>>>>>
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google Groups "lang-learn" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>>>>>> To post to this group, send email to [email protected].
>>>>>> To view this discussion on the web visit https://groups.google.com/d/msgid/lang-learn/bde76364-a578-4ab8-8ac5-2f49f794072b%40gmail.com.
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>> --
>>>>> cassette tapes - analog TV - film cameras - you
>>>>
>>>> --
>>>> -Anton Kolonin
>>>> skype: akolonin
>>>> cell: +79139250058
>>>> [email protected]
>>>> https://aigents.com
>>>> https://www.youtube.com/aigents
>>>> https://www.facebook.com/aigents
>>>> https://medium.com/@aigents
>>>> https://steemit.com/@aigents
>>>> https://golos.blog/@aigents
>>>> https://vk.com/aigents

--
Ben Goertzel, PhD
http://goertzel.org

"Listen: This world is the lunatic's sphere, / Don't always agree it's real.
/ Even with my feet upon it / And the postman knowing my door / My address is somewhere else." -- Hafiz

--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/opencog.
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/CACYTDBcfJ9z-5eqFFPWH%2Bpx13d0ax9nGPdEeP1GGb8M_0xk9MQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
