[opencog-dev] Re: 100 sentences for GC

Ben Goertzel Sun, 31 Mar 2019 20:53:45 -0700

I suspect these cannot be common garden-variety milk machines, and the
milk machines need to embody some understanding of the
grammatical/semantic structures they are working to extract...


-- Ben

On Mon, Apr 1, 2019 at 12:51 PM Anton Kolonin @ Gmail
<[email protected]> wrote:
>
> Hi Linas, I like this thread more and more :-)
>
> >But somehow, I suspect... Isn't this why OpenCog has "unified rule engine" 
> >(URE) instead of link grammar at its core,
>
> Linas, how to approach the "extraction of phrasemes" goal was discussed in 
> exactly these terms, MST->GL->URE, at last fall's discussion in Hong Kong: 
> https://docs.google.com/document/d/13YyqtGud0GAbVaFcc94kAd2LhGf7jTr5XDYgiuC294c/edit
>
> That is:
>
> 1) Do MST-parsing to get word links proto-disjuncts
>
> 2) Do Grammar Learning to cluster and conclude word categories and rules with 
> disjuncts
>
> 3) Do URE-kind-of-thing to build the rules into "phrasemes" or "sections" or 
> "patterns".
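As an illustration of step 1, here is a minimal sketch of MST-parsing as a maximum spanning tree over pairwise word scores. It is not the project's parser: the `MI` table holds made-up scores, `mst_parse` and `score` are hypothetical names, and the sketch does not enforce the no-crossing-links constraint mentioned later in the thread.

```python
# Hypothetical pairwise scores (stand-ins for mutual information); not real data.
MI = {
    frozenset(("the", "dog")): 3.0,
    frozenset(("dog", "saw")): 2.0,
    frozenset(("saw", "wood")): 2.5,
}

def score(a, b):
    """Score for an unordered word pair; unseen pairs get 0.0."""
    return MI.get(frozenset((a, b)), 0.0)

def mst_parse(words, score):
    """Prim's algorithm for a maximum spanning tree over word positions.

    Returns a sorted list of (left_index, right_index) links without link types.
    """
    in_tree = {0}
    links = []
    while len(in_tree) < len(words):
        # Pick the best-scoring link from a word in the tree to one outside it.
        best = max(
            ((i, j) for i in in_tree for j in range(len(words)) if j not in in_tree),
            key=lambda pair: score(words[pair[0]], words[pair[1]]),
        )
        links.append((min(best), max(best)))
        in_tree.add(best[1])
    return sorted(links)

print(mst_parse("the dog saw wood".split(), score))
```

The output is a set of unlabeled word pairs, which is exactly the limitation discussed further down the thread: the tree carries no link types.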
>
> However, your current discourse and our current results just show that "no 
> one is able to do reasonable MST-parsing", so the above is just a waste of 
> time, correct?
>
> As we speak, Ben, Alexely, Sergey and Asuares are trying to use 
> DNN/BERT magic to do trick 1. To my mind, that is possible only if 
> the DNN/BERT magic does the trick with steps 2 and 3 done under the hood. 
> If so, we don't need to do 2 and 3 after we have the 
> DNN/BERT-based model, because we can simply "milk" the grammar rules out 
> of the DNN/BERT mycelium. And we don't need the ULL either, by the way, 
> because we just need DNN/BERT and rows of different sorts of milk machines 
> around it.
>
> So, instead of solving the problem of constructing the pipeline for learning 
> grammar from raw text we need to solve the problem of milking the grammar out 
> of DNN/BERT model trained on these texts, right?
>
> However, either way, we need to understand the algorithmic machinery of how 
> links assemble into disjuncts and disjuncts assemble into sections, through the 
> universe-scale combinatorial explosion. And I agree that clustering and 
> categorizing words and links (and then disjuncts and sections, right?) is part 
> of the process - explicitly in the ULL pipeline or implicitly deep in the 
> DNN/BERT darkness.
>
> Cheers,
>
> -Anton
>
>
> 01.04.2019 9:17, Linas Vepstas:
>
>
>
> On Thu, Mar 28, 2019 at 10:22 AM Ivan V. <[email protected]> wrote:
>>
>> Linas Vepstas wrote:
>>
>> >... knowledge extraction can be done generically, and not just on language.
>>
>> If link grammar would be Turing complete, this might be possible right away.
>
>
> In my experience, thinking about Turing completeness is unproductive and a 
> distraction.
>
>> But somehow, I suspect... Isn't this why OpenCog has "unified rule engine" 
>> (URE) instead of link grammar at its core,
>
>
> No. It has the rule-engine because back then, I did not understand sheaves.  
> I'm starting to think that the rule engine is a strategic mistake. The 
> original idea is that rule-application is the main conceptual abstraction of 
> term-rewriting.  One rewrites, or proves theorems by applying sequences of 
> rules.  It turns out that discovering the right sequence is hard. Finding 
> correct long sequences is hard - a combinatorial explosion.
>
> The openpsi system addresses some of these issues. Unfortunately, its 
> current implementation is a tangle of rule-selection mechanisms, and theories 
> of human psychology. It's probably better than the URE, but is currently not 
> as powerful.
>
> I'm trying to place a theory of sheaves as a replacement for URE, and as the 
> natural generalization of openpsi, but I've successfully self-sabotaged 
> myself in these efforts.
>
>>
>> and with URE things get much more complicated. I'm sorry, but that is still 
>> a Gordian knot to me, considering all of my modest knowledge.
>
>
> We all have modest knowledge. That is the nature of the human condition.
>
>>
>> On the other hand, if someone really smart would provide automatic grammar 
>> extraction by means of unrestricted grammar, I believe that would be it.
>
>
> Yes, that is the goal of the language-learning project.  However, as noted in 
> my last email (on the link-grammar list) it is not enough to just learn a 
> semi-Thue system, declare victory, and go home.  The example I gave there:
>
>   "I think that you should give that car a second look"
>   "you should really give that song a second listen"
>   "maybe you should give Sue a second chance".
>
> Learning to parse these "set phrases" or phrasemes is equivalent to learning 
> a semi-Thue system; however, it's not enough to realize that all three are 
> forms of advice-giving, having "conserved" or "fixed" regions "x YOU SHOULD y 
> GIVE z SECOND w" where z is very highly variable having millions of 
> variations, and w only has a few dozen allowed variations.  Note that the 
> words "fixed", "conserved", "variable" are words used in genetics and 
> proteomics and antibody structure. It's the same idea.
>
> The goal of learning lexical functions (LF's) is to learn that all three are 
> advice-giving forms, and also to learn what is, and what can be plugged in 
> for x,y,z,w.   So, although a super-whiz-bang grammar learner capable of 
> learning context-sensitive languages should be able to learn "x YOU SHOULD y 
> GIVE z SECOND w", it still will not know the *meaning* of this phrase.  To 
> know the *meaning*, you have to know the acceptable ranges (as fuzzy-sets) of 
> x,y,z,w.
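To make the "conserved regions plus variable slots" idea concrete, here is a minimal sketch that pulls the x, y, z, w fillers out of the three example sentences with a regular expression. The pattern and the names `PHRASEME`, `extract_slots` and `slot_ranges` are illustrative assumptions, and the article "a" is folded into the fixed region for convenience. This is not a learner: the template is written by hand, and collecting fillers is only the first step toward the fuzzy sets mentioned above.

```python
import re
from collections import defaultdict

# Hand-written template for "x YOU SHOULD y GIVE z (A) SECOND w".
# The fixed ("conserved") words are literals; x, y, z, w are named slots.
PHRASEME = re.compile(
    r"^(?P<x>.*?)\byou should\b(?P<y>.*?)\bgive\b(?P<z>.*?)\ba second\b(?P<w>.*)$",
    re.IGNORECASE,
)

def extract_slots(sentence):
    """Return the x, y, z, w fillers, or None if the template does not match."""
    match = PHRASEME.match(sentence.strip())
    if match is None:
        return None
    return {name: text.strip() for name, text in match.groupdict().items()}

def slot_ranges(sentences):
    """Collect the set of observed fillers for each slot across sentences."""
    ranges = defaultdict(set)
    for sentence in sentences:
        slots = extract_slots(sentence)
        if slots is not None:
            for name, filler in slots.items():
                ranges[name].add(filler)
    return dict(ranges)

sentences = [
    "I think that you should give that car a second look",
    "you should really give that song a second listen",
    "maybe you should give Sue a second chance",
]

for name, fillers in sorted(slot_ranges(sentences).items()):
    print(name, sorted(fillers))
```

Matching the template only shows that a sentence has the right shape; knowing which fillers are acceptable for each slot is the part that carries the meaning.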
>
> To conclude, thinking about Turing-completeness is a waste of time, because 
> Turing completeness only tells you that "x YOU SHOULD y GIVE z SECOND w" is 
> recursively enumerable; it does not tell you what it actually means.
>
> Put another way:  having a universal Turing machine is not the same as 
> knowing how some particular program works. Automagically learning a 
> context-sensitive grammar is not enough to know what that grammar is 
> "saying/doing".
>
> -- Linas
>
>>
>>
>> Thank you,
>> Ivan V.
>>
>>
>> On Thu, 28 Mar 2019 at 07:58, Anton Kolonin @ Gmail <[email protected]> 
>> wrote:
>>>
>>> Ben, Linas,
>>>
>>> >But we know that MST parsing is shit.  Stop wasting time on MST or trying 
>>> >to "improve" it.
>>>
>>> I think that sounds like a kind of support for the concept of "dumb explosive 
>>> parsing" that was advocated 1+ year ago:
>>>
>>> https://docs.google.com/document/d/14MpKLH5_5eVI39PRZuWLZHa1aUS73pJZNZzgigCWwWg/edit#heading=h.aqo9bumb3doy
>>>
>>> I also agree with Linas's other reasoning in this thread. I would consider 
>>> giving it a try starting next month if we don't have a breakthrough with 
>>> DNN-MI-milking-based-MST-Parsing by that time.
>>>
>>> > can be done generically, and not just on language
>>>
>>> I think everyone in bio-informatics dreams of extracting secrets of "dark 
>>> side of the genome" with something like that ;-)
>>>
>>> Cheers,
>>>
>>> -Anton
>>>
>>>
>>> 28.03.2019 1:24, Linas Vepstas wrote:
>>>
>>> Hi Anton,
>>>
>>> I've cc'ed the link-grammar mailing list, because I describe below some 
>>> concepts for word-sense disambiguation. I'm also cc'ing the opencog mailing 
>>> list and ivan vodisek, because after studying hilbert systems, I think he's 
>>> ready to think about how knowledge extraction can be done generically, and 
>>> not just on language.
>>>
>>> -- Linas
>>>
>>> On Mon, Mar 25, 2019 at 1:39 AM Anton Kolonin @ Gmail <[email protected]> 
>>> wrote:
>>>>
>>>> Hi Linas,
>>>>
>>>> >I'd call it "interesting", but maybe not "golden"
>>>>
>>>> These are randomly selected sentences from "Gutenberg Children" corpus:
>>>>
>>>> http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/lower_LGEng_token/
>>>>
>>>> "Gutenberg Children silver standard" is LG-English parses:
>>>>
>>>> http://langlearn.singularitynet.io/data/parses/English/Gutenberg-Children-Books/test/GCB-LG-English-clean.ull
>>>>
>>>> "Gutenberg Children gold standard" is subset of "silver standard" with 
>>>> semi-random selection of sentences skipping direct speech and doing manual 
>>>> verification of the links.
>>>>
>>>> So as long as we are training on "Gutenberg Children" corpus, having the 
>>>> test on the same "Gutenberg Children" seems reasonable, right?
>>>
>>>
>>> Yes. You still need to verify that each word in the "golden" corpus occurs 
>>> at least N=10 or 20 times in the training corpus. The dependency of 
>>> accuracy on N is not generally known, but it is very clear that if a word 
>>> occurs only N=3 times in the training corpus, then whatever is learned 
>>> about it will be very low quality.
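The check described above can be sketched in a few lines. This assumes plain whitespace tokenization and lower-casing, which may not match the project's tokenizer; `rare_test_words` and the toy corpora are hypothetical.

```python
from collections import Counter

def rare_test_words(training_sentences, test_sentences, min_count=10):
    """Return test-corpus words seen fewer than min_count times in training.

    The result maps each under-observed word to its training count
    (0 for words that never occur in the training corpus).
    """
    counts = Counter(
        word for sentence in training_sentences for word in sentence.lower().split()
    )
    test_vocab = {
        word for sentence in test_sentences for word in sentence.lower().split()
    }
    return {w: counts[w] for w in sorted(test_vocab) if counts[w] < min_count}

# Toy data, made up for illustration only.
training = ["the old beast ran"] * 12 + ["the lama snuffed blandly"] * 3
test = ["the lama ran", "Pivi ran"]

print(rare_test_words(training, test, min_count=10))
```

Sentences containing any word from the returned dictionary are candidates for removal from the test set, or at least for separate reporting.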
>>>
>>>>
>>>> But thanks, we may have put more effort in removal of ancient 
>>>> constructions and words even if these are present in the corpus.
>>>
>>> If you consistently train on 19th century literature, and then evaluate 
>>> 19th-century literature comprehension, that's fine.  Just don't expect it 
>>> to work for 21st century blog posts.
>>>
>>> The strongest effect will be the N=number of observations effect.
>>>
>>>>
>>>> >Anyway -- you only indicate pair-wise word-links. Is the omission of 
>>>> >disjuncts intentional?
>>>>
>>>> If you have all links in the sentence, you can construct all of the 
>>>> disjuncts with no ambiguity, correct?
>>>
>>> No, but only because you did not indicate the link-type.  The whole point 
>>> of a clustering step is to obtain a link-type; if you discard it, you will 
>>> never get  better-than-MST results. The link-type is critical for obtaining 
>>> the word-classes.  The whole point of learning is to learn the 
>>> word-classes; you've learned very little, if you know only word-pairs.
>>>
>>> Consider this example:
>>>
>>> I saw wood
>>> I saw some wood
>>>
>>> A solution that would be "almost perfect" (or "golden") would be this:
>>>
>>> saw: {performer-of-actions}- & {sculptable-mass}+;
>>> saw: {observer}-  & {viewable-thing}+;
>>>
>>> These disambiguate the two different senses of the word "saw".  It's 
>>> impossible to have word-sense disambiguation without actually having these 
>>> disjuncts.  The word-pairs alone are not sufficient to report the link-type 
>>> connecting the words.  Clustering gives the other dictionary entries:
>>>
>>> I: {performer-of-actions}+ or {observer}+;
>>> wood: {sculptable-mass}- or ({quantity-determiner}- & {viewable-thing}-);
>>> some: {quantity-determiner}+;
>>>
>>> Thus, the pronoun "I" also belongs to two different word-sense categories: 
>>> performers and observers.  Compare to:
>>>
>>> "The chainsaw saws wood"  -- a "chainsaw" can be  a "performer of actions" 
>>> but cannot be an "observer".
>>> "The dog saw some wood" -- dogs can be observers. They can perform some 
>>> actions; like run, jump, but they cannot saw, hammer, cut, stab.
>>>
>>> The link-type is absolutely crucial to understanding a word.  The 
>>> language-learning project is all about learning the link-types. Without 
>>> correct link-type assignments, you cannot have correct parses.
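The toy dictionary above can be checked mechanically. The sketch below is a brute-force illustration, not Link Grammar itself: it assumes every connector in a disjunct is required (the braces are read as cluster names, not as optional connectors), and it ignores link ordering and the no-crossing rule. `LEXICON`, `connectors_pair_up` and `parses` are hypothetical names.

```python
from itertools import product

# Each word has a list of disjuncts (one per sense); a disjunct is
# (left_connectors, right_connectors), each connector named by its link type.
LEXICON = {
    "I": [((), ("performer-of-actions",)), ((), ("observer",))],
    "saw": [
        (("performer-of-actions",), ("sculptable-mass",)),
        (("observer",), ("viewable-thing",)),
    ],
    "wood": [
        (("sculptable-mass",), ()),
        (("quantity-determiner", "viewable-thing"), ()),
    ],
    "some": [((), ("quantity-determiner",))],
}

def connectors_pair_up(chosen):
    """True if every '+' connector meets a same-type '-' connector to its right."""
    plus, minus = {}, {}
    for position, (left, right) in enumerate(chosen):
        for link_type in left:
            minus.setdefault(link_type, []).append(position)
        for link_type in right:
            plus.setdefault(link_type, []).append(position)
    if plus.keys() != minus.keys():
        return False
    for link_type in plus:
        sources, targets = sorted(plus[link_type]), sorted(minus[link_type])
        if len(sources) != len(targets):
            return False
        if any(s >= t for s, t in zip(sources, targets)):
            return False
    return True

def parses(sentence):
    """Yield every consistent assignment of one disjunct per word."""
    words = sentence.split()
    for chosen in product(*(LEXICON[w] for w in words)):
        if connectors_pair_up(chosen):
            yield list(zip(words, chosen))

for sentence in ("I saw wood", "I saw some wood"):
    for parse in parses(sentence):
        print(sentence, "->", parse[1])
```

With this dictionary each sentence has exactly one consistent assignment: "I saw wood" selects the sawing sense and "I saw some wood" selects the seeing sense. The word pairs alone (I-saw, saw-wood) are identical in both cases; only the link types tell the senses apart.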
>>>
>>> ... which is 100% of the problem with MST.  The problem with MST is not so 
>>> much that "it's not accurate" - sure, it is not terribly accurate. But even 
>>> if MST or some MST-replacement was 100% accurate, it would still be "wrong" 
>>> because it fails to indicate the link-type.  If you want to understand a 
>>> sentence, you MUST know the link-types!
>>>
>>> Otherwise, you just have "green ideas sleep furiously", which parses, but 
>>> only because the link types have been erased, or made stupid.  Here's a 
>>> stupid grammar:
>>>
>>> ideas:  {adjective}- & {verb}+;
>>> green: {adjective}+;
>>>
>>> which allows "green ideas" to parse.  But of course, this is wrong; it 
>>> should have been:
>>>
>>> ideas: {noospheric-modifier}- & {concept-manipulating-verb}+;
>>> green: {physical-object-modifier}+;
>>>
>>> and now it is clear that "green ideas" cannot parse, because the link-types 
>>> clash.
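The clash can be shown with a very small check: two adjacent words can link only if the left word offers a "+" connector whose type the right word accepts as a "-" connector. The two lexicons below transcribe the coarse and fine grammars from this message; the data layout and the function name `shared_link_types` are assumptions made for illustration.

```python
# Coarse grammar: link types are just parts of speech.
COARSE = {
    "green": {"+": {"adjective"}, "-": set()},
    "ideas": {"+": {"verb"}, "-": {"adjective"}},
}

# Fine grammar: link types carry semantic categories.
FINE = {
    "green": {"+": {"physical-object-modifier"}, "-": set()},
    "ideas": {"+": {"concept-manipulating-verb"}, "-": {"noospheric-modifier"}},
}

def shared_link_types(lexicon, left_word, right_word):
    """Link types that can connect left_word (on the left) to right_word."""
    return lexicon[left_word]["+"] & lexicon[right_word]["-"]

print(shared_link_types(COARSE, "green", "ideas"))  # {'adjective'}
print(shared_link_types(FINE, "green", "ideas"))    # set()
```

An empty result means the pair cannot link, so "green ideas" is rejected under the fine grammar while the coarse one lets it through.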
>>>
>>> * If you cluster down to 5 or 6 clusters (adjective, verb, noun ...) you 
>>> will get very low quality grammars.
>>>
>>> * If you cluster to 200 or 300 clusters, you get sort-of-OK grammars. This 
>>> is what deep-learning/neural-nets do: this is why the deep-learning systems 
>>> seem to give nice results: 200 or 300 features is enough to start having 
>>> adequate functional distinctions (e.g. the famous "king - male + female = 
>>> queen" example, or the "paris - france + germany = berlin" example)
>>>
>>> * If you cluster to 3K to 8K clusters, you start having a quite decent 
>>> model of language
>>>
>>> * Note that wordnet has 117K "synsets".
>>>
>>> Note that in the above example:
>>> wood: {sculptable-mass}- or ({quantity-determiner}- & {viewable-thing}-);
>>>
>>> the things in the curly-braces are effectively "synsets".
>>>
>>> The next set of goal-posts is to have disjuncts, of maybe low-medium 
>>> quality, and use these to extract ontologies.  e.g.
>>> {sculptable-mass} is-a {mass} is-a {physical-thing} is-a {thing}
>>>
>>> You can try to do this by clustering but there are probably better ways of 
>>> discovering ontology.
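Once such is-a relations have been extracted by whatever method, they can be stored and queried very simply. A minimal sketch, using only the chain given above; the dictionary `IS_A` and the helpers `ancestors` and `is_a` are hypothetical, and each category is assumed to have a single parent.

```python
# Child category -> parent category, transcribed from the example chain.
IS_A = {
    "sculptable-mass": "mass",
    "mass": "physical-thing",
    "physical-thing": "thing",
}

def ancestors(category):
    """Return the chain of increasingly general categories above `category`."""
    chain = []
    while category in IS_A:
        category = IS_A[category]
        chain.append(category)
    return chain

def is_a(category, general):
    """True if `general` is the category itself or one of its ancestors."""
    return category == general or general in ancestors(category)

print(ancestors("sculptable-mass"))  # ['mass', 'physical-thing', 'thing']
```

This says nothing about how the relations are discovered, which is the open question raised above.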
>>>
>>>
>>>>
>>>> >Also -- no hint of any word-classes or part-of-speech tagging? This is 
>>>> >surely important to evaluate as well, or is this to be done in some other 
>>>> >way?  i.e. to evaluate if "Pivi" was correctly clustered with other given 
>>>> >names?  Or that lama/llama was clustered with other four-legged animals?
>>>>
>>>> We don't have that in MST-Parsing, right? We need this corpus to assess 
>>>> the quality of the MST-Parsing so we don't need part-of-speech information 
>>>> for that.
>>>
>>> But we know that MST parsing is shit.  Stop wasting time on MST or trying 
>>> to "improve" it. We already know that it is close to a high-entropy path to 
>>> structure; trying to squeeze a few more percent of entropy is not worth the 
>>> effort, not at this time.  Focus on finding a high-entropy structure 
>>> extraction algorithm, don't waste time on MST.
>>>
>>> You should be focusing on extracting disjuncts, word-classes, word-senses, 
>>> and trying to improve the quality of those.  If you obtain a high-entropy 
>>> path to these structures, the quality of your parses will automatically 
>>> improve.  Focus on the entropy numbers. Try to maximize that.
>>>
>>>> The clustering is able to do that anyway - see the graphs in the end of 
>>>> the last year report:
>>>>
>>>> https://docs.google.com/document/d/1gxl-hIqPQCYPb9NNkyA3sBYUyfwvJFvT1hZ5ZpXsaPc/edit#heading=h.twoiv52o0tou
>>>>
>>>> >Also -- I can't tell -- is it free of loops, or are loops allowed?  
>>>> >Allowing loops tends to provide stronger, more accurate parses.  Loops 
>>>> >act as constraints.
>>>>
>>>> The loops and crossing links are not allowed in the MST-Parser now. If we 
>>>> allow them in the test corpus, how could it make assessment of MST-Parses 
>>>> better?
>>>>
>>>> Note that we ARE working with MST-Parses now - according to Ben's 
>>>> directions.
>>>
>>>
>>> Not to say bad things about Ben, but I'm certain he has not actually 
>>> thought about this problem very much. He is very very busy doing other 
>>> things; he is not thinking about this stuff.  I have repeatedly tried to 
>>> explain the issues to him, and it's quite clear that he is far away from 
>>> understanding them, from working at the level that I would like to have you 
>>> and your team work at.
>>>
>>> I'm trying to have you make small, quantified baby-steps, to verify the 
>>> accuracy of your methods and data.  What I'm seeing is that you are 
>>> attempting to make giant-steps, without verification, and then getting 
>>> low-quality results, without understanding the root causes for them.  You 
>>> can't dig yourself out of a ditch, and digging harder and more furiously 
>>> won't raise the accuracy of the parse results.
>>>
>>> --linas
>>>
>>>> We have your MST-Parser-less idea on the map but we are NOT trying it now:
>>>>
>>>> https://github.com/singnet/language-learning/issues/170
>>>>
>>>> We may try it after we explore the account for costs
>>>>
>>>> https://github.com/singnet/language-learning/issues/183
>>>>
>>>> Thanks,
>>>>
>>>> -Anton
>>>>
>>>> 24.03.2019 9:24, Linas Vepstas wrote:
>>>>
>>>> Also, BTW, link-grammar cannot parse "I just stood there, my hand on the 
>>>> knob, trembling like a leaf." correctly. It is one of a class of sentences 
>>>> it does not know about.  Which is maybe OK, because ideally, the learned 
>>>> grammar will be able to do this. But today, LG cannot.
>>>>
>>>> --linas
>>>>
>>>> On Sat, Mar 23, 2019 at 9:12 PM Linas Vepstas <[email protected]> 
>>>> wrote:
>>>>>
>>>>> Anton,
>>>>>
>>>>> It's certainly an unusual corpus, and it might give you rather low 
>>>>> scores. I'd call it "interesting", but maybe not "golden". Although I 
>>>>> suppose it depends on your training corpus.  Here are some problems that 
>>>>> pop out:
>>>>>
>>>>> First sentence --
>>>>> "the old beast was whinnying on his shoulder" -- the word "whinnying" is 
>>>>> a fairly rare English verb -- you could read half-a-million wikipedia 
>>>>> articles, and not see it once. You could read lots of 19th-century or 
>>>>> early-20th century cowboy/adventure novels, (like what you'd find on 
>>>>> Project Gutenberg) and maybe see it some fair amount. Even then -- to 
>>>>> "whinny on a shoulder" seems bizarre.. I guess he's hugging the horse? 
>>>>> How often does that happen, in any cowboy novel? "to whinny on something" 
>>>>> is an extremely rare construction.  It will work only if you've correctly 
>>>>> categorized "whinny" as a verb that can take a preposition.  Are your 
>>>>> clustering algos that good, yet, to correctly cluster rare words into 
>>>>> appropriate verb categories?
>>>>>
>>>>> Second sentence .. "Jims" is a very uncommon name. Frankly, I've never 
>>>>> heard of it as a name before.  Your training data is going to be 
>>>>> extremely slim on this. And lack of training data means poor statistics, 
>>>>> which means low scores.  Unless -- again, your clustering code is good 
>>>>> enough to place "Jims" in a "proper name" cluster...
>>>>>
>>>>> "the lama snuffed blandly" -- "snuffed" is a very uncommon, almost 
>>>>> archaic verb. These days, everyone spells llama with two l's, not one. 
>>>>> Unless you're talking about Buddhist monks, it's a typo.
>>>>>
>>>>> "you understand?"  is .. awkward. Common in speech, uncommon in writing. 
>>>>> Unlikely that you'll have enough training data for this.
>>>>>
>>>>> "Willard" is an uncommon name. Does your training corpus have a 
>>>>> sufficient number of mentions of Willard? Do you have clustering working 
>>>>> well enough to stick "Willard" into a cluster with other names?
>>>>>
>>>>> "it is so with Sammy Jay" is clearly archaic English.
>>>>>
>>>>> "he hasn't any relations here" is clearly archaic, an olde-fashioned 
>>>>> construction.
>>>>>
>>>>> "Pivi said not one word" - again, a clearly old-fashioned construction. 
>>>>> Does the training set contain enough examples of "Pivi" to recognize it 
>>>>> as a name? Are names clustering correctly?
>>>>>
>>>>> Any sentence with an inversion is going to sound old-fashioned. All of 
>>>>> the sentences in that corpus sound old-fashioned. Which maybe is OK if 
>>>>> you are training on 19th century Gutenberg texts .. but it's certainly not 
>>>>> modern English.  Even when I was a child, and I read those old 
>>>>> crumbly-yellow paper adventure books, part of the fun was that no one 
>>>>> actually talked that way -- not at school, not at home, not on TV. It was 
>>>>> clearly from a different time and place -- an adventure.
>>>>>
>>>>> Anyway -- you only indicate pair-wise word-links. Is the omission of 
>>>>> disjuncts intentional? Also -- no hint of any word-classes or 
>>>>> part-of-speech tagging? This is surely important to evaluate as well, or 
>>>>> is this to be done in some other way?  i.e. to evaluate if "Pivi" was 
>>>>> correctly clustered with other given names?  Or that lama/llama was 
>>>>> clustered with other four-legged animals?
>>>>>
>>>>> Also -- I can't tell -- is it free of loops, or are loops allowed?  
>>>>> Allowing loops tends to provide stronger, more accurate parses.  Loops 
>>>>> act as constraints.
>>>>>
>>>>> -- Linas
>>>>>
>>>>> On Thu, Mar 21, 2019 at 11:09 PM Anton Kolonin @ Gmail 
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> Hi Linas, Andes and whoever understands both LG and English well enough.
>>>>>>
>>>>>> Attached are the first 100 sentences for the GC "gold standard" - manually 
>>>>>> checked based on LG parses.
>>>>>>
>>>>>> We are expecting more to come in the next two weeks.
>>>>>>
>>>>>> To enable that, please do a cursory review of the corpus and let us know 
>>>>>> if corrections are still needed, so that your corrections can be used 
>>>>>> as a reference to fix the rest and keep going.
>>>>>>
>>>>>> Thank you,
>>>>>>
>>>>>> -Anton
>>>>>>
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google 
>>>>>> Groups "lang-learn" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>>> an email to [email protected].
>>>>>> To post to this group, send email to [email protected].
>>>>>> To view this discussion on the web visit 
>>>>>> https://groups.google.com/d/msgid/lang-learn/bde76364-a578-4ab8-8ac5-2f49f794072b%40gmail.com.
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> cassette tapes - analog TV - film cameras - you
>>>>
>>>>
>>>> --
>>>> -Anton Kolonin
>>>> skype: akolonin
>>>> cell: +79139250058
>>>> [email protected]
>>>> https://aigents.com
>>>> https://www.youtube.com/aigents
>>>> https://www.facebook.com/aigents
>>>> https://medium.com/@aigents
>>>> https://steemit.com/@aigents
>>>> https://golos.blog/@aigents
>>>> https://vk.com/aigents
>>>
>



-- 
Ben Goertzel, PhD
http://goertzel.org

"Listen: This world is the lunatic's sphere,  /  Don't always agree
it's real.  /  Even with my feet upon it / And the postman knowing my
door / My address is somewhere else." -- Hafiz

