[opencog-dev] Re: 100 sentences for GC

Ivan V. Thu, 28 Mar 2019 09:31:43 -0700

P.S.

Just to adjust the pitch of my reply: automatic extraction of link grammar
rules is already a more than excellent achievement. Congratulations on this
important piece of the puzzle!


Keep up the good work,
Ivan V.


Thu, 28 Mar 2019 at 16:21, Ivan V. <[email protected]> wrote:

> Linas Vepstas wrote:
>
> >... knowledge extraction can be done generically, and not just on
> language.
>
> If link grammar were Turing complete, this might be possible right
> away. But somehow, I suspect... Isn't this why OpenCog has a "unified rule
> engine" (URE) instead of link grammar at its core? And with URE, things get
> much more complicated. I'm sorry, but that is still a Gordian knot to me,
> given my modest knowledge. On the other hand, if someone really smart
> provided automatic grammar extraction by means of an unrestricted
> grammar <https://en.wikipedia.org/wiki/Unrestricted_grammar>, I believe
> that would be it.
>
> Thank you,
> Ivan V.
>
>
> Thu, 28 Mar 2019 at 07:58, Anton Kolonin @ Gmail <[email protected]>
> wrote:
>
>> Ben, Linas,
>>
>> >But we know that MST parsing is shit.  Stop wasting time on MST or
>> trying to "improve" it.
>>
>> I think that sounds like a kind of support for the concept of "dumb
>> explosive parsing" that was advocated more than a year ago:
>>
>>
>> https://docs.google.com/document/d/14MpKLH5_5eVI39PRZuWLZHa1aUS73pJZNZzgigCWwWg/edit#heading=h.aqo9bumb3doy
>>
>> I also agree with Linas's other reasoning in this thread. I would consider
>> giving it a try starting next month if we don't have a breakthrough with
>> DNN-MI-milking-based-MST-Parsing by that time.
>>
>> > can be done generically, and not just on language
>>
>> I think everyone in bio-informatics dreams of extracting secrets of "dark
>> side of the genome" with something like that ;-)
>>
>> Cheers,
>>
>> -Anton
>>
>>
>> 28.03.2019 1:24, Linas Vepstas wrote:
>>
>> Hi Anton,
>>
>> I've cc'ed the link-grammar mailing list, because I describe below some
>> concepts for word-sense disambiguation. I'm also cc'ing the opencog mailing
>> list and Ivan Vodisek, because after studying Hilbert systems, I think he's
>> ready to think about how knowledge extraction can be done generically, and
>> not just on language.
>>
>> -- Linas
>>
>> On Mon, Mar 25, 2019 at 1:39 AM Anton Kolonin @ Gmail <[email protected]>
>> wrote:
>>
>>> Hi Linas,
>>>
>>> >I'd call it "interesting", but maybe not "golden"
>>>
>>> These are randomly selected sentences from the "Gutenberg Children" corpus:
>>>
>>>
>>> http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/lower_LGEng_token/
>>>
>>> "Gutenberg Children silver standard" is the set of LG-English parses:
>>>
>>>
>>> http://langlearn.singularitynet.io/data/parses/English/Gutenberg-Children-Books/test/GCB-LG-English-clean.ull
>>>
>>> "Gutenberg Children gold standard" is a subset of the "silver standard",
>>> built by semi-random selection of sentences, skipping direct speech, and
>>> manual verification of the links.
>>>
>>> So as long as we are training on the "Gutenberg Children" corpus, testing
>>> on the same "Gutenberg Children" corpus seems reasonable, right?
>>>
>>
>> Yes. You still need to verify that each word in the "golden" corpus
>> occurs at least N=10 or 20 times in the training corpus. The dependency of
>> accuracy on N is not generally known, but it is very clear that if a word
>> occurs only N=3 times in the training corpus, then whatever is learned
>> about it will be very low quality.
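The check described above can be sketched in a few lines. This is only a minimal illustration, assuming both corpora are already tokenized so that splitting on whitespace is adequate; the function name `rare_words` and the `min_count` parameter are hypothetical and not part of any existing tool in this thread.

```python
from collections import Counter

def rare_words(training_sentences, golden_sentences, min_count=10):
    """Return {word: training count} for every word of the golden corpus
    that occurs fewer than min_count times in the training corpus."""
    # Count every whitespace-separated token in the training corpus.
    counts = Counter(w for s in training_sentences for w in s.lower().split())
    # Collect the vocabulary of the golden (test) corpus.
    golden_vocab = {w for s in golden_sentences for w in s.lower().split()}
    # A Counter returns 0 for unseen words, so those are reported too.
    return {w: counts[w] for w in sorted(golden_vocab) if counts[w] < min_count}

if __name__ == "__main__":
    training = ["the horse ran"] * 10 + ["the lama snuffed"]
    golden = ["the lama ran"]
    print(rare_words(training, golden, min_count=10))
```

Any golden sentence containing a reported word could then be dropped or flagged before scoring, so that low scores caused by thin statistics are not mistaken for parser errors.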
>>
>>
>>> But thanks, we may have put more effort into the removal of ancient
>>> constructions and words, even if these are present in the corpus.
>>>
>> If you consistently train on 19th century literature, and then evaluate
>> 19th-century literature comprehension, that's fine.  Just don't expect it
>> to work for 21st century blog posts.
>>
>> The strongest effect will be the N=number of observations effect.
>>
>>
>>> >Anyway -- you only indicate pair-wise word-links. Is the omission of
>>> disjuncts intentional?
>>>
>>> If you have all links in the sentence, you can construct all of the
>>> disjuncts with no ambiguity, correct?
>>>
>> No, but only because you did not indicate the link-type.  The whole point
>> of a clustering step is to obtain a link-type; if you discard it, you will
>> never get  better-than-MST results. The link-type is critical for obtaining
>> the word-classes.  The whole point of learning is to learn the
>> word-classes; you've learned very little, if you know only word-pairs.
>>
>> Consider this example:
>>
>> I saw wood
>> I saw some wood
>>
>> A solution that would be "almost perfect" (or "golden") would be this:
>>
>> saw: {performer-of-actions}- & {sculptable-mass}+;
>> saw: {observer}-  & {viewable-thing}+;
>>
>> These disambiguate the two different senses of the word "saw".  It's
>> impossible to have word-sense disambiguation without actually having these
>> disjuncts.  The word-pairs alone are not sufficient to report the link-type
>> connecting the words.  Clustering gives the other dictionary entries:
>>
>> I: {performer-of-actions}+ or {observer}+;
>> wood: {sculptable-mass}- or ({quantity-determiner}- & {viewable-thing}-);
>> some: {quantity-determiner}+;
>>
>> Thus, the pronoun "I" also belongs to two different word-sense categories:
>> performers and observers.  Compare to:
>>
>> "The chainsaw saws wood"  -- a "chainsaw" can be  a "performer of
>> actions" but cannot be an "observer".
>> "The dog saw some wood" -- dogs can be observers. They can perform some
>> actions; like run, jump, but they cannot saw, hammer, cut, stab.
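To make the role of the link-types concrete, here is a toy sketch of how the dictionary entries above select a word sense. It is not the Link Grammar parser: the braces of the entries are dropped and each label is treated as a required connector, connectors are matched with a simple stack (which enforces no crossing links), and connectivity and the other Link Grammar rules are ignored. The names `TOY_DICT`, `links_for` and `parses` are hypothetical.

```python
from itertools import product

# Toy dictionary: word -> list of disjuncts. Each disjunct is a pair
# (left connectors, right connectors), both listed nearest word first.
TOY_DICT = {
    "I": [((), ("performer-of-actions",)), ((), ("observer",))],
    "saw": [(("performer-of-actions",), ("sculptable-mass",)),
            (("observer",), ("viewable-thing",))],
    "some": [((), ("quantity-determiner",))],
    "wood": [(("sculptable-mass",), ()),
             (("quantity-determiner", "viewable-thing"), ())],
}

def links_for(words, choice):
    """Try one disjunct per word; return the typed links, or None on a clash."""
    stack = []  # open right-pointing connectors: (word index, link type)
    links = []
    for i, (left, right) in enumerate(choice):
        for link_type in left:
            # The nearest open connector must carry the same link type.
            if not stack or stack[-1][1] != link_type:
                return None
            j, _ = stack.pop()
            links.append((words[j], words[i], link_type))
        # Push the farthest connector first so the nearest ends up on top.
        for link_type in reversed(right):
            stack.append((i, link_type))
    return links if not stack else None

def parses(sentence):
    """Return every typed linkage the toy dictionary allows for the sentence."""
    words = sentence.split()
    found = []
    for choice in product(*(TOY_DICT[w] for w in words)):
        links = links_for(words, choice)
        if links is not None:
            found.append(links)
    return found

if __name__ == "__main__":
    for sentence in ("I saw wood", "I saw some wood"):
        print(sentence, "->", parses(sentence))
```

With this toy dictionary each of the two sentences admits exactly one linkage, and the link types on "saw" show which sense was chosen. An untyped word-pair list such as (I, saw), (saw, wood) carries no such information.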
>>
>> The link-type is absolutely crucial to understanding a word.  The
>> language-learning project is all about learning the link-types. Without
>> correct link-type assignments, you cannot have correct parses.
>>
>> ... which is 100% of the problem with MST.  The problem with MST is not
>> so much that "it's not accurate" - sure, it is not terribly accurate. But
>> even if MST or some MST-replacement were 100% accurate, it would still be
>> "wrong" because it fails to indicate the link-type.  If you want to
>> understand a sentence, you MUST know the link-types!
>>
>> Otherwise, you just have "green ideas sleep furiously", which parses, but
>> only because the link types have been erased, or made stupid.  Here's a
>> stupid grammar:
>>
>> ideas:  {adjective}- & {verb}+;
>> green: {adjective}+;
>>
>> which allows "green ideas" to parse.  But of course, this is wrong; it
>> should have been:
>>
>> ideas: {noospheric-modifier}- & {concept-manipulating-verb}+;
>> green: {physical-object-modifier}+;
>>
>> and now it is clear that "green ideas" cannot parse, because the
>> link-types clash.
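The clash can be shown with a very small sketch. It only compares the connector labels of two adjacent words, taken from the two grammars above; the structure of the `COARSE` and `FINE` tables and the helper `can_link` are hypothetical simplifications, not Link Grammar's dictionary format.

```python
# Each word maps "+" (right-pointing) and "-" (left-pointing) to link types.
COARSE = {
    "green": {"+": {"adjective"}},
    "ideas": {"-": {"adjective"}, "+": {"verb"}},
}
FINE = {
    "green": {"+": {"physical-object-modifier"}},
    "ideas": {"-": {"noospheric-modifier"},
              "+": {"concept-manipulating-verb"}},
}

def can_link(grammar, left_word, right_word):
    """True if some '+' connector of left_word has the same link type
    as some '-' connector of right_word."""
    right_pointing = grammar[left_word].get("+", set())
    left_pointing = grammar[right_word].get("-", set())
    return bool(right_pointing & left_pointing)

if __name__ == "__main__":
    print("coarse grammar:", can_link(COARSE, "green", "ideas"))
    print("fine grammar:  ", can_link(FINE, "green", "ideas"))
```

The coarse grammar accepts "green ideas" because both words share the single label "adjective"; the finer labels leave no common link type, so the pair is rejected.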
>>
>> * If you cluster down to 5 or 6 clusters (adjective, verb, noun ...) you
>> will get very low quality grammars.
>>
>> * If you cluster to 200 or 300 clusters, you get sort-of-OK grammars.
>> This is what deep-learning/neural-nets do: this is why the deep-learning
>> systems seem to give nice results: 200 or 300 features is enough to start
>> having adequate functional distinctions (e.g. the famous "king -
>> male+female=queen" example, or "paris-france+germany=berlin" example)
>>
>> * If you cluster to 3K to 8K clusters, you start having a quite decent
>> model of language
>>
>> * Note that wordnet has 117K "synsets".
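For readers who have not seen the "king - male + female = queen" arithmetic mentioned in the list above, here is a toy sketch. The two-dimensional vectors are invented by hand purely for illustration (real systems learn a few hundred dimensions from data), and the names `VECTORS`, `cosine` and `analogy` are hypothetical.

```python
import math

# Hand-made toy vectors; the two axes loosely stand for (royalty, maleness).
VECTORS = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "male": (0.0, 1.0),
    "female": (0.0, -1.0),
    "peasant": (-1.0, 1.0),
}

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c, vectors=VECTORS):
    """Return the word closest to a - b + c, excluding the three inputs."""
    target = tuple(x - y + z
                   for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

if __name__ == "__main__":
    print(analogy("king", "male", "female"))
```

The point of the sketch is only that each dimension acts as one functional distinction; with five or six dimensions (or clusters) there is not enough room for the distinctions a grammar needs.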
>>
>> Note that in the above example:
>> wood: {sculptable-mass}- or ({quantity-determiner}- & {viewable-thing}-);
>>
>> the things in the curly-braces are effectively "synsets".
>>
>> The next set of goal-posts is to have disjuncts, of maybe low-medium
>> quality, and use these to extract ontologies.  e.g.
>> {sculptable-mass} is-a {mass} is-a {physical-thing} is-a {thing}
>>
>> You can try to do this by clustering but there are probably better ways
>> of discovering ontology.
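As a data-structure illustration of the is-a chain above, here is a minimal sketch. It assumes each category has at most one parent, which a real ontology need not satisfy, and it says nothing about how the chain would be discovered. The names `IS_A`, `ancestors` and `is_kind_of` are hypothetical.

```python
# Child category -> parent category, copied from the chain in the text.
IS_A = {
    "sculptable-mass": "mass",
    "mass": "physical-thing",
    "physical-thing": "thing",
}

def ancestors(category, is_a=IS_A):
    """Return the chain of parents of a category, nearest parent first."""
    chain = []
    while category in is_a:
        category = is_a[category]
        chain.append(category)
    return chain

def is_kind_of(category, ancestor, is_a=IS_A):
    """True if category equals ancestor or lies below it in the chain."""
    return category == ancestor or ancestor in ancestors(category, is_a)

if __name__ == "__main__":
    print(ancestors("sculptable-mass"))
    print(is_kind_of("sculptable-mass", "physical-thing"))
```

With such a table, a connector that requires a {physical-thing} could accept any word whose disjunct offers a {sculptable-mass}, which is one way an ontology could relax overly specific link types.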
>>
>>
>>
>>> >Also -- no hint of any word-classes or part-of-speech tagging? This is
>>> surely important to evaluate as well, or is this to be done in some other
>>> way?  i.e. to evaluate if "Pivi" was correctly clustered with other given
>>> names?  Or that lama/llama was clustered with other four-legged animals?
>>>
>>> We don't have that in MST-Parsing, right? We need this corpus to assess
>>> the quality of the MST-Parsing so we don't need part-of-speech information
>>> for that.
>>>
>> But we know that MST parsing is shit.  Stop wasting time on MST or trying
>> to "improve" it. We already know that it is close to a high-entropy path to
>> structure; trying to squeeze a few more percent of entropy is not worth the
>> effort, not at this time.  Focus on finding a high-entropy structure
>> extraction algorithm, don't waste time on MST.
>>
>> You should be focusing on extracting disjuncts, word-classes,
>> word-senses, and trying to improve the quality of those.  If you obtain a
>> high-entropy path to these structures, the quality of your parses will
>> automatically improve.  Focus on the entropy numbers. Try to maximize that.
>>
>>> The clustering is able to do that anyway - see the graphs at the end of
>>> last year's report:
>>>
>>>
>>> https://docs.google.com/document/d/1gxl-hIqPQCYPb9NNkyA3sBYUyfwvJFvT1hZ5ZpXsaPc/edit#heading=h.twoiv52o0tou
>>>
>>> >Also -- I can't tell -- is it free of loops, or are loops allowed?
>>> Allowing loops tends to provide stronger, more accurate parses.  Loops act
>>> as constraints.
>>>
>>> The loops and crossing links are not allowed in the MST-Parser now. If
>>> we allow them in the test corpus, how could it make assessment of
>>> MST-Parses better?
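Whether a given set of links contains loops or crossing links can be checked mechanically. The sketch below assumes links are given as pairs of word positions, which is an assumption about the data and not the actual .ull file layout; `crossing_pairs` and `has_loop` are hypothetical helper names.

```python
def crossing_pairs(links):
    """Return pairs of links that cross, i.e. (a, b) and (c, d) with a < c < b < d."""
    norm = sorted({tuple(sorted(link)) for link in links})
    crossings = []
    for i, (a, b) in enumerate(norm):
        for c, d in norm[i + 1:]:
            # norm is sorted, so a <= c always holds here.
            if a < c < b < d:
                crossings.append(((a, b), (c, d)))
    return crossings

def has_loop(links):
    """True if the undirected link graph contains a cycle (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in {tuple(sorted(link)) for link in links}:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return True  # both ends already connected: this link closes a loop
        parent[root_a] = root_b
    return False

if __name__ == "__main__":
    tree = [(0, 1), (1, 2), (2, 3)]
    print(has_loop(tree), crossing_pairs(tree))
    print(has_loop(tree + [(1, 3)]), crossing_pairs([(0, 2), (1, 3)]))
```

A test corpus that passes both checks is a planar tree, which is the only shape an MST parse can reproduce; a corpus with loops carries extra links that an MST parse will necessarily miss.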
>>>
>>> Note that we ARE working with MST-Parses now, according to Ben's
>>> directions.
>>>
>>
>> Not to say bad things about Ben, but I'm certain he has not actually
>> thought about this problem very much. He is very very busy doing other
>> things; he is not thinking about this stuff.  I have repeatedly tried to
>> explain the issues to him, and it's quite clear that he is far away from
>> understanding them, from working at the level that I would like to have you
>> and your team work at.
>>
>> I'm trying to have you make small, quantified baby-steps, to verify the
>> accuracy of your methods and data.  What I'm seeing is that you are
>> attempting to make giant-steps, without verification, and then getting
>> low-quality results, without understanding the root causes for them.  You
>> can't dig yourself out of a ditch, and digging harder and more furiously
>> won't raise the accuracy of the parse results.
>>
>> --linas
>>
>>> We have your MST-Parser-less idea on the map but we are NOT trying it now:
>>>
>>> https://github.com/singnet/language-learning/issues/170
>>>
>>> We may try it after we explore accounting for costs:
>>>
>>> https://github.com/singnet/language-learning/issues/183
>>>
>>> Thanks,
>>>
>>> -Anton
>>> 24.03.2019 9:24, Linas Vepstas wrote:
>>>
>>> Also, BTW, link-grammar cannot parse "I just stood there, my hand on the
>>> knob, trembling like a leaf." correctly. It is one of a class of sentences
>>> it does not know about.  Which is maybe OK, because ideally, the learned
>>> grammar will be able to do this. But today, LG cannot.
>>>
>>> --linas
>>>
>>> On Sat, Mar 23, 2019 at 9:12 PM Linas Vepstas <[email protected]>
>>> wrote:
>>>
>>>> Anton,
>>>>
>>>> It's certainly an unusual corpus, and it might give you rather low
>>>> scores. I'd call it "interesting", but maybe not "golden". Although I
>>>> suppose it depends on your training corpus.  Here are some problems that
>>>> pop out:
>>>>
>>>> First sentence --
>>>> "the old beast was whinnying on his shoulder" -- the word "whinnying"
>>>> is a fairly rare English verb -- you could read half-a-million wikipedia
>>>> articles, and not see it once. You could read lots of 19th-century or
>>>> early-20th century cowboy/adventure novels, (like what you'd find on
>>>> Project Gutenberg) and maybe see it some fair amount. Even then -- to
>>>> "whinny on a shoulder" seems bizarre.. I guess he's hugging the horse? How
>>>> often does that happen, in any cowboy novel? "to whinny on something" is an
>>>> extremely rare construction.  It will work only if you've correctly
>>>> categorized "whinny" as a verb that can take a preposition.  Are your
>>>> clustering algos that good, yet, to correctly cluster rare words into
>>>> appropriate verb categories?
>>>>
>>>> Second sentence .. "Jims" is a very uncommon name. Frankly, I've never
>>>> heard of it as a name before.  Your training data is going to be extremely
>>>> slim on this. And lack of training data means poor statistics, which means
>>>> low scores.  Unless -- again, your clustering code is good enough to place
>>>> "Jims" in a "proper name" cluster...
>>>>
>>>> "the lama snuffed blandly" -- "snuffed" is a very uncommon, almost
>>>> archaic verb. These days, everyone spells llama with two l's, not one.
>>>> Unless you're talking about Buddhist monks, it's a typo.
>>>>
>>>> "you understand?"  is .. awkward. Common in speech, uncommon in
>>>> writing. Unlikely that you'll have enough training data for this.
>>>>
>>>> "Willard" is an uncommon name. Does your training corpus have a
>>>> sufficient number of mentions of Willard? Do you have clustering working
>>>> well enough to stick "Willard" into a cluster with other names?
>>>>
>>>> "it is so with Sammy Jay" is clearly archaic English.
>>>>
>>>> "he hasn't any relations here" is clearly archaic, an olde-fashioned
>>>> construction.
>>>>
>>>> "Pivi said not one word" - again, a clearly old-fashioned construction.
>>>> Does the training set contain enough examples of "Pivi" to recognize it as
>>>> a name? Are names clustering correctly?
>>>>
>>>> Any sentence with an inversion is going to sound old-fashioned. All of
>>>> the sentences in that corpus sound old-fashioned. Which maybe is OK if you
>>>> are training on 19th century Gutenberg texts .. but it's certainly not
>>>> modern English.  Even when I was a child, and I read those old
>>>> crumbly-yellow paper adventure books, part of the fun was that no one
>>>> actually talked that way -- not at school, not at home, not on TV. It was
>>>> clearly from a different time and place -- an adventure.
>>>>
>>>> Anyway -- you only indicate pair-wise word-links. Is the omission of
>>>> disjuncts intentional? Also -- no hint of any word-classes or
>>>> part-of-speech tagging? This is surely important to evaluate as well, or is
>>>> this to be done in some other way?  i.e. to evaluate if "Pivi" was
>>>> correctly clustered with other given names?  Or that lama/llama was
>>>> clustered with other four-legged animals?
>>>>
>>>> Also -- I can't tell -- is it free of loops, or are loops allowed?
>>>> Allowing loops tends to provide stronger, more accurate parses.  Loops act
>>>> as constraints.
>>>>
>>>> -- Linas
>>>>
>>>> On Thu, Mar 21, 2019 at 11:09 PM Anton Kolonin @ Gmail <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Linas, Andes and whoever understands both LG and English well
>>>>> enough.
>>>>>
>>>>> Attached are the first 100 sentences for the GC "gold standard",
>>>>> manually checked based on LG parses.
>>>>>
>>>>> We are expecting more to come in the next two weeks.
>>>>>
>>>>> To enable that, please give the corpus a cursory review and let us
>>>>> know if corrections are still needed, so that your corrections can be
>>>>> used as a reference to fix the rest and keep going further.
>>>>>
>>>>> Thank you,
>>>>>
>>>>> -Anton
>>>>>
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "lang-learn" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/lang-learn/bde76364-a578-4ab8-8ac5-2f49f794072b%40gmail.com
>>>>> <https://groups.google.com/d/msgid/lang-learn/bde76364-a578-4ab8-8ac5-2f49f794072b%40gmail.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>>>
>>>> --
>>>> cassette tapes - analog TV - film cameras - you
>>>>
>>>
>>>
>>>
>>> --
>>> -Anton Kolonin
>>> skype: akolonin
>>> cell: 
>>> [email protected]
>>> https://aigents.com
>>> https://www.youtube.com/aigents
>>> https://www.facebook.com/aigents
>>> https://medium.com/@aigents
>>> https://steemit.com/@aigents
>>> https://golos.blog/@aigents
>>> https://vk.com/aigents
>>>
>>>
>>
>>
>>
>>

