[opencog-dev] Re: 100 sentences for GC

Ivan V. Thu, 28 Mar 2019 08:22:22 -0700

Linas Vepstas wrote:

>... knowledge extraction can be done generically, and not just on language.


If link grammar were Turing complete, this might be possible right
away. But somehow, I suspect... Isn't this why OpenCog has the "unified rule
engine" (URE) instead of link grammar at its core? And with URE, things get
much more complicated. I'm sorry, but that is still a Gordian knot to me,
given my modest knowledge. On the other hand, if someone really smart
provided automatic grammar extraction by means of an unrestricted grammar
<https://en.wikipedia.org/wiki/Unrestricted_grammar>, I believe that would
be it.

Thank you,
Ivan V.


On Thu, 28 Mar 2019 at 07:58, Anton Kolonin @ Gmail <[email protected]>
wrote:

> Ben, Linas,
>
> >But we know that MST parsing is shit.  Stop wasting time on MST or trying
> to "improve" it.
>
> I think that sounds like a kind of support for the concept of "dumb
> explosive parsing" that was advocated over a year ago:
>
>
> https://docs.google.com/document/d/14MpKLH5_5eVI39PRZuWLZHa1aUS73pJZNZzgigCWwWg/edit#heading=h.aqo9bumb3doy
>
> I also agree with Linas's other reasoning in this thread. I would consider
> giving it a try starting next month if we don't have a breakthrough with
> DNN-MI-milking-based-MST-Parsing by that time.
>
> > can be done generically, and not just on language
>
> I think everyone in bio-informatics dreams of extracting secrets of "dark
> side of the genome" with something like that ;-)
>
> Cheers,
>
> -Anton
>
>
> On 28.03.2019 1:24, Linas Vepstas wrote:
>
> Hi Anton,
>
> I've cc'ed the link-grammar mailing list, because I describe below some
> concepts for word-sense disambiguation. I'm also cc'ing the opencog mailing
> list and Ivan Vodisek, because after studying Hilbert systems, I think he's
> ready to think about how knowledge extraction can be done generically, and
> not just on language.
>
> -- Linas
>
> On Mon, Mar 25, 2019 at 1:39 AM Anton Kolonin @ Gmail <[email protected]>
> wrote:
>
>> Hi Linas,
>>
>> >I'd call it "interesting", but maybe not "golden"
>>
>> These are randomly selected sentences from the "Gutenberg Children" corpus:
>>
>>
>> http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/lower_LGEng_token/
>>
>> The "Gutenberg Children silver standard" is the LG-English parses:
>>
>>
>> http://langlearn.singularitynet.io/data/parses/English/Gutenberg-Children-Books/test/GCB-LG-English-clean.ull
>>
>> The "Gutenberg Children gold standard" is a subset of the "silver standard",
>> with semi-random selection of sentences, skipping direct speech, and
>> manual verification of the links.
>>
>> So as long as we are training on the "Gutenberg Children" corpus, having
>> the test on the same "Gutenberg Children" corpus seems reasonable, right?
>>
>
> Yes. You still need to verify that each word in the "golden" corpus occurs
> at least N=10 or 20 times in the training corpus. The dependency of
> accuracy on N is not generally known, but it is very clear that if a word
> occurs only N=3 times in the training corpus, then whatever is learned
> about it will be very low quality.
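
The frequency check described above could be sketched roughly as follows.
This is only an illustration: the function name, the whitespace
tokenization and the tiny in-memory corpora are assumptions made for the
example, not part of the language-learning pipeline.

```python
from collections import Counter


def rare_gold_words(training_sentences, gold_sentences, min_count=10):
    """Return the gold-corpus words seen fewer than min_count times in training.

    Assumes both corpora are already tokenized so that splitting on
    whitespace yields the words (an assumption for this sketch).
    """
    counts = Counter(
        word for sentence in training_sentences for word in sentence.lower().split()
    )
    gold_vocab = {
        word for sentence in gold_sentences for word in sentence.lower().split()
    }
    # Counter returns 0 for words that never occur in the training corpus.
    return {word: counts[word] for word in sorted(gold_vocab) if counts[word] < min_count}


# Toy data, invented for illustration only.
training = ["the dog saw some wood"] * 12 + ["the lama snuffed blandly"] * 3
gold = ["the lama saw wood"]

print(rare_gold_words(training, gold, min_count=10))
```

Any word this reports is one where a low score on the gold corpus probably
says more about missing training data than about the parser.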
>
>
>> But thanks, we may have put more effort into removal of ancient
>> constructions and words even if these are present in the corpus.
>>
> If you consistently train on 19th century literature, and then evaluate
> 19th-century literature comprehension, that's fine.  Just don't expect it
> to work for 21st century blog posts.
>
> The strongest effect will be the N=number of observations effect.
>
>
>> >Anyway -- you only indicate pair-wise word-links. Is the omission of
>> disjuncts intentional?
>>
>> If you have all links in the sentence, you can construct all of the
>> disjuncts with no ambiguity, correct?
>>
> No, but only because you did not indicate the link-type.  The whole point
> of a clustering step is to obtain a link-type; if you discard it, you will
> never get  better-than-MST results. The link-type is critical for obtaining
> the word-classes.  The whole point of learning is to learn the
> word-classes; you've learned very little, if you know only word-pairs.
>
> Consider this example:
>
> I saw wood
> I saw some wood
>
> A solution that would be "almost perfect" (or "golden") would be this:
>
> saw: {performer-of-actions}- & {sculptable-mass}+;
> saw: {observer}-  & {viewable-thing}+;
>
> These disambiguate the two different senses of the word "saw".  It's
> impossible to have word-sense disambiguation without actually having these
> disjuncts.  The word-pairs alone are not sufficient to report the link-type
> connecting the words.  Clustering gives the other dictionary entries:
>
> I: {performer-of-actions}+ or {observer}+;
> wood: {sculptable-mass}- or ({quantity-determiner}- & {viewable-thing}-);
> some: {quantity-determiner}+;
>
> Thus, the pronoun "I" also belongs to two different word-sense categories:
> performers and observers.  Compare to:
>
> "The chainsaw saws wood"  -- a "chainsaw" can be  a "performer of actions"
> but cannot be an "observer".
> "The dog saw some wood" -- dogs can be observers. They can perform some
> actions; like run, jump, but they cannot saw, hammer, cut, stab.
>
> The link-type is absolutely crucial to understanding a word.  The
> language-learning project is all about learning the link-types. Without
> correct link-type assignments, you cannot have correct parses.
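
As a rough illustration of why the link-type matters, here is a minimal
sketch that picks the sense of "saw" from the typed links attached to it.
It reads the names in curly braces above as link-type names; the sense
labels "cut" and "see", the tuple layout and the function name are
assumptions made for this example only.

```python
# Each disjunct is a set of connectors (link_type, direction):
# "-" links to a word on the left, "+" links to a word on the right.
SAW_DISJUNCTS = {
    "cut": frozenset({("performer-of-actions", "-"), ("sculptable-mass", "+")}),
    "see": frozenset({("observer", "-"), ("viewable-thing", "+")}),
}


def sense_of_saw(typed_links, position):
    """Return the sense labels whose disjunct matches the links on `position`.

    typed_links is a list of (left_index, right_index, link_type) triples;
    a link_type of None stands for an untyped word-pair.
    """
    used = set()
    for left, right, link_type in typed_links:
        if right == position:
            used.add((link_type, "-"))
        elif left == position:
            used.add((link_type, "+"))
    return [label for label, disjunct in SAW_DISJUNCTS.items() if disjunct == used]


# "I saw wood": word indexes 0, 1, 2
cutting = [(0, 1, "performer-of-actions"), (1, 2, "sculptable-mass")]
# "I saw some wood": word indexes 0, 1, 2, 3
seeing = [(0, 1, "observer"), (1, 3, "viewable-thing"), (2, 3, "quantity-determiner")]
# The same pairs as "I saw wood", but with the link-types thrown away.
untyped = [(0, 1, None), (1, 2, None)]

print(sense_of_saw(cutting, 1))
print(sense_of_saw(seeing, 1))
print(sense_of_saw(untyped, 1))
```

With the link-types erased, no disjunct can be matched, so the word-pairs
alone leave the sense undetermined.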
>
> ... which is 100% of the problem with MST.  The problem with MST is not so
> much that "it's not accurate" - sure, it is not terribly accurate. But even
> if MST or some MST-replacement were 100% accurate, it would still be "wrong"
> because it fails to indicate the link-type.  If you want to understand a
> sentence, you MUST know the link-types!
>
> Otherwise, you just have "green ideas sleep furiously", which parses, but
> only because the link types have been erased, or made stupid.  Here's a
> stupid grammar:
>
> ideas:  {adjective}- & {verb}+;
> green: {adjective}+;
>
> which allows "green ideas" to parse.  But of course, this is wrong; it
> should have been:
>
> ideas: {noospheric-modifier}- & {concept-manipulating-verb}+;
> green: {physical-object-modifier}+;
>
> and now it is clear that "green ideas" cannot parse, because the
> link-types clash.
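
The clash can be checked mechanically. The sketch below is a toy, not
link-grammar itself: it encodes the two grammars above as sets of
(link_type, direction) connectors and asks whether the left word offers a
right-pointing connector that the right word accepts. The dictionary
layout and the function name are assumptions made for illustration.

```python
def can_link(dictionary, left_word, right_word):
    """Return the link-types that can join left_word to right_word.

    A link forms when left_word has a "+" connector and right_word has a
    "-" connector of the same link-type.
    """
    offered = {t for t, direction in dictionary[left_word] if direction == "+"}
    accepted = {t for t, direction in dictionary[right_word] if direction == "-"}
    return offered & accepted


# The "stupid" grammar: coarse part-of-speech link-types.
coarse = {
    "green": {("adjective", "+")},
    "ideas": {("adjective", "-"), ("verb", "+")},
}

# The refined grammar: link-types carry the semantic distinction.
fine = {
    "green": {("physical-object-modifier", "+")},
    "ideas": {("noospheric-modifier", "-"), ("concept-manipulating-verb", "+")},
}

print(can_link(coarse, "green", "ideas"))  # one shared link-type: the pair parses
print(can_link(fine, "green", "ideas"))    # empty set: the link-types clash
```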
>
> * If you cluster down to 5 or 6 clusters (adjective, verb, noun ...) you
> will get very low quality grammars.
>
> * If you cluster to 200 or 300 clusters, you get sort-of-OK grammars. This
> is what deep-learning/neural-nets do: this is why the deep-learning systems
> seem to give nice results: 200 or 300 features is enough to start having
> adequate functional distinctions (e.g. the famous "king -
> male+female=queen" example, or "paris-france+germany=berlin" example)
>
> * If you cluster to 3K to 8K clusters, you start having a quite decent
> model of language
>
> * Note that WordNet has 117K "synsets".
>
> Note that in the above example:
> wood: {sculptable-mass}- or ({quantity-determiner}- & {viewable-thing}-);
>
> the things in the curly-braces are effectively "synsets".
>
> The next set of goal-posts is to have disjuncts, of maybe low-medium
> quality, and use these to extract ontologies.  e.g.
> {sculptable-mass} is-a {mass} is-a {physical-thing} is-a {thing}
>
> You can try to do this by clustering but there are probably better ways of
> discovering ontology.
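
For concreteness, the is-a chain in that example can be held as a simple
parent map and walked upward. This only shows the target data structure;
it says nothing about how such an ontology would actually be discovered,
and the names used here are assumptions for the sketch.

```python
# Parent map taken from the example chain above.
IS_A = {
    "sculptable-mass": "mass",
    "mass": "physical-thing",
    "physical-thing": "thing",
}


def ancestors(category, is_a=IS_A):
    """Return the chain of ever more general categories above `category`."""
    chain = []
    seen = {category}
    while category in is_a:
        category = is_a[category]
        if category in seen:  # guard against accidental cycles in the map
            break
        seen.add(category)
        chain.append(category)
    return chain


print(ancestors("sculptable-mass"))
```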
>
>
>
>> >Also -- no hint of any word-classes or part-of-speech tagging? This is
>> surely important to evaluate as well, or is this to be done in some other
>> way?  i.e. to evaluate if "Pivi" was correctly clustered with other given
>> names?  Or that lama/llama was clustered with other four-legged animals?
>>
>> We don't have that in MST-Parsing, right? We need this corpus to assess
>> the quality of the MST-Parsing so we don't need part-of-speech information
>> for that.
>>
> But we know that MST parsing is shit.  Stop wasting time on MST or trying
> to "improve" it. We already know that it is close to a high-entropy path to
> structure; trying to squeeze a few more percent of entropy is not worth the
> effort, not at this time.  Focus on finding a high-entropy structure
> extraction algorithm, don't waste time on MST.
>
> You should be focusing on extracting disjuncts, word-classes, word-senses,
> and trying to improve the quality of those.  If you obtain a high-entropy
> path to these structures, the quality of your parses will automatically
> improve.  Focus on the entropy numbers. Try to maximize that.
>
>> The clustering is able to do that anyway - see the graphs at the end of
>> last year's report:
>>
>>
>> https://docs.google.com/document/d/1gxl-hIqPQCYPb9NNkyA3sBYUyfwvJFvT1hZ5ZpXsaPc/edit#heading=h.twoiv52o0tou
>>
>> >Also -- I can't tell -- is it free of loops, or are loops allowed?
>> Allowing loops tends to provide stronger, more accurate parses.  Loops act
>> as constraints.
>>
>> The loops and crossing links are not allowed in the MST-Parser now. If we
>> allow them in the test corpus, how could it make assessment of MST-Parses
>> better?
>>
>> Note that we ARE working with MST-Parses now - according to Ben's
>> directions.
>>
>
> Not to say bad things about Ben, but I'm certain he has not actually
> thought about this problem very much. He is very very busy doing other
> things; he is not thinking about this stuff.  I have repeatedly tried to
> explain the issues to him, and it's quite clear that he is far away from
> understanding them, from working at the level that I would like to have you
> and your team work at.
>
> I'm trying to have you make small, quantified baby-steps, to verify the
> accuracy of your methods and data.  What I'm seeing is that you are
> attempting to make giant-steps, without verification, and then getting
> low-quality results, without understanding the root causes for them.  You
> can't dig yourself out of a ditch, and digging harder and more furiously
> won't raise the accuracy of the parse results.
>
> --linas
>
>> We have your MST-Parser-less idea on the map but we are NOT trying it now:
>>
>> https://github.com/singnet/language-learning/issues/170
>>
>> We may try it after we explore the account for costs
>>
>> https://github.com/singnet/language-learning/issues/183
>>
>> Thanks,
>>
>> -Anton
>> On 24.03.2019 9:24, Linas Vepstas wrote:
>>
>> Also, BTW, link-grammar cannot parse "I just stood there, my hand on the
>> knob, trembling like a leaf." correctly. It is one of a class of sentences
>> it does not know about.  Which is maybe OK, because ideally, the learned
>> grammar will be able to do this. But today, LG cannot.
>>
>> --linas
>>
>> On Sat, Mar 23, 2019 at 9:12 PM Linas Vepstas <[email protected]>
>> wrote:
>>
>>> Anton,
>>>
>>> It's certainly an unusual corpus, and it might give you rather low
>>> scores. I'd call it "interesting", but maybe not "golden". Although I
>>> suppose it depends on your training corpus.  Here are some problems that
>>> pop out:
>>>
>>> First sentence --
>>> "the old beast was whinnying on his shoulder" -- the word "whinnying" is
>>> a fairly rare English verb -- you could read half-a-million Wikipedia
>>> articles, and not see it once. You could read lots of 19th-century or
>>> early-20th century cowboy/adventure novels, (like what you'd find on
>>> Project Gutenberg) and maybe see it some fair amount. Even then -- to
>>> "whinny on a shoulder" seems bizarre.. I guess he's hugging the horse? How
>>> often does that happen, in any cowboy novel? "to whinny on something" is an
>>> extremely rare construction.  It will work only if you've correctly
>>> categorized "whinny" as a verb that can take a preposition.  Are your
>>> clustering algos that good, yet, to correctly cluster rare words into
>>> appropriate verb categories?
>>>
>>> Second sentence .. "Jims" is a very uncommon name. Frankly, I've never
>>> heard of it as a name before.  Your training data is going to be extremely
>>> slim on this. And lack of training data means poor statistics, which means
>>> low scores.  Unless -- again, your clustering code is good enough to place
>>> "Jims" in a "proper name" cluster...
>>>
>>> "the lama snuffed blandly" -- "snuffed" is a very uncommon, almost
>>> archaic verb. These days, everyone spells llama with two l's, not one.
>>> Unless you're talking about Buddhist monks, it's a typo.
>>>
>>> "you understand?"  is .. awkward. Common in speech, uncommon in writing.
>>> Unlikely that you'll have enough training data for this.
>>>
>>> "Willard" is an uncommon name. Does your training corpus have a
>>> sufficient number of mentions of Willard? Do you have clustering working
>>> well enough to stick "Willard" into a cluster with other names?
>>>
>>> "it is so with Sammy Jay" is clearly archaic English.
>>>
>>> "he hasn't any relations here" is clearly archaic, an olde-fashioned
>>> construction.
>>>
>>> "Pivi said not one word" - again, a clearly old-fashioned construction.
>>> Does the training set contain enough examples of "Pivi" to recognize it as
>>> a name? Are names clustering correctly?
>>>
>>> Any sentence with an inversion is going to sound old-fashioned. All of
>>> the sentences in that corpus sound old-fashioned. Which maybe is OK if you
>>> are training on 19th century Gutenberg texts .. but it's certainly not
>>> modern English.  Even when I was a child, and I read those old
>>> crumbly-yellow paper adventure books, part of the fun was that no one
>>> actually talked that way -- not at school, not at home, not on TV. It was
>>> clearly from a different time and place -- an adventure.
>>>
>>> Anyway -- you only indicate pair-wise word-links. Is the omission of
>>> disjuncts intentional? Also -- no hint of any word-classes or
>>> part-of-speech tagging? This is surely important to evaluate as well, or is
>>> this to be done in some other way?  i.e. to evaluate if "Pivi" was
>>> correctly clustered with other given names?  Or that lama/llama was
>>> clustered with other four-legged animals?
>>>
>>> Also -- I can't tell -- is it free of loops, or are loops allowed?
>>> Allowing loops tends to provide stronger, more accurate parses.  Loops act
>>> as constraints.
>>>
>>> -- Linas
>>>
>>> On Thu, Mar 21, 2019 at 11:09 PM Anton Kolonin @ Gmail <
>>> [email protected]> wrote:
>>>
>>>> Hi Linas, Andes, and whoever understands both LG and English well enough.
>>>>
>>>> Attached are the first 100 sentences for the GC "gold standard" - manually
>>>> checked based on LG parses.
>>>>
>>>> We are expecting more to come in the next two weeks.
>>>>
>>>> To enable that, please give the corpus a cursory review and let us know
>>>> if corrections are still needed, so that your corrections can be used as
>>>> a reference to fix the rest and keep going.
>>>>
>>>> Thank you,
>>>>
>>>> -Anton
>>>>
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "lang-learn" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/lang-learn/bde76364-a578-4ab8-8ac5-2f49f794072b%40gmail.com
>>>> <https://groups.google.com/d/msgid/lang-learn/bde76364-a578-4ab8-8ac5-2f49f794072b%40gmail.com?utm_medium=email&utm_source=footer>
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>>
>>> --
>>> cassette tapes - analog TV - film cameras - you
>>>
>>
>>
>> --
>> -Anton Kolonin
>> skype: akolonin
>> cell: 
>> [email protected]
>> https://aigents.com
>> https://www.youtube.com/aigents
>> https://www.facebook.com/aigents
>> https://medium.com/@aigents
>> https://steemit.com/@aigents
>> https://golos.blog/@aigents
>> https://vk.com/aigents
>>
>>
>

