[opencog-dev] Re: 100 sentences for GC

Linas Vepstas Wed, 27 Mar 2019 11:25:22 -0700

Hi Anton,

I've cc'ed the link-grammar mailing list, because I describe below some
concepts for word-sense disambiguation. I'm also cc'ing the opencog mailing
list and Ivan Vodisek, because after studying Hilbert systems, I think he's
ready to think about how knowledge extraction can be done generically, and
not just on language.


-- Linas

On Mon, Mar 25, 2019 at 1:39 AM Anton Kolonin @ Gmail <[email protected]>
wrote:

> Hi Linas,
>
> >I'd call it "interesting", but maybe not "golden"
>
> These are randomly selected sentences from "Gutenberg Children" corpus:
>
>
> http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/lower_LGEng_token/
>
> "Gutenberg Children silver standard" is LG-English parses:
>
>
> http://langlearn.singularitynet.io/data/parses/English/Gutenberg-Children-Books/test/GCB-LG-English-clean.ull
>
> "Gutenberg Children gold standard" is subset of "silver standard" with
> semi-random selection of sentences skipping direct speech and doing manual
> verification of the links.
>
> So as long as we are training on "Gutenberg Children" corpus, having the
> test on the same "Gutenberg Children" seems reasonable, right?
>

Yes. You still need to verify that each word in the "golden" corpus occurs
at least N=10 or 20 times in the training corpus. The dependency of
accuracy on N is not generally known, but it is very clear that if a word
occurs only N=3 times in the training corpus, then whatever is learned
about it will be very low quality.
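
A rough sketch of that check, in Python. The function name, the whitespace tokenization and the tiny made-up corpora are all assumptions for illustration; the real corpus files would need their own loader.

```python
from collections import Counter

def under_observed(train_sentences, test_sentences, min_count=10):
    """Return {word: training count} for test words seen fewer than min_count times."""
    # Assumption: sentences are already tokenized, so whitespace splitting is enough.
    counts = Counter(w for s in train_sentences for w in s.lower().split())
    test_vocab = {w for s in test_sentences for w in s.lower().split()}
    return {w: counts[w] for w in sorted(test_vocab) if counts[w] < min_count}

# Toy data, invented for this example only.
train = ["the horse ran"] * 12 + ["the old beast was whinnying"]
test = ["the horse was whinnying"]
rare = under_observed(train, test, min_count=10)
print(rare)
```

Any word that shows up in the result is a candidate for dropping the sentence from the "golden" set, or for enlarging the training corpus.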


> But thanks, we may have put more effort in removal of ancient
> constructions and words even if these are present in the corpus.
>
If you consistently train on 19th century literature, and then evaluate
19th-century literature comprehension, that's fine.  Just don't expect it
to work for 21st century blog posts.

The strongest effect will be the N=number of observations effect.


> >Anyway -- you only indicate pair-wise word-links. Is the omission of
> disjuncts intentional?
>
> If you have all links in the sentence, you can construct all of the
> disjuncts with no ambiguity, correct?
>
No, but only because you did not indicate the link-type.  The whole point
of a clustering step is to obtain a link-type; if you discard it, you will
never get  better-than-MST results. The link-type is critical for obtaining
the word-classes.  The whole point of learning is to learn the
word-classes; you've learned very little, if you know only word-pairs.

Consider this example:

I saw wood
I saw some wood

A solution that would be "almost perfect" (or "golden") would be this:

saw: {performer-of-actions}- & {sculptable-mass}+;
saw: {observer}-  & {viewable-thing}+;

These disambiguate the two different senses of the word "saw".  It's
impossible to have word-sense disambiguation without actually having these
disjuncts.  The word-pairs alone are not sufficient to report the link-type
connecting the words.  Clustering gives the other dictionary entries:

I: {performer-of-actions}+ or {observer}+;
wood: {sculptable-mass}- or ({quantity-determiner}- & {viewable-thing}-);
some: {quantity-determiner}+;

Thus, the pronoun "I" also belongs to two different word-sense categories:
performers and observers.  Compare to:

"The chainsaw saws wood"  -- a "chainsaw" can be  a "performer of actions"
but cannot be an "observer".
"The dog saw some wood" -- dogs can be observers. They can perform some
actions; like run, jump, but they cannot saw, hammer, cut, stab.

The link-type is absolutely crucial to understanding a word.  The
language-learning project is all about learning the link-types. Without
correct link-type assignments, you cannot have correct parses.
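
To make the "saw" example concrete, here is a toy sketch in Python. It is not the link-grammar parser: the data layout and function names are invented, the names in curly braces are treated as plain required link-types, and the matcher ignores connector ordering and link crossing. It only shows that typed connectors pick out one sense per sentence.

```python
from itertools import product

# Each disjunct is (left connectors, right connectors), copied from the
# dictionary entries above. The tuple layout itself is an assumption.
LEXICON = {
    "I": [((), ("performer-of-actions",)), ((), ("observer",))],
    "saw": [(("performer-of-actions",), ("sculptable-mass",)),
            (("observer",), ("viewable-thing",))],
    "wood": [(("sculptable-mass",), ()),
             (("quantity-determiner", "viewable-thing"), ())],
    "some": [((), ("quantity-determiner",))],
}

def match(rights, lefts):
    """Pair every '+' connector with a '-' connector of the same type on a later word."""
    if not rights:
        return [] if not lefts else None
    (i, link_type), rest = rights[0], rights[1:]
    for k, (j, other_type) in enumerate(lefts):
        if other_type == link_type and i < j:
            sub = match(rest, lefts[:k] + lefts[k + 1:])
            if sub is not None:
                return [(i, j, link_type)] + sub
    return None

def parses(words):
    """Try one disjunct per word; keep choices where every connector is used once."""
    results = []
    for choice in product(*(LEXICON[w] for w in words)):
        rights = [(i, t) for i, (_, r) in enumerate(choice) for t in r]
        lefts = [(i, t) for i, (l, _) in enumerate(choice) for t in l]
        links = match(rights, lefts)
        if links is not None:
            results.append(sorted(links))
    return results

print(parses("I saw wood".split()))
print(parses("I saw some wood".split()))
```

Each sentence gets exactly one parse here: "I saw wood" links through {sculptable-mass}, while "I saw some wood" links through {viewable-thing}. Erase the link-type names and the two parses can no longer be told apart.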

... which is 100% of the problem with MST.  The problem with MST is not so
much that "it's not accurate" - sure, it is not terribly accurate. But even
if MST or some MST-replacement was 100% accurate, it would still be "wrong"
because it fails to indicate the link-type.  If you want to understand a
sentence, you MUST know the link-types!

Otherwise, you just have "green ideas sleep furiously", which parses, but
only because the link types have been erased, or made stupid.  Here's a
stupid grammar:

ideas:  {adjective}- & {verb}+;
green: {adjective}+;

which allows "green ideas" to parse.  But of course, this is wrong; it
should have been:

ideas: {noospheric-modifier}- & {concept-manipulating-verb}+;
green: {physical-object-modifier}+;

and now it is clear that "green ideas" cannot parse, because the link-types
clash.
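
The same point as a small Python sketch. The two grammars are the ones written out above; the dictionary layout and the function name are assumptions made just for this illustration.

```python
# '+' connectors look to the right, '-' connectors look to the left.
COARSE = {
    "green": {"+": {"adjective"}, "-": set()},
    "ideas": {"+": {"verb"}, "-": {"adjective"}},
}
FINE = {
    "green": {"+": {"physical-object-modifier"}, "-": set()},
    "ideas": {"+": {"concept-manipulating-verb"}, "-": {"noospheric-modifier"}},
}

def shared_link_types(grammar, left_word, right_word):
    """Link-types that could join left_word to right_word (empty set means no link)."""
    return grammar[left_word]["+"] & grammar[right_word]["-"]

print(shared_link_types(COARSE, "green", "ideas"))  # a link exists
print(shared_link_types(FINE, "green", "ideas"))    # no link-type in common
```

With the coarse grammar the pair links through "adjective"; with the fine-grained one the intersection is empty, which is the clash.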

* If you cluster down to 5 or 6 clusters (adjective, verb, noun ...) you
will get very low quality grammars.

* If you cluster to 200 or 300 clusters, you get sort-of-OK grammars. This
is what deep-learning/neural-nets do: this is why the deep-learning systems
seem to give nice results: 200 or 300 features is enough to start having
adequate functional distinctions (e.g. the famous "king -
male+female=queen" example, or "paris-france+germany=berlin" example)

* If you cluster to 3K to 8K clusters, you start having a quite decent
model of language

* Note that wordnet has 117K "synsets".

Note that in the above example:
wood: {sculptable-mass}- or ({quantity-determiner}- & {viewable-thing}-);

the things in the curly-braces are effectively "synsets".

The next set of goal-posts is to have disjuncts, of maybe low-medium
quality, and use these to extract ontologies.  e.g.
{sculptable-mass} is-a {mass} is-a {physical-thing} is-a {thing}

You can try to do this by clustering but there are probably better ways of
discovering ontology.
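
Once such is-a relations exist, using them is the easy part. A minimal Python sketch, assuming the relations are stored as a child-to-parent map with no cycles; the map holds only the one chain from the example above, and the function names are invented.

```python
# Child -> parent, taken from the example chain above.
IS_A = {
    "sculptable-mass": "mass",
    "mass": "physical-thing",
    "physical-thing": "thing",
}

def ancestors(category, is_a=IS_A):
    """Walk the is-a chain upward; assumes the map contains no cycles."""
    chain = []
    while category in is_a:
        category = is_a[category]
        chain.append(category)
    return chain

def is_a_kind_of(category, ancestor):
    """True if category equals ancestor or lies below it in the chain."""
    return category == ancestor or ancestor in ancestors(category)

print(ancestors("sculptable-mass"))
```

The hard part, discovering the map from disjuncts, is not shown here.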



> >Also -- no hint of any word-classes or part-of-speech tagging? This is
> surely important to evaluate as well, or is this to be done in some other
> way?  i.e. to evaluate if "Pivi" was correctly clustered with other given
> names?  Or that lama/llama was clustered with other four-legged animals?
>
> We don't have that in MST-Parsing, right? We need this corpus to assess
> the quality of the MST-Parsing so we don't need part-of-speech information
> for that.
>
But we know that MST parsing is shit.  Stop wasting time on MST or trying
to "improve" it. We already know that it is close to a high-entropy path to
structure; trying to squeeze a few more percent of entropy is not worth the
effort, not at this time.  Focus on finding a high-entropy structure
extraction algorithm, don't waste time on MST.

You should be focusing on extracting disjuncts, word-classes, word-senses,
and trying to improve the quality of those.  If you obtain a high-entropy
path to these structures, the quality of your parses will automatically
improve.  Focus on the entropy numbers. Try to maximize that.

> The clustering is able to do that anyway - see the graphs in the end of the
> last year report:
>
>
> https://docs.google.com/document/d/1gxl-hIqPQCYPb9NNkyA3sBYUyfwvJFvT1hZ5ZpXsaPc/edit#heading=h.twoiv52o0tou
>
> >Also -- I can't tell -- is it free of loops, or are loops allowed?
> Allowing loops tends to provide stronger, more accurate parses.  Loops act
> as constraints.
>
> The loops and crossing links are not allowed in the MST-Parser now. If we
> allow them in the test corpus, how could it make assessment of MST-Parses
> better?
>
> Note, that we ARE working with MST-Parses now - according to Ben's
> directions.
>

Not to say bad things about Ben, but I'm certain he has not actually
thought about this problem very much. He is very very busy doing other
things; he is not thinking about this stuff.  I have repeatedly tried to
explain the issues to him, and it's quite clear that he is far away from
understanding them, from working at the level that I would like to have you
and your team work at.

I'm trying to have you make small, quantified baby-steps, to verify the
accuracy of your methods and data.  What I'm seeing is that you are
attempting to make giant-steps, without verification, and then getting
low-quality results, without understanding the root causes for them.  You
can't dig yourself out of a ditch, and digging harder and more furiously
won't raise the accuracy of the parse results.

--linas

> We have your MST-Parser-less idea on the map but we are NOT trying it now:
>
> https://github.com/singnet/language-learning/issues/170
>
> We may try it after we explore the account for costs
>
> https://github.com/singnet/language-learning/issues/183
>
> Thanks,
>
> -Anton
> 24.03.2019 9:24, Linas Vepstas wrote:
>
> Also, BTW, link-grammar cannot parse "I just stood there, my hand on the
> knob, trembling like a leaf." correctly. It is one of a class of sentences
> it does not know about.  Which is maybe OK, because ideally, the learned
> grammar will be able to do this. But today, LG cannot.
>
> --linas
>
> On Sat, Mar 23, 2019 at 9:12 PM Linas Vepstas <[email protected]>
> wrote:
>
>> Anton,
>>
>> It's certainly an unusual corpus, and it might give you rather low
>> scores. I'd call it "interesting", but maybe not "golden". Although I
>> suppose it depends on your training corpus.  Here are some problems that
>> pop out:
>>
>> First sentence --
>> "the old beast was whinnying on his shoulder" -- the word "whinnying" is
>> a fairly rare English verb -- you could read half-a-million wikipedia
>> articles, and not see it once. You could read lots of 19th-century or
>> early-20th century cowboy/adventure novels, (like what you'd find on
>> Project Gutenberg) and maybe see it some fair amount. Even then -- to
>> "whinny on a shoulder" seems bizarre.. I guess he's hugging the horse? How
>> often does that happen, in any cowboy novel? "to whinny on something" is an
>> extremely rare construction.  It will work only if you've correctly
>> categorized "whinny" as a verb that can take a preposition.  Are your
>> clustering algos that good, yet, to correctly cluster rare words into
>> appropriate verb categories?
>>
>> Second sentence .. "Jims" is a very uncommon name. Frankly, I've never
>> heard of it as a name before.  Your training data is going to be extremely
>> slim on this. And lack of training data means poor statistics, which means
>> low scores.  Unless -- again, your clustering code is good enough to place
>> "Jims" in a "proper name" cluster...
>>
>> "the lama snuffed blandly" -- "snuffed" is a very uncommon, almost
>> archaic verb. These days, everyone spells llama with two ll's not one.
>> Unless you're talking about Buddhist monks, it's a typo.
>>
>> "you understand?"  is .. awkward. Common in speech, uncommon in writing.
>> Unlikely that you'll have enough training data for this.
>>
>> "Willard" is an uncommon name. Does your training corpus have a
>> sufficient number of mentions of Willard? Do you have clustering working
>> well enough to stick "Willard" into a cluster with other names?
>>
>> "it is so with Sammy Jay" is clearly archaic English.
>>
>> "he hasn't any relations here" is clearly archaic, an olde-fashioned
>> construction.
>>
>> "Pivi said not one word" - again, a clearly old-fashioned construction.
>> Does the training set contain enough examples of "Pivi" to recognize it as
>> a name? Are names clustering correctly?
>>
>> Any sentence with an inversion is going to sound old-fashioned. All of
>> the sentences in that corpus sound old-fashioned. Which maybe is OK if you
>> are training on 19th century Gutenberg texts .. but it's certainly not
>> modern English.  Even when I was a child, and I read those old
>> crumbly-yellow paper adventure books, part of the fun was that no one
>> actually talked that way -- not at school, not at home, not on TV. It was
>> clearly from a different time and place -- an adventure.
>>
>> Anyway -- you only indicate pair-wise word-links. Is the omission of
>> disjuncts intentional? Also -- no hint of any word-classes or
>> part-of-speech tagging? This is surely important to evaluate as well, or is
>> this to be done in some other way?  i.e. to evaluate if "Pivi" was
>> correctly clustered with other given names?  Or that lama/llama was
>> clustered with other four-legged animals?
>>
>> Also -- I can't tell -- is it free of loops, or are loops allowed?
>> Allowing loops tends to provide stronger, more accurate parses.  Loops act
>> as constraints.
>>
>> -- Linas
>>
>> On Thu, Mar 21, 2019 at 11:09 PM Anton Kolonin @ Gmail <
>> [email protected]> wrote:
>>
>>> Hi Linas, Andes and whoever understands LG and English well enough both.
>>>
>>> Attached are first 100 sentences for GC "gold standard" - manually
>>> checked based on LG parses.
>>>
>>> We are expecting more to come in the next two weeks.
>>>
>>> To enable that, please have cursory review of the corpus and let us know
>>> if there are corrections still needed so your corrections will be used as a
>>> reference to fix the rest and keep going further.
>>>
>>> Thank you,
>>>
>>> -Anton
>>>
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "lang-learn" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/lang-learn/bde76364-a578-4ab8-8ac5-2f49f794072b%40gmail.com
>>> <https://groups.google.com/d/msgid/lang-learn/bde76364-a578-4ab8-8ac5-2f49f794072b%40gmail.com?utm_medium=email&utm_source=footer>
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>
>> --
>> cassette tapes - analog TV - film cameras - you
>>
>
>
> --
> cassette tapes - analog TV - film cameras - you
>
> --
> -Anton Kolonin
> skype: akolonin
> cell: 
> [email protected]
> https://aigents.com
> https://www.youtube.com/aigents
> https://www.facebook.com/aigents
> https://medium.com/@aigents
> https://steemit.com/@aigents
> https://golos.blog/@aigents
> https://vk.com/aigents
>
>

-- 
cassette tapes - analog TV - film cameras - you

