On Mon, Apr 22, 2019 at 10:48 PM Ben Goertzel <b...@goertzel.org> wrote:

> ***
> Thank you!  This is fairly impressive: it says that if the algo heard
> a word five or more times, that was sufficient for it to deduce the
> correct grammatical form!
> ***
>
> Yes.   What we can see overall is that, with the current algorithms
> Anton's team is using: If we have "correct" unlabeled dependency
> parses, then we can infer "correct" parts-of-speech and POS-based
> grammatical rules... for words that occur often enough (5 times with
> current corpus and parameters)
>

Ah, well, hmm. It appears I had misunderstood. I did not realize that the
input was 100% correct but unlabelled parses. In that case, obtaining 100%
accuracy is NOT surprising; it's really just a proof that the code is
reasonably bug-free. Such proofs are good to have, but they are not
theoretically interesting. It's kind of like saying "we proved that our
radio telescope is pointed in the right direction", which is an important
step.


> So the problem of unsupervised grammar induction is, in this sense,
> reduced to the problem of getting correct-enough unlabeled dependency
> parses ...
>

Oh, not at all! Exactly the opposite!! Now that the telescope is pointed in
the right direction, what is the actual signal?

My claim is that this mechanism acts as an "amplifier" and a "noise filter"
-- that it can take low-quality MST parses as input, and still generate
high-quality results.  In fact, I make an even stronger claim: you can
throw *really low-quality data* at it -- something even worse than MST
parses -- and it will still return high-quality grammars.

This can be explicitly tested now: take the 100% perfect unlabelled
parses, and artificially introduce 1%, 5%, 10%, 20%, 30%, 40% and 50%
random errors into them. What is the accuracy of the learned grammar?  I
claim that you can introduce 30% errors and still learn a grammar with
greater than 80% accuracy.  I claim this, and I think it is a very
important point -- a key point -- but I cannot prove it.
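
To make this experiment concrete, here is a minimal sketch (in Python) of
the error-injection step. It assumes, purely for illustration, that each
parse is stored as a list of (head, dependent) word-index pairs with 0 as
the root; the real pipeline format will differ, so the name corrupt_parse
and this representation are hypothetical.

    import random

    def corrupt_parse(links, num_words, error_rate, rng=random):
        """Randomly rewire a fraction of the dependency links.

        links      -- list of (head, dependent) word-index pairs (0 = root)
        num_words  -- number of words in the sentence
        error_rate -- fraction of links to corrupt, e.g. 0.30
        """
        corrupted = []
        for head, dep in links:
            if rng.random() < error_rate:
                # Reattach the dependent to a randomly chosen wrong head.
                wrong_heads = [h for h in range(num_words + 1)
                               if h not in (head, dep)]
                if wrong_heads:
                    head = rng.choice(wrong_heads)
            corrupted.append((head, dep))
        return corrupted

    # Example: corrupt a "perfect" 4-word parse at a 30% error rate, then
    # feed the noisy parses to the grammar learner and re-measure PA/F1.
    perfect = [(0, 2), (2, 1), (2, 4), (4, 3)]
    noisy = corrupt_parse(perfect, num_words=4, error_rate=0.30)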

It is a somewhat delicate experiment -- the corpus has to be large enough.
If you introduce a 30% error rate into the unlabelled parses, then certain
rare words (seen 6 or fewer times) will have some of their occurrences
corrupted, reducing their effective count to 4 or less ... So the MWC
("minimum word count") would need to get larger as the error rate grows.
But if the MWC is large enough (maybe 5 or 10, less than 20) and the
corpus is large enough, then you should still get high-quality grammars
from low-quality inputs.
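
As a back-of-the-envelope check on that (my own scaling argument, not
measured data): with error rate p, a word seen N times keeps roughly
N*(1-p) correct sightings, so preserving an effective count of C requires
N >= C/(1-p). A tiny sketch, with required_mwc being a hypothetical helper:

    import math

    def required_mwc(effective_count, error_rate):
        """Smallest raw word count that still leaves `effective_count`
        correct sightings when a fraction `error_rate` of the parses
        touching the word is corrupted."""
        return math.ceil(effective_count / (1.0 - error_rate))

    # With 30% errors, keeping an effective count of 5 needs a raw MWC of 8,
    # while a word seen only 6 times drops to an effective count of about 4.
    print(required_mwc(5, 0.30))   # -> 8
    print(round(6 * (1 - 0.30)))   # -> 4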

-- Linas


> The current MST parser, on corpora of the sizes we have been able to
> feed it, does not produce correct-enough unlabeled dependency parses.
>  One thread of current research is to see if using info from modern
> DNN models, in place of simple mutual information, can cause an
> MST-type parser to produce correct-enough unlabeled dependency
> parses....  (where "correct" means agreement w/ human-expert
> grammatical judgments, in this case)
>
> ben
>
> On Tue, Apr 23, 2019 at 11:40 AM Linas Vepstas <linasveps...@gmail.com>
> wrote:
> >
> > Hi Anton,
> >
> > On Mon, Apr 15, 2019 at 11:18 AM Anton Kolonin @ Gmail <
> akolo...@gmail.com> wrote:
> >>
> >> Ben, Linas,
> >>
> >> Let me comment on the latest results, where LG-English parses are given
> >> as input to the Grammar Learner using the Identical Lexical Entries (ILE)
> >> algorithm and compared against the same input LG-English parses - for the
> >> Gutenberg Children corpus with direct speech taken off, using only
> >> complete LG-English parses for testing and training.
> >>
> >> MWC - Minimum Word Count: test only on the sentences where every
> >> word in the sentence occurs a given number of times or more.
> >>
> >> MSL - Maximum Sentence Length: test only on the sentences which
> >> have a given number of words or less.
> >>
> >> MWC(GT) MSL(GT) PA      F1
> >> 0       0        61.69%   0.65 - all input sentences are used for test
> >> 5       0       100.00%   1.00 - sentences with each word occurring 5+
> >> 10      0       100.00%   1.00 - sentences with each word occurring 10+
> >> 50      0       100.00%   1.00 - sentences with each word occurring 50+
> >> That is:
> >>
> >> 1) With words occurring 5 or more times, recall=1.0 and precision=1.0;
> >
> >
> > Thank you!  This is fairly impressive: it says that if the algo heard a
> word five or more times, that was sufficient for it to deduce the correct
> grammatical form!  This is something that is considered to be very
> important when people compare machine learning to human learning -- it is
> said that "humans can learn from very few examples and machines cannot",
> yet here we have an explicit demonstration of an algorithm that can learn
> perfect accuracy with only five examples!  I think that is absolutely
> awesome, and is the kind of news that can be shouted from the rooftops!
> It's kind of a "we did it! success!" story.
> >
> > The fact that the knee of the curve occurs at or below 5 is huge -- very
> very different than if it occurred at 50.
> >
> > However, just to be clear --- it would be very useful if you or Alexy
> provided examples of words that were seen only 2 or 3 times, and the kinds
> of sentences they appeared in.
> >
> >>
> >> 2) Shorter sentences provide better recall and precision.
> >>> 0       5        70.06%   0.72 - sentences of 5 words and shorter
> >>> 0       10       66.60%   0.69 - sentences of 10 words and shorter
> >>> 0       15       63.87%   0.67 - sentences of 15 words and shorter
> >>> 0       25       61.69%   0.65 - sentences of 25 words and shorter
> >
> >
> > This is meaningless - a nonsense statistic.  It just says "the algo
> encountered a word only once or twice or three times, and fails to use that
> word correctly in a long sentence. It also fails to use it correctly in a
> short sentence." Well, duhhh -- if I invented a brand new word you never
> heard of before, and gave you only one or two examples of using that word,
> of course, you would be lucky to have a 60% or 70% accuracy of using that
> word!!  The above four data-points are mostly useless and meaningless.
> >
> > --linas
> >
> >>
> >>
> >> Note:
> >>
> >> 1) The Identical Lexical Entries (ILE) algorithm is in fact "over-fitting",
> >> so there is still a way to go toward learning "generalized grammars";
> >> 2) The same kind of experiment is still to be done with MST-Parses, and
> >> the results are not expected to be that glorious, given what we know about
> >> the Pearson correlation between F1-s on different parses ;-)
> >>
> >> Definitions of PA and F1 are in the attached paper.
> >>
> >> Cheers,
> >> -Anton
> >>
> >>
> >> --------
> >>
> >>
> >> *Past Week:*
> >> 1. Provided data for GC for ALE and dILEd.
> >> 2. Fixed GT to allow parsing sentences starting with numbers in ULL mode.
> >> 3. Ended up with Issue #184, ran several tests for different corpora
> >> with different settings of MWC and MSL:
> >> - Nothing interesting for POC-English;
> >> - CDS seems to depend on the ratio of incompletely parsed sentences
> >> to completely parsed sentences in the corpus subset defined by the
> >> MWC/MSL restriction.
> >>
> http://langlearn.singularitynet.io/data/aglushchenko_parses/CDS-dILEd-MWC-MSL-2019-04-13/CDS-dILEd-MWC-MSL-2019-04-13-summary.txt
> >> - Much more reliable result is obtained on GC corpus with no direct
> speech.
> >>
> http://langlearn.singularitynet.io/data/aglushchenko_parses/GCB-NQ-dILEd-MWC-MSL-2019-04-13/GCB-NQ-dILEd-MWC-MSL-summary.txt
> >> 4. Small improvements to the pipeline code were made.
> >>
> >> *Next week:*
> >> 1. Resolve Issue #188
> >> 2. Resolve Issue #198
> >> 3. Resolve Issue #193
> >> 4. Pipeline improvements along the way.
> >>
> >> Alexey
> >>
> >
> >
> > --
> > cassette tapes - analog TV - film cameras - you
>
>
>
> --
> Ben Goertzel, PhD
> http://goertzel.org
>
> "Listen: This world is the lunatic's sphere,  /  Don't always agree
> it's real.  /  Even with my feet upon it / And the postman knowing my
> door / My address is somewhere else." -- Hafiz
>


-- 
cassette tapes - analog TV - film cameras - you
