***
Ah, well, hmm. It appears I had misunderstood. I did not realize that
the input was 100% correct but unlabelled parses. In this case,
obtaining 100% accuracy is NOT surprising, it's actually just a proof
that the code is reasonably bug-free.
***

 It's a proof that the algorithms embodied in this portion of the code
are actually up to the task, not just a proof that the code is
relatively bug-free (except in a broad sense of "bug" as "algorithm
that doesn't fulfill the intended goals").

(I know you understand this, I'm just clarifying for the rest of the
audience...)

***
 Such proofs are good to have, but they're not theoretically interesting.
***

I think it's theoretically somewhat interesting, because there are a
lot of possible ways to do clustering and grammar rule learning, and
now we know a specific combination of clustering algorithm and grammar
rule learning algorithm that actually works (if the input dependency
parses are good).

But it's not yet the conceptual breakthrough we are chasing...

***
Its kind of like saying "we proved that our radio telescope is pointed
in the right direction".  Which is an important step.
***

I think it's more like saying "Yay! our telescope works and is pointed
in the right direction"  ;-) ....

But yeah, it means a bunch of the "more straightforward" parts of the
grammar-induction task are working now, so all we have to do is
finally solve the harder part, i.e. making decent unlabeled dependency
trees in an unsupervised way.

Of course one option is that this clustering/rule-learning process is
part of a feedback loop that produces said decent unlabeled
dependency trees.

Then the approach would be

-- shitty MST parses
-- shitty inferred grammar
-- use shitty inferred grammar to get slightly less shitty parses
-- use slightly less shitty parses to get slightly less shitty inferred grammar
-- etc. until most of the shit disappears and you're left with just
the same level of shit as in natural language...

Another option is to use DNNs to get nicer parses and just do

-- nice MST parses guided by DNNs
-- nice inferred grammar from these parses

Maybe what will actually work is more like

-- semi-shitty MST parses guided by DNNs
-- semi-shitty inferred grammar
-- use semi-shitty inferred grammar together with DNNs to get less
shitty parses
-- use less shitty parses to get even less shitty inferred grammar
-- etc. until most of the shit disappears and you're left with just
the same level of shit as in natural language...
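
To make the loop concrete, here is a minimal Python sketch of the
bootstrap idea above. The function names (mst_parse, induce_grammar,
parse_with_grammar) are hypothetical stand-ins for the three stages
(initial MST parsing, clustering plus rule learning, and re-parsing
the corpus with the induced grammar); none of them are calls that
exist in the current pipeline:

def bootstrap(corpus, mst_parse, induce_grammar, parse_with_grammar,
              rounds=5):
    # Round 0: low-quality ("shitty") MST parses
    parses = mst_parse(corpus)
    grammar = None
    for _ in range(rounds):
        # Learn a grammar from whatever parses we currently have
        grammar = induce_grammar(parses)
        # Re-parse the corpus with that grammar, hopefully a bit better
        parses = parse_with_grammar(corpus, grammar)
    return grammar, parses

In the DNN-guided variants, mst_parse and parse_with_grammar would
simply take the DNN-derived link scores into account as well; the loop
structure stays the same.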


.. ben

On Tue, Apr 23, 2019 at 12:37 PM Linas Vepstas <linasveps...@gmail.com> wrote:
>
>
>
> On Mon, Apr 22, 2019 at 10:48 PM Ben Goertzel <b...@goertzel.org> wrote:
>>
>> ***
>> Thank you!  This is fairly impressive: it says that if the algo heard
>> a word five or more times, that was sufficient for it to deduce the
>> correct grammatical form!
>> ***
>>
>> Yes.   What we can see overall is that, with the current algorithms
>> Anton's team is using: If we have "correct" unlabeled dependency
>> parses, then we can infer "correct" parts-of-speech and POS-based
>> grammatical rules... for words that occur often enough (5 times with
>> current corpus and parameters)
>
>
> Ah, well, hmm. It appears I had misunderstood. I did not realize that the 
> input was 100% correct but unlabelled parses. In this case, obtaining 100% 
> accuracy is NOT surprising, it's actually just a proof that the code is 
> reasonably bug-free. Such proofs are good to have, but they're not 
> theoretically interesting. It's kind of like saying "we proved that our radio 
> telescope is pointed in the right direction".  Which is an important step.
>
>>
>> So the problem of unsupervised grammar induction is, in this sense,
>> reduced to the problem of getting correct-enough unlabeled dependency
>> parses ...
>
>
> Oh, not at all! Exactly the opposite!! Now that the telescope is pointed in 
> the right direction, what is the actual signal?
>
> My claim is that this mechanism acts as an "amplifier" and a "noise filter" 
> -- that it can take low-quality MST parses as input,  and still generate 
> high-quality results.   In fact, I make an even stronger claim: you can throw 
> *really low quality data* at it -- something even worse than MST, and it will 
> still return high-quality grammars.
>
> This can be explicitly tested now:  Take the 100% perfect unlabelled parses, 
> and artificially introduce 1%, 5%, 10%, 20%, 30%, 40% and 50% random errors 
> into them. What is the accuracy of the learned grammar?  I claim that you can 
> introduce 30% errors, and still learn a grammar with greater than 80% 
> accuracy.  I claim this, and I think it is a very important point -- a key 
> point -- but I cannot prove it.
>
> It is a somewhat delicate experiment -- the corpus has to be large enough.  
> If you introduce a 30% error rate into the unlabelled parses, then certain 
> rare words (seen 6 or fewer times) will be used incorrectly, reducing the 
> effective count to 4 or less ... So the MWC "minimum word count" would need 
> to get larger, the greater the number of errors.  But if the MWC is large 
> enough (maybe 5 or 10, less than 20) and the corpus is large enough, then you 
> should still get high-quality grammars from low-quality inputs.
>
> -- Linas
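
Below is a minimal sketch of the noise-injection test Linas proposes
above, assuming (hypothetically) that a parse is represented as a list
of (head_index, dependent_index) links for one sentence; corrupt_parse,
gold_parses and the induce_grammar step are illustrative names, not
part of the existing pipeline:

import random

def corrupt_parse(parse, error_rate):
    """Crudely rewire the head of each dependency link to a random
    word in the sentence with probability error_rate."""
    words = sorted({i for link in parse for i in link})
    corrupted = []
    for head, dep in parse:
        if random.random() < error_rate:
            head = random.choice(words)
        corrupted.append((head, dep))
    return corrupted

# e.g. corrupt 30% of the links in every gold parse, then re-run
# grammar induction and measure PA / F1 against the clean result:
#   noisy_parses = [corrupt_parse(p, 0.30) for p in gold_parses]
#   grammar = induce_grammar(noisy_parses)
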
>
>>
>> The current MST parser, on corpora of the sizes we have been able to
>> feed it, does not produce correct-enough unlabeled dependency parses.
>>  One thread of current research is to see if using info from modern
>> DNN models, in place of simple mutual information, can cause an
>> MST-type parser to produce correct-enough unlabeled dependency
>> parses....  (where "correct" means agreement w/ human-expert
>> grammatical judgments, in this case)
>>
>> ben
>>
>> On Tue, Apr 23, 2019 at 11:40 AM Linas Vepstas <linasveps...@gmail.com> 
>> wrote:
>> >
>> > Hi Anton,
>> >
>> > On Mon, Apr 15, 2019 at 11:18 AM Anton Kolonin @ Gmail 
>> > <akolo...@gmail.com> wrote:
>> >>
>> >> Ben, Linas,
>> >>
>> >> Let me comment on the latest results, where LG-English parses are given as
>> >> input to the Grammar Learner using the Identical Lexical Entries (ILE)
>> >> algorithm and compared against the same input LG-English parses, for the
>> >> Gutenberg Children corpus with direct speech taken off, using only
>> >> complete LG-English parses for testing and training.
>> >>
>> >> MWC - Minimum Word Count, so test only on the sentences where every
>> >> word in the sentence occurs a given number of times or more.
>> >>
>> >> MSL - Maximum Sentence Length, so test only on the sentences which
>> >> have a given number of words or less.
>> >>
>> >> MWC(GT) MSL(GT) PA      F1
>> >> 0       0        61.69%   0.65 - all input sentences are used for test
>> >> 5       0       100.00%   1.00 - sentences with each word occurring 5+
>> >> 10      0       100.00%   1.00 - sentences with each word occurring 10+
>> >> 50      0       100.00%   1.00 - sentences with each word occurring 50+
>> >> That is:
>> >>
>> >> 1) With words occurring 5 or more times, recall=1.0 and precision=1.0;
>> >
>> >
>> > Thank you!  This is fairly impressive: it says that if the algo heard a 
>> > word five or more times, that was sufficient for it to deduce the correct 
>> > grammatical form!  This is something that is considered to be very 
>> > important when people compare machine learning to human learning -- it is 
>> > said that "humans can learn from very few examples and machines cannot", 
>> > yet here we have an explicit demonstration of an algorithm that can learn 
>> > with perfect accuracy from only five examples!  I think that is absolutely 
>> > awesome, and is the kind of news that can be shouted from the rooftops! 
>> > It's kind of a "we did it! success!" kind of story.
>> >
>> > The fact that the knee of the curve occurs at or below 5 is huge -- very 
>> > very different than if it occurred at 50.
>> >
>> > However, just to be clear --- it would be very useful if you or Alexy 
>> > provided examples of words that were seen only 2 or 3 times, and the kinds 
>> > of sentences they appeared in.
>> >
>> >>
>> >> 2) Shorter sentences provide better recall and precision.
>> >>>
>> >>>
>> >>> 0       5        70.06%   0.72 - sentences of 5 words and shorter
>> >>>
>> >>> 0       10       66.60%   0.69 - sentences of 10 words and shorter
>> >>>
>> >>> 0       15       63.87%   0.67 - sentences of 15 words and shorter
>> >>>
>> >>> 0       25       61.69%   0.65 - sentences of 25 words and shorter
>> >
>> >
>> > This is meaningless - a nonsense statistic.  It just says "the algo 
>> > encountered a word only once or twice or three times, and fails to use 
>> > that word correctly in a long sentence. It also fails to use it correctly 
>> > in a short sentence." Well, duhhh -- if I invented a brand new word you 
>> > never heard of before, and gave you only one or two examples of using that 
>> > word, of course, you would be lucky to have a 60% or 70% accuracy of using 
>> > that word!!  The above four data-points are mostly useless and meaningless.
>> >
>> > --linas
>> >
>> >>
>> >>
>> >> Note:
>> >>
>> >> 1) The Identical Lexical Entries (ILE) algorithm is in fact "over-fitting",
>> >> so there is still a way to go toward being able to learn "generalized grammars";
>> >> 2) The same kind of experiment is still to be done with MST parses, and
>> >> the results are not expected to be that glorious, given what we know about
>> >> the Pearson correlation between F1 scores on different parses ;-)
>> >>
>> >> Definitions of PA and F1 are in the attached paper.
>> >>
>> >> Cheers,
>> >> -Anton
>> >>
>> >>
>> >> --------
>> >>
>> >>
>> >> *Past Week:*
>> >> 1. Provided data for GC for ALE and dILEd.
>> >> 2. Fixed GT to allow parsing sentences starting with numbers in ULL mode.
>> >> 3. Ended up with Issue #184, ran several tests for different corpora
>> >> with different settings of MWC and MSL:
>> >> - Nothing interesting for POC-English;
>> >> - CDS seems to be dependent on the ratio of the number of incompletely parsed
>> >> sentences to the number of completely parsed sentences which make up the corpus
>> >> subset defined by the MWC/MSL restriction.
>> >> http://langlearn.singularitynet.io/data/aglushchenko_parses/CDS-dILEd-MWC-MSL-2019-04-13/CDS-dILEd-MWC-MSL-2019-04-13-summary.txt
>> >> - Much more reliable result is obtained on GC corpus with no direct 
>> >> speech.
>> >> http://langlearn.singularitynet.io/data/aglushchenko_parses/GCB-NQ-dILEd-MWC-MSL-2019-04-13/GCB-NQ-dILEd-MWC-MSL-summary.txt
>> >> 4. Small improvements to the pipeline code were made.
>> >>
>> >> *Next week:*
>> >> 1. Resolve Issue #188
>> >> 2. Resolve Issue #198
>> >> 3. Resolve Issue #193
>> >> 4. Pipeline improvements along the way.
>> >>
>> >> Alexey
>> >>
>> >
>> >
>> > --
>> > cassette tapes - analog TV - film cameras - you
>>
>>
>>
>> --
>> Ben Goertzel, PhD
>> http://goertzel.org
>>
>> "Listen: This world is the lunatic's sphere,  /  Don't always agree
>> it's real.  /  Even with my feet upon it / And the postman knowing my
>> door / My address is somewhere else." -- Hafiz
>
>
>
> --
> cassette tapes - analog TV - film cameras - you
>



-- 
Ben Goertzel, PhD
http://goertzel.org

"Listen: This world is the lunatic's sphere,  /  Don't always agree
it's real.  /  Even with my feet upon it / And the postman knowing my
door / My address is somewhere else." -- Hafiz
