***
Thank you!  This is fairly impressive: it says that if the algo heard
a word five or more times, that was sufficient for it to deduce the
correct grammatical form!
***

Yes.  What we can see overall is that, with the current algorithms
Anton's team is using: if we have "correct" unlabeled dependency
parses, then we can infer "correct" parts-of-speech and POS-based
grammatical rules... for words that occur often enough (5 times with
the current corpus and parameters).
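To make the idea concrete, here is a toy sketch of grouping words by identical link signatures -- roughly the intuition behind the "Identical Lexical Entries" approach mentioned below. The parses and words are invented for illustration; this is not Anton's actual implementation.

```python
from collections import defaultdict

# Toy sketch: infer word classes from unlabeled dependency parses by
# grouping words that occur with identical sets of links.  Each parse
# is a list of (head, dependent) links; the sentences are invented.
parses = [
    [("saw", "a"), ("saw", "dog")],
    [("saw", "a"), ("saw", "cat")],
    [("heard", "a"), ("heard", "dog")],
]

signatures = defaultdict(set)
for parse in parses:
    for head, dep in parse:
        signatures[head].add(("down", dep))  # head links down to dependent
        signatures[dep].add(("up", head))    # dependent links up to head

# Words with identical link signatures share a lexical entry / word class.
classes = defaultdict(list)
for word, sig in signatures.items():
    classes[frozenset(sig)].append(word)

for members in classes.values():
    print(sorted(members))
```

Note how "a" and "dog" land in the same class here, because they occur with exactly the same links -- which also illustrates why this kind of grouping over-fits on small data.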

So the problem of unsupervised grammar induction is, in this sense,
reduced to the problem of getting correct-enough unlabeled dependency
parses ...

The current MST parser, on corpora of the sizes we have been able to
feed it, does not produce correct-enough unlabeled dependency parses.
One thread of current research is to see whether using info from modern
DNN models, in place of simple mutual information, can get an
MST-type parser to produce correct-enough unlabeled dependency
parses....  (where "correct" means agreement with human-expert
grammatical judgments, in this case)
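For readers new to the thread, here is a minimal sketch of the MI-based MST idea: score word pairs by pointwise mutual information estimated from co-occurrence counts, then take the maximum spanning tree over a sentence as its unlabeled parse. The corpus, counts, and normalization are invented for illustration and are much cruder than the real pipeline.

```python
import math
from itertools import combinations

# Invented toy corpus; in the real pipeline counts come from a large corpus.
corpus = [
    "the dog chased the cat".split(),
    "the cat saw the dog".split(),
    "a dog saw a cat".split(),
]

word_count, pair_count, total = {}, {}, 0
for sent in corpus:
    for w in sent:
        word_count[w] = word_count.get(w, 0) + 1
        total += 1
    for a, b in combinations(sent, 2):
        key = tuple(sorted((a, b)))
        pair_count[key] = pair_count.get(key, 0) + 1

def pmi(a, b):
    """Rough pointwise-mutual-information score for a word pair."""
    key = tuple(sorted((a, b)))
    if key not in pair_count:
        return float("-inf")
    return math.log(pair_count[key] * total / (word_count[a] * word_count[b]))

def mst_parse(sentence):
    """Prim's algorithm: grow a spanning tree maximizing total PMI."""
    in_tree, edges = {0}, []
    while len(in_tree) < len(sentence):
        best = max(
            ((i, j) for i in in_tree
             for j in range(len(sentence)) if j not in in_tree),
            key=lambda e: pmi(sentence[e[0]], sentence[e[1]]),
        )
        edges.append(best)
        in_tree.add(best[1])
    return edges

print(mst_parse("the dog saw a cat".split()))
```

The research question above is then: what happens if the `pmi` scores are replaced by pair scores derived from a modern DNN language model?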

ben

On Tue, Apr 23, 2019 at 11:40 AM Linas Vepstas <[email protected]> wrote:
>
> Hi Anton,
>
> On Mon, Apr 15, 2019 at 11:18 AM Anton Kolonin @ Gmail <[email protected]> 
> wrote:
>>
>> Ben, Linas,
>>
>> Let me comment on the latest results, where LG-English parses are given
>> as input to the Grammar Learner using the Identical Lexical Entries (ILE)
>> algorithm and compared against the same input LG-English parses - for
>> the Gutenberg Children corpus with direct speech taken off, using only
>> complete LG-English parses for testing and training.
>>
>> MWC - Minimum Word Count: test only on the sentences where every
>> word in the sentence occurs the given number of times or more.
>>
>> MSL - Maximum Sentence Length: test only on the sentences which
>> have the given number of words or fewer.
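[BG: The two restrictions can be sketched as a simple corpus filter -- an illustration of the definitions above, not the actual pipeline code:]

```python
from collections import Counter

def filter_sentences(sentences, mwc=0, msl=0):
    """Keep sentences whose every word occurs >= mwc times in the
    whole corpus (MWC) and whose length is <= msl words (MSL).
    A threshold of 0 disables that restriction."""
    counts = Counter(w for s in sentences for w in s)
    kept = []
    for s in sentences:
        if msl and len(s) > msl:
            continue
        if mwc and any(counts[w] < mwc for w in s):
            continue
        kept.append(s)
    return kept

corpus = [
    "the dog ran".split(),
    "the dog ran home".split(),
    "a rare word appeared".split(),
]
print(filter_sentences(corpus, mwc=2, msl=3))
# -> [['the', 'dog', 'ran']]
```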
>>
>> MWC(GT) MSL(GT) PA      F1
>> 0       0        61.69%   0.65 - all input sentences are used for test
>> 5       0       100.00%   1.00 - sentences with each word occurring 5+
>> 10      0       100.00%   1.00 - sentences with each word occurring 10+
>> 50      0       100.00%   1.00 - sentences with each word occurring 50+
>> That is:
>>
>> 1) With words occurring 5 or more times, recall=1.0 and precision=1.0;
>
>
> Thank you!  This is fairly impressive: it says that if the algo heard a word 
> five or more times, that was sufficient for it to deduce the correct 
> grammatical form!  This is something that is considered to be very important 
> when people compare machine learning to human learning -- it is said that 
> "humans can learn from very few examples and machines cannot", yet here we 
> have an explicit demonstration of an algorithm that can reach perfect 
> accuracy with only five examples!  I think that is absolutely awesome, and is 
> the kind of news that can be shouted from the rooftops!  It's kind of a "we 
> did it! success!" story.
>
> The fact that the knee of the curve occurs at or below 5 is huge -- very very 
> different than if it occurred at 50.
>
> However, just to be clear --- it would be very useful if you or Alexy 
> provided examples of words that were seen only 2 or 3 times, and the kinds of 
> sentences they appeared in.
>
>>
>> 2) Shorter sentences provide better recall and precision.
>>>
>>>
>>> 0       5        70.06%   0.72 - sentences of 5 words and shorter
>>>
>>> 0       10       66.60%   0.69 - sentences of 10 words and shorter
>>>
>>> 0       15       63.87%   0.67 - sentences of 15 words and shorter
>>>
>>> 0       25       61.69%   0.65 - sentences of 25 words and shorter
>
>
> This is meaningless - a nonsense statistic.  It just says "the algo 
> encountered a word only once or twice or three times, and fails to use that 
> word correctly in a long sentence. It also fails to use it correctly in a 
> short sentence." Well, duhhh -- if I invented a brand-new word you never 
> heard of before, and gave you only one or two examples of using that word, of 
> course you would be lucky to reach 60% or 70% accuracy in using that word! 
> The above four data-points are mostly useless and meaningless.
>
> --linas
>
>>
>>
>> Note:
>>
>> 1) The Identical Lexical Entries (ILE) algorithm is "over-fitting", in fact,
>> so there is still a way to go toward learning "generalized grammars";
>> 2) The same kind of experiment is still to be done with MST parses, and
>> the results are not expected to be that glorious, given what we know about
>> the Pearson correlation between F1-s on different parses ;-)
>>
>> Definitions of PA and F1 are in the attached paper.
>>
>> Cheers,
>> -Anton
>>
>>
>> --------
>>
>>
>> *Past Week:*
>> 1. Provided data for GC for ALE and dILEd.
>> 2. Fixed GT to allow parsing sentences starting with numbers in ULL mode.
>> 3. Ended up with Issue #184, ran several tests for different corpora
>> with different settings of MWC and MSL:
>> - Nothing interesting for POC-English;
>> - CDS seems to be dependent on the ratio of the number of incompletely
>> parsed sentences to the number of completely parsed sentences which make
>> up the corpus subset defined by the MWC/MSL restriction.
>> http://langlearn.singularitynet.io/data/aglushchenko_parses/CDS-dILEd-MWC-MSL-2019-04-13/CDS-dILEd-MWC-MSL-2019-04-13-summary.txt
>> - Much more reliable result is obtained on GC corpus with no direct speech.
>> http://langlearn.singularitynet.io/data/aglushchenko_parses/GCB-NQ-dILEd-MWC-MSL-2019-04-13/GCB-NQ-dILEd-MWC-MSL-summary.txt
>> 4. Small improvements to the pipeline code were made.
>>
>> *Next week:*
>> 1. Resolve Issue #188
>> 2. Resolve Issue #198
>> 3. Resolve Issue #193
>> 4. Pipeline improvements along the way.
>>
>> Alexey
>>
>
>
> --
> cassette tapes - analog TV - film cameras - you



-- 
Ben Goertzel, PhD
http://goertzel.org

"Listen: This world is the lunatic's sphere,  /  Don't always agree
it's real.  /  Even with my feet upon it / And the postman knowing my
door / My address is somewhere else." -- Hafiz

-- 
You received this message because you are subscribed to the Google Groups 
"opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/opencog.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/opencog/CACYTDBddoSXfceOu2MUSgmShzJAhhTF4%3DyTBwzULsbZsO4mEaQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.