On Mon, Apr 15, 2019 at 9:02 PM Anton Kolonin @ Gmail <[email protected]> wrote:
> Ben,
>
>> I'd be curious to see some examples of the sentences used in
>>
>> ***
>> 5  0 100.00% 1.00 - sentences with each word occurring 5+
>> 10 0 100.00% 1.00 - sentences with each word occurring 10+
>> 50 0 100.00% 1.00 - sentences with each word occurring 50+
>> ***
>
> Alexey, please provide.
>
>> So if I understand right, you're doing grammar inference here, but
>> using link parses (with the hand-coded English grammar) as data ...
>> right? So it's a test of how well the grammar inference methodology
>> works if one has a rather good set of dependency linkages to work
>> with ...?
>
> Yes.

Oh. Well, that is not at all what I thought you were describing in the
earlier emails. If you have perfect parses to begin with, then
extracting dependencies from perfect parses is ... well, not exactly
trivial, but also not hard. So getting 100% accuracy is actually a kind
of unit test; it proves that your code does not have any bugs in it.

> 2) To which extent "the best of MST" parses will be worse than what we
> have above (in progress)
>
> 3) If we can get quality of "the best of MST" parses close to that
> (DNN-MI-lking, etc.)

What does "the best of MST" mean? The goal is to use MST-provided
parses, discard all words/sentences in which a word occurs fewer than N
times, and see what the result is. I am still expecting a knee at N=5,
not at N=50.

> 4) If we can learn grammar in a more generalized way (hundreds of
> rules instead of thousands)

The size of your grammar depends strongly on the size of your
vocabulary. For a child's corpus, I think it is "impossible" to get an
accurate grammar with fewer than 800 or 1000 rules. The current English
LG dictionary has approximately 8K rules. I do not have a good way of
estimating a "reasonable" dictionary size. Again: Zipf's law means that
only a small number of rules are used frequently, and that 3/4ths of
all rules are used to handle corner cases.
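The frequency-threshold filtering described above can be sketched roughly as follows. This is a toy illustration with invented sentences, not the project's actual pipeline code: drop every sentence containing any word that occurs fewer than N times in the whole corpus, and watch how much of the corpus survives as N grows.

```python
# Toy sketch of frequency-threshold filtering (hypothetical data).
from collections import Counter

def filter_sentences(sentences, n):
    """Keep only sentences whose every word occurs n+ times corpus-wide."""
    counts = Counter(w for s in sentences for w in s.split())
    return [s for s in sentences if all(counts[w] >= n for w in s.split())]

corpus = [
    "the cat sat",
    "the dog sat",
    "the cat ran",
    "a bird flew",
]
for n in (1, 2, 3):
    kept = filter_sentences(corpus, n)
    print(n, len(kept), "of", len(corpus), "sentences kept")
```

Plotting the kept fraction against N on a real corpus is where one would look for the knee.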
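To make the Zipf's-law point concrete, here is a toy sketch. The rule count, the total number of firings, and the pure 1/rank law are all invented for illustration (nothing here is measured from a corpus, and a learned grammar need not follow an exact 1/rank curve); the point is only that under such a distribution a small minority of rules account for most firings while the long tail fires only a handful of times each.

```python
# Toy Zipf-distributed rule usage (all numbers hypothetical).
N_RULES = 1000
TOTAL_FIRINGS = 5000

# Give rule at rank r a weight ~ 1/r, then scale to the firing budget.
weights = [1.0 / rank for rank in range(1, N_RULES + 1)]
scale = TOTAL_FIRINGS / sum(weights)
firings = [max(1, round(w * scale)) for w in weights]  # every rule seen at least once

frequent = sum(1 for f in firings if f >= 5)
rare = N_RULES - frequent
print(f"{frequent} rules fire 5+ times; {rare} fire only 1-4 times")
```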
To be clear: for the child's corpus, if you learned 1000 rules total,
then I would expect that 250 rules would be triggered 5 or more times,
while the remaining 750 rules would trigger only 1, 2, 3, or 4 times.
That is my guess. Actually creating this graph, and seeing what it
looks like -- that would be very interesting. It would reveal something
important about language.

Zipf's law says something very important: that, hiding behind an
apparent regularity, the exceptions and corner cases are frequent and
common. I expect this to hold for the learned grammars. What it means,
in practice, is that the size of your grammar is determined by the size
of your training set -- specifically, by the integral under the curve,
from 1 or more observations of a word.

-- Linas

--
cassette tapes - analog TV - film cameras - you

--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/opencog.
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/CAHrUA36P9P2DnAPrViCoZX3c4RhjWQTb0iwbXJmptde5oiap-w%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
