On Wed, Apr 24, 2019 at 9:31 PM Anton Kolonin @ Gmail <akolo...@gmail.com>
wrote:

> Ben, Linas, here is full set of results generated by Alexey:
>
> Results update:
>

My gut intuition is that the most interesting numbers would be these:


> MWC(GT) MSL(GT) PA      F1
>
> 5       2
> 5       3
> 5       4
> 5       5
> 5       10
> 5       15
> 5       25
>
>
because I think that "5" gets you over the hump for the central limit
theorem.  But, per the earlier conversation: the disjuncts need to be weighted
by something, as otherwise you will get accuracy more-or-less exactly equal to
MST accuracy. Without weighting, you cannot improve on MST.   The weighting is
super-important to do, and discovering the best weighting scheme is one
major task (is it MI, surprisingness, something else?)
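To make that concrete, here is a rough sketch of what I have in mind for an
MI-based weight, computed from raw observation counts of (word, disjunct)
pairs (the function and argument names are mine, not anything in the existing
pipeline):

    from math import log2

    def mi_weight(word, disjunct, pair_count):
        """Pointwise MI of a (word, disjunct) pair, given a dict mapping
        (word, disjunct) -> observation count.  MI = log2(P(w,d)/(P(w)P(d)))."""
        total = sum(pair_count.values())
        n_wd = pair_count[(word, disjunct)]
        n_w = sum(c for (w, _), c in pair_count.items() if w == word)
        n_d = sum(c for (_, d), c in pair_count.items() if d == disjunct)
        return log2(n_wd * total / (n_w * n_d))

Surprisingness, or any other score, could be swapped in at the same point; the
essential thing is that aggregation and clustering see weighted counts rather
than raw ones.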


> I just thought that "Currently, the Identical Lexical Entries (ILE) algorithm
> builds single-germ/multi-disjunct lexical entries (LE) first, and then
> aggregates identical ones based on unique combinations of disjuncts" is
> sufficient.
>
OK, so, by "lexical entry", I guess you mean "a single word-disjunct
pair", where the disjunct connectors have not been clustered? If so, then yes:
if they are identical, you should add together the observation
counts.  (It's important to keep track of observation counts; they are
needed for computing MI.)
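As a minimal sketch of that aggregation (assuming a disjunct is hashable,
e.g. a tuple of connectors, so that identical pairs collide on the same key;
the names are illustrative, not the ILE code):

    from collections import Counter

    def merge_identical_entries(entries):
        """Merge identical (word, disjunct) lexical entries by summing their
        observation counts; the summed counts are the input to the MI step."""
        merged = Counter()
        for word, disjunct, count in entries:
            merged[(word, disjunct)] += count
        return merged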

Note that, in principle, a "lexical entry" could also be a
(grammatical-class, disjunct) pair, or a (word, disjunct-class)
pair, or a (grammatical-class, disjunct-class) pair, where a
"grammatical-class" is a cluster, and a "disjunct-class" is a disjunct whose
connectors point to such classes (instead of to individual words).  And
please note: what I mean by "disjunct class" might not be the same thing
as what you think it means, and so, without a lot of extra explanation, it
gets confusing again.
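Just to pin the four combinations down, an illustrative type sketch (my own
representation, purely for discussion):

    from typing import FrozenSet, Tuple, Union

    Word = str
    WordClass = FrozenSet[Word]                     # a grammatical class (cluster of words)
    Connector = Tuple[Union[Word, WordClass], str]  # (target, '+' or '-' direction)
    Disjunct = Tuple[Connector, ...]                # ordinary disjunct: targets are words
    # A "disjunct-class" has the same shape, except its connector targets are
    # WordClass instead of Word.  The four kinds of lexical entry are then
    # (Word, Disjunct), (WordClass, Disjunct), (Word, disjunct-class) and
    # (WordClass, disjunct-class).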

At any rate, if you keep the clusters and aggregates in the atomspace, then
the "matrix" code can compute MI's for them all.  Else, you have to
redesign that from scratch.

Side note: one reason I wanted everything in the atomspace was so that I
could apply the same class of algorithms -- computing MI, joining collections
of atoms into networks, MST-like, then clustering, then recomputing MI again,
and so on -- and leverage that to obtain synonyms, word-senses, synonymous
phrases, pronoun referents, etc., all without needing a total
redesign.  The idea is to treat networks generically: not just networks of
words, but networks of anything, expressed as atoms.
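In pseudo-Python, the loop I have in mind looks roughly like this; the stage
functions are passed in as parameters, since nothing about the loop itself is
specific to words (a sketch of the idea, not the actual atomspace code):

    def learn_structure(items, count_pairs, compute_mi, mst_parse,
                        cluster, relabel, n_rounds=3):
        """Generic structure learning over a network of atoms of any kind:
        count pairs -> compute MI -> MST-like parsing -> clustering ->
        re-express the data over the clusters, then repeat."""
        grammar = None
        for _ in range(n_rounds):
            counts = count_pairs(items)        # co-occurrence counts over pairs
            mi = compute_mi(counts)            # pairwise MI from those counts
            parses = mst_parse(items, mi)      # maximum-spanning-tree step
            grammar = cluster(parses)          # merge similar entries into classes
            items = relabel(items, grammar)    # replace items by their classes
        return grammar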

--linas

> In the meantime, it is in the code:
>
> https://github.com/singnet/language-learning/blob/master/src/grammar_learner/clustering.py#L276
>
> Cheers,
>
> -Anton
>
>
> On 23.04.2019 16:54, Ben Goertzel wrote:
>
> On Mon, Apr 22, 2019 at 11:18 PM Anton Kolonin @ Gmail <akolo...@gmail.com> wrote:
>
> We are going to repeat the same experiment with MST-Parses during this week.
>
> The much more interesting experiment is to see what happens when you give it
> a known percentage of intentionally-bad unlabelled parses. I claim that this
> step provides natural error reduction and error correction, but I don't know
> how much.
>
> If we assume, roughly, that "insufficient data" has a similar effect to
> "noisy data", then the effect of adding intentionally-bad parses may
> be similar to the effect of having insufficient examples of the words
> involved... which we already know from Anton's experiments.   Accuracy
> degrades smoothly but steeply as the number of examples decreases below
> adequacy.
>
> ***
> My claim is that this mechanism acts as an "amplifier" and a "noise
> filter" -- that it can take low-quality MST parses as input,  and
> still generate high-quality results.   In fact, I make an even
> stronger claim: you can throw *really low quality data* at it --
> something even worse than MST, and it will still return high-quality
> grammars.
>
> This can be explicitly tested now:  Take the 100% perfect unlabelled
> parses, and artificially introduce 1%, 5%, 10%, 20%, 30%, 40% and 50%
> random errors into them. What is the accuracy of the learned grammar?  I
> claim that you can introduce 30% errors and still learn a grammar
> with greater than 80% accuracy.  I claim this, and I think it is a very
> important point -- a key point -- but I cannot prove it.
> ***
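
(For concreteness, one way to run this, sketched with made-up helper names;
a parse is taken to be a set of (i, j) word-index links over a sentence:)

    import random

    def corrupt_parse(links, n_words, error_rate, rng=random):
        """Replace a fraction of the gold-standard links with random links."""
        links = list(links)
        n_bad = round(error_rate * len(links))
        for k in rng.sample(range(len(links)), n_bad):
            i, j = rng.sample(range(n_words), 2)
            links[k] = (min(i, j), max(i, j))
        return set(links)

    # for rate in (0.01, 0.05, 0.10, 0.20, 0.30, 0.40, 0.50):
    #     corrupted = [(sent, corrupt_parse(p, len(sent), rate))
    #                  for sent, p in gold_parses]   # gold_parses: hypothetical
    #     ... feed `corrupted` to the grammar learner and measure PA/F1 ...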
>
> Hmmm.   So I am pretty sure you are right given enough data.
>
> However, whether this is true given the magnitudes of data we are now
> looking at (the Gutenberg Childrens Corpus, for example) is less clear to
> me.
>
> Also the current MST parses are much worse than "30% errors" compared
> to correct parses.   So even if what you say is correct, it doesn't
> remove the need to improve the MST parses...
>
> But you are right -- this will be an interesting and important set of
> experiments to run.   Anton, I suggest you add it to the to-do list...
>
> -- Ben

