Hi Anton,

On Sat, Jun 22, 2019 at 2:32 AM Anton Kolonin @ Gmail <[email protected]> wrote:
> CAUTION: *** parses in the folder with dict files are not the inputs, but
> outputs - they are produced on the basis of the grammar in the same
> folder; I am listing the input parses below!!! ***

I did not look at either your inputs or your outputs; they are irrelevant
for my purposes. It is enough for me to know that you trained on some texts
from Project Gutenberg. When I evaluate the quality of your dictionaries, I
do not use your inputs, your outputs, or your software; I have an
independent tool for evaluating your dictionaries.

It would be very useful if you kept track of how many word-pairs were
counted during training. There are two important statistics to track: the
number of unique word-pairs, and the total number observed, with
multiplicity. These two numbers are important summaries of the size of the
training set. There are two other important numbers: the number of *unique*
words that occurred on the left side of a pair, and the number of unique
words that occurred on the right side of a pair. These two will be almost
equal, but not quite. It would be very useful for me to know these four
numbers: the first two characterize the *size* of your training set; the
second two characterize the size of the vocabulary.

> - row 63, learned NOT from parses produced by DNN, BUT from honest
> MST-Parses; however, MI-values for that were extracted from DNN and made
> specific to the context of every sentence, so each pair of words could
> have different MI-values in different sentences:

OK. Look: MI has a very precise definition. You cannot use some other
number you computed, and then call it "MI". Call it something else. Call it
"DLA" -- Deep Learning Affinity. Affinity, because the word "information"
also has a very precise definition: it is the negative log-base-2 of a
probability. If it is not that, then it cannot be called "information".
Call it "BW" -- Bertram Weights, if I understand correctly.
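To be concrete about the bookkeeping, here is roughly what I have in mind -- a toy sketch, not your code, and the names are mine. It tallies windowed word-pairs and reports the four numbers, plus an honest MI computed from those same counts:

```python
import math
from collections import Counter

def pair_statistics(sentences, window=6):
    """Count ordered (left, right) word-pairs within a window, and report
    the four summary numbers: unique pairs, total pairs (with
    multiplicity), unique left-words, unique right-words."""
    pair_counts = Counter()
    for words in sentences:
        for i, left in enumerate(words):
            for right in words[i + 1 : i + 1 + window]:
                pair_counts[(left, right)] += 1
    left_counts, right_counts = Counter(), Counter()
    for (left, right), n in pair_counts.items():
        left_counts[left] += n
        right_counts[right] += n
    stats = {
        "unique_pairs": len(pair_counts),
        "total_pairs": sum(pair_counts.values()),
        "unique_left_words": len(left_counts),
        "unique_right_words": len(right_counts),
    }
    return pair_counts, left_counts, right_counts, stats

def mutual_information(pair, pair_counts, left_counts, right_counts):
    """Honest MI: log2[ p(l,r) / (p(l,*) p(*,r)) ], with probabilities
    taken as observed frequencies over all counted pairs."""
    total = sum(pair_counts.values())
    left, right = pair
    p_lr = pair_counts[pair] / total
    p_l = left_counts[left] / total
    p_r = right_counts[right] / total
    return math.log2(p_lr / (p_l * p_r))
```

A number computed some other way (e.g. out of a DNN) can still be fed through the same machinery, but it should not be called MI.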
So, if I understand correctly, you computed some kind of DLA/BW number for
word-pairs, and then performed an MST parse using those numbers?

> exported in the new "ull" format invented by Man Hin:

Side-comment -- you guys seem to be confused about what the atomspace is,
and what it is good for. The **whole idea** of the atomspace is that it is
a "one size fits all" format, so that you do not have to "invent" new
formats. There is a reason why databases, and graph databases, are popular.
Inventing new file formats is a well-paved road to hell.

Regarding what you call "the breakthroughs":

> > Results from the ull-lgeng dataset indicate that the ULL pipeline is a
> > high-fidelity transducer of grammars. The grammar that is pushed in is
> > effectively the same as the grammar that falls out. If this can be
> > reproduced for other grammars, e.g. Stanford, McParseface or some HPSG
> > grammar, then one has a reliable way of tuning the pipeline. After it
> > is tuned to maximize fidelity on known grammars, then, when applied to
> > unknown grammars, it can be assumed to be working correctly, so that
> > whatever comes out must in fact be correct.
>
> That has been worked according to the plan set up way back in 2017. I am
> glad that you accept the results. Unfortunately, the MST-Parser is not
> built into the pipeline yet, but it is on the way.
>
> If someone like you could help with the outstanding work items, it would
> be appreciated, because we are short-handed now.
>
> > The relative lack of differences between the ull-dnn-mi and the
> > ull-sequential datasets suggests that the accuracy of the so-called
> > "MST parse" is relatively unimportant. Any parse, giving any results
> > with better-than-random outputs, can be used to feed the pipeline. What
> > matters is that a lot of observation counts need to be accumulated, so
> > that junky parses cancel each other out, on average, while good ones
> > add up and occur with high frequency.
> > That is, if you want a good signal, then integrate long enough that
> > the noise cancels out.
>
> I would disagree (and I guess Ben may disagree as well) given the
> existing evidence with "full reference corpus".

I think you are mis-interpreting your own results. The "existing evidence"
proves the opposite of what you believe. (I suspect Ben is too busy to
think about this very deeply.)

> If you compare F1 for LG-English parses with MST > 2 on tab "MWC-Study",
> you will find the F1 on LG-English parses is decent, so it is not that
> "parses do not matter", it is rather just "MST-Parses are even less
> accurate than sequential".

You are mis-understanding what I said; I think you are also
mis-understanding what your own data is saying. The F1-for-LG-English is
high for two reasons: (1) natural language grammar has the "decomposition
property" (aka the "lexical property"), and (2) you are comparing the
decomposition provided by LG to LG itself.

The "decomposition property" states that "grammar is lexical". Natural
language is "lexical" when its structure can be described by a "lexis" -- a
dictionary whose headings are words, and whose entries are word-definitions
of some kind -- disjuncts for LG; something else for
Stanford/McParseface/HPSG/etc.

If you take some lexical grammar (Stanford/McParseface/whatever), generate
a bunch of parses, run them through the ULL pipeline, and learn a new
lexis, then, ideally, if your software works well, that *new* lexis should
come close to the original input lexis. And indeed, that is what you are
finding with F1-for-LG-English. Your F1-for-LG-English results indicate
that if you use LG as input, then ULL correctly learns the LG lexis. That
is a good thing. I believe that ULL will also be able to do this for any
lexis... provided that you take enough samples. (There is a lot of evidence
that your sample sizes are much too small.)
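Since the word "lexis" is carrying all the weight here, let me spell out the minimal version of what "learning a lexis" means -- a toy sketch in my own notation, not the actual pipeline code. A parse is a set of links over word positions; the disjunct of a word is its left-connectors plus its right-connectors; the lexis maps each word to the set of disjuncts observed for it:

```python
from collections import defaultdict

def disjunct(word_index, links, words):
    """The disjunct of a word: connectors to the words it links to on the
    left (marked "-") and on the right (marked "+"), LG-style."""
    lefts = sorted(j for (j, k) in links if k == word_index)
    rights = sorted(k for (j, k) in links if j == word_index)
    return tuple(words[j] + "-" for j in lefts) + \
           tuple(words[k] + "+" for k in rights)

def learn_lexis(parses):
    """parses: list of (words, links), links being pairs (i, j) with i < j.
    Returns the lexis: word -> set of disjuncts observed for that word."""
    lexis = defaultdict(set)
    for words, links in parses:
        for i, word in enumerate(words):
            lexis[word].add(disjunct(i, links, words))
    return dict(lexis)
```

The real pipeline clusters connectors into classes rather than using raw words, of course; the point is only that the lexis is extracted from parses, whatever produced them.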
Let's assume, now, that you take Stanford parses, run them through ULL,
learn a dict, and then measure F1-for-Stanford against parses made by
Stanford. The F1 should be high. Ideally, it should be 1.0. If you measure
that learned lexis against LG, it will be lower -- maybe 0.9, maybe 0.8,
maybe as low as 0.65. That is because Stanford is not LG; there is no
particular reason for these two to agree, other than in some general
outline: they probably mostly agree on subjects, objects and determiners,
but will disagree on other details (aux verbs, "to be", etc.)

Do you see what I mean now? The ULL pipeline should preserve the lexical
structure of language. If you use lexis X as input, then ULL should
generate something very similar to lexis X as output. You've done this for
X==LG. Do it for X==Stanford, X==McParseface, etc. If you do, you should
see F1=1.0 for each of these (well, something close to F1=1.0).

Now for part two: what happens when X==sequential, what happens when
X==DNN-MI (aka "bertram weights"), and what happens when X=="honest MI"?

Let's analyze X==sequential first. First of all, this is not a lexical
grammar. Second of all, it is true that for English, and for just about
*any* language, "sequential" is a reasonably accurate approximation of the
"true grammar". People have actually measured this. I can give you a
reference that gives numbers for the accuracy of "sequential" for 20
different languages. One paper measures "sequential" for Old English,
Middle English, 17th, 18th, 19th and 20th century English, and finds that
English becomes more and more sequential over time! Cool!

If you train on X==sequential and learn a lexis, and then compare that
lexis to LG, you might find that F1=0.55 or F1=0.6 -- this is not a
surprise. If you compare it to Stanford, McParseface, etc. you will also
get F1=0.5 or 0.6 -- that is because English is kind-of sequential.
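So that we are measuring the same thing when we quote these numbers: by F1 I mean the usual harmonic mean of precision and recall over link sets, per sentence. A sketch, assuming parses are given as sets of (i, j) position pairs (my notation, not the evaluation tool's):

```python
def parse_f1(predicted_links, reference_links):
    """Precision/recall/F1 of one parse against a reference parse, where
    each parse is a set of undirected links (i, j) over word positions."""
    pred = {tuple(sorted(link)) for link in predicted_links}
    ref = {tuple(sorted(link)) for link in reference_links}
    hits = len(pred & ref)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Compare a learned-lexis parse against LG and you get F1-for-LG-English; against Stanford, F1-for-Stanford. The number only tells you how close you are to that particular reference.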
If you train on X==sequential and learn a lexis, and then compare that
lexis to "sequential", you will get ... kind-of-crap, unless your training
dataset is extremely large, in which case you might approach F1=1.0.
However, you will need an absolutely immense training corpus to get this --
many terabytes and many CPU-years of training. The problem is that
"sequential" is not lexical. It can be made approximately lexical, but that
lexis would have to be huge.

What about X==DNN-Bert and X==MI? Well, neither of those is lexical,
either. So you are using a non-lexical grammar source, and attempting to
extract a lexis out of it. What will you get? Well -- you'll get ...
something. It might be kind-of-ish LG-like. It might be kind-of-ish
Stanford-like. Maybe kind-of-ish HPSG-like. If your training set is big
enough (and your training sets are not big enough) you should get at least
0.65 or 0.7, maybe even 0.8 if you are lucky, and I will be surprised if
you get much better than that.

What does this mean? Well, the first claim is "ULL preserves lexical
grammars", and that seems to be true. The second claim is that "when ULL is
given a non-lexical input, it will converge to some kind of lexical
output". The third claim, "the Linas claim", that you love to reject, is
that "when ULL is given a non-lexical input, it will converge to the SAME
lexical output, provided that your sampling size is large enough".

Normally, this is followed by the question "what non-lexical input makes it
converge the fastest?" If you don't believe the third claim, then this is a
nonsense question. If you do believe the third claim, then information
theory supplies an answer: the maximum-entropy input will converge the
fastest. If you believe this answer, then the next question is "what is the
maximum-entropy input?" and I believe that it is honest-MI+weighted-clique.
Then there is claim four: the weighted clique can be approximated by MST.
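For claim four, the MST approximation is just this: score every word-pair in the sentence with whatever affinity you trust (honest MI, ideally), and keep only the maximum-weight spanning tree over the word positions, instead of the full weighted clique. A Prim-style toy sketch with a made-up scoring function; this is not the pipeline's implementation:

```python
def mst_parse(words, score):
    """Maximum spanning tree over word positions: score(w1, w2) is any
    word-pair affinity; returns the tree as a set of links (i, j), i < j."""
    in_tree = {0}
    links = set()
    while len(in_tree) < len(words):
        # Greedily attach the out-of-tree word with the strongest link
        # to any in-tree word (Prim's algorithm, maximizing total weight).
        best = max(((i, j) for i in in_tree
                    for j in range(len(words)) if j not in in_tree),
                   key=lambda ij: score(words[ij[0]], words[ij[1]]))
        links.add(tuple(sorted(best)))
        in_tree.add(best[1])
    return links
```

The weighted clique keeps all the pairwise scores; MST throws most of them away, which is part of why I suspect it converges more slowly.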
It is now becoming clear to me that MST is a kind-of mistake, and that a
weighted clique would probably be better, faster-converging. Maybe. The
problem with all of this is rate-of-convergence, sample-set size, amount of
computation. It is easy to invent a theoretically ideal NP-complete
algorithm; it's much harder to find something that runs fast.

Anyway, since you don't believe my third claim, I have a proposal. You
won't like it. The proposal is to create a training set that is 10x bigger
than your current one, and one that is 100x bigger than your current one.
Then run "sequential", "honest-MI" and "DNN-Bert" on each. All three of
these will start to converge to the same lexis. How quickly? I don't know.
It might take a training set that is 1000x larger. But that should be
enough; larger than that will surely not be needed. (Famous last words.
Sometimes, things just converge slowly...)

--
Linas

> Still, we have got "surprise-surprise" with "gold reference corpus".
> Note, it still says "parses do matter, but MST-Parses are as bad or as
> good as sequential, and both are still not good enough". Also note that
> it has been obtained on just 4 sentences, which is not reliable evidence.
>
> Now, we are full-throttle working on proving your claim with "silver
> reference corpus" -- stay tuned...
>
> Cheers,
>
> -Anton
>
> 22.06.2019 5:38, Linas Vepstas:
>
> Anton,
>
> It's not clear if you fully realize this yet or not, but you have not
> just one but two major breakthroughs here. I will explain them shortly,
> but first, can you send me your MST dictionary? Of the three that you'd
> sent earlier, none had the MST results in them.
>
> OK, on to the major breakthroughs... I describe exactly what they are in
> the attached PDF. It supersedes the PDF I had sent out earlier, which
> contained invalid/incorrect data. This new PDF explains exactly what
> works, what you've found. Again, it's important, and I'm very excited by
> it.
> I hope Ben is paying attention; he should understand this. This really
> paves the way to forward motion.
>
> BTW, your datasets that "rock"? Actually, they suck, when tested
> out-of-training-set. This is probably the third, but more minor,
> discovery: the Gutenberg training set offers poor coverage of modern
> English, and also your training set is wayyyy too small. All this is
> fixable, and is overshadowed by the important results.
>
> Let me quote myself for the rest of this email. This is quoted from the
> PDF. Read the whole PDF; it makes a few other points you should
> understand.
>
> ull-lgeng
>
> Based on LG-English parses, obtained from
> http://langlearn.singularitynet.io/data/aglushchenko_parses/GCB-FULL-ALE-dILEd-2019-04-10/context:2_db-row:1_f1-col:11_pa-col:6_word-space:discrete/
>
> I believe that this dictionary was generated by replacing the MST step
> with a parse where linkages are obtained from LG; these are then busted
> up back into disjuncts. This is an interesting test, because it validates
> the fidelity of the overall pipeline. It answers the question: "If I pump
> LG into the pipeline, do I get LG back out?" and the answer seems to be
> "yes, it does!" This is good news, since it implies that the overall
> learning process does keep grammars invariant. That is, whatever grammar
> goes in, that is the grammar that comes out!
>
> This is important, because it demonstrates that the apparatus is actually
> working as designed, and is, in fact, capable of discovering grammar in
> data! This suggests several ideas:
>
> * First, verify that this really is the case, with a broader class of
> systems. For example, start with the Stanford Parser, and pump it through
> the system. Then compare the output not to LG, but to the Stanford
> parser. Are the resulting linkages (the F1 scores) at 80% or better? Is
> the pipeline preserving the Stanford grammar? I'm guessing it does...
>
> * The same, but with Parsey McParseface.
> * The same, but with some known-high-quality HPSG system.
>
> If the above two bullet points hold up, then this is a major
> breakthrough, in that it solves a major problem. The problem is that of
> evaluating the quality of the grammars generated by the system. To what
> should they be compared? If we input MST parses, there is no particular
> reason to believe that they should correspond to LG grammars. One might
> hope that they would, based, perhaps, on some a-priori hand-waving about
> how most linguists agree about what the subject and object of a sentence
> is. One might in fact find that this does hold up to some fair degree,
> but that is all. Validating grammars is difficult, and seems ad hoc.
>
> This result offers an alternative: don't validate the grammar; validate
> the pipeline itself. If the pipeline is found to be structure-preserving,
> then it is a good pipeline. If we want to improve or strengthen the
> pipeline, we now have a reliable way of measuring, free of quibbles and
> argumentation: if it can transfer an input grammar to an output grammar
> with high fidelity, with low loss and low noise, then it is a quality
> pipeline. It instructs one how to tune a pipeline for quality: work with
> these known grammars (LG/Stanford/McParse/HPSG) and fiddle with the
> pipeline, attempting to maximize the scores. Build the highest-fidelity,
> lowest-noise pipeline possible.
>
> This allows one to move forward. If one believes that probability and
> statistics are the correct way of discerning reality, then that's it: if
> one has a high-fidelity corpus-to-grammar transducer, then whatever
> grammar falls out is, necessarily, a priori, a correct grammar.
> Statistics doesn't lie. This is an important breakthrough for the
> project.
> ull-sequential
>
> Based on "sequential" parses, obtained from
> http://langlearn.singularitynet.io/data/aglushchenko_parses/GCB-FULL-SEQ-dILEd-2019-05-16-94/GL_context:2_db-row:1_f1-col:11_pa-col:6_word-space:discrete/
>
> I believe that this dictionary was generated by replacing the MST step
> with a parse where there are links between neighboring words, and then
> extracting disjuncts that way. This is an interesting test, as it
> leverages the fact that most links really are between neighboring words.
> The sharp drawback is that it forces each word to have an arity of
> exactly two, which is clearly incorrect.
>
> ull-dnn-mi
>
> Based on "DNN-MI-linked MST-Parses", obtained from
> http://langlearn.singularitynet.io/data/aglushchenko_parses/GCB-GUCH-SUMABS-dILEd-2019-05-21-94/GL_context:2_db-row:1_f1-col:11_pa-col:6_word-space:discrete/
>
> I believe that this dictionary was generated by replacing the MST step
> with a parse where some sort of neural net is used to obtain the parse.
>
> Comparing either of these to the ull-sequential dictionary indicates that
> precision is worse, recall is worse, and F1 is worse. This vindicates
> some statements I had made earlier: the quality of the results at the
> MST-like step of the process matters relatively little for the final
> outcome. Almost anything that generates disjuncts at slightly better than
> random will do. The key to learning is to accumulate many disjuncts: just
> as in radio signal processing, or any kind of frequentist statistics,
> integrate over a large sample, hoping that the noise will cancel out,
> while the invariant signal is repeatedly observed and boosted.
>
> On Thu, Jun 20, 2019 at 11:11 PM Anton Kolonin @ Gmail
> <[email protected]> wrote:
>
>> It turns out the difference between applying MWC to both GL and GT
>> (lower block) and to GT only (upper block) is negligible -- applying it
>> to GL makes results 1% better.
>> So far, testing on full LG-English parses (including partially parsed)
>> as a reference:
>>
>> As we know, MWC=2 is much better than MWC=1, with no further
>> improvement beyond that.
>>
>> "Sequential parses" rock; MST and "random" parses suck.
>>
>> Pearson(parses, grammar) = 1.0
>>
>> Alexey is running this with the "silver standard" for MWC=1,2,3,4,5,10
>>
>> -Anton
>
> --
> -Anton Kolonin
> skype: akolonin
> cell:
> [email protected]
> https://aigents.com
> https://www.youtube.com/aigents
> https://www.facebook.com/aigents
> https://medium.com/@aigents
> https://steemit.com/@aigents
> https://golos.blog/@aigents
> https://vk.com/aigents

--
cassette tapes - analog TV - film cameras - you

--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/CAHrUA34sz8okB%2B3UGM0XO9CsYu5U0MBRL0A0fedTDtN%2BNc7Mfg%40mail.gmail.com.
