Hi Anton, sorry for the very late reply.
On Tue, Apr 23, 2019 at 8:25 PM Anton Kolonin @ Gmail <[email protected]>
wrote:
> Linas, how would you "weight the disjuncts"?
>
> We know how to weight the words (by frequency), and word pairs (by MI).
>
> But how would you weight the disjuncts?
>
That is a very good question. There are several (many) different kinds of
weighting schemes, and I do not know which is best. That is the point where I
last left things, half a year ago now. But first, some theoretical
preliminaries.
Given any ordered pair at all -- thing-a and thing-b -- you can compute the MI
for the pair. Thing-a does not have to be the same type as thing-b. In this
case, the pair of interest is (word, one-of-the-disjuncts-on-that-word).
Write it as (w,d) for short. The MI is defined the same as always:
MI(w,d) = log2 [ p(w,d) / (p(w,*) p(*,d)) ], where p is the frequency of
observation: p(w,d) = N(w,d)/N(*,*) as always, N being the observation count
and * the wild-card sum.
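To make that concrete, here is a toy Python sketch of the same computation
(the counts and the disjunct names are invented for illustration; this is not
the actual pipeline code):

```python
# Compute MI(w,d) from raw observation counts N(w,d):
#   p(w,d) = N(w,d)/N(*,*)  and  MI(w,d) = log2( p(w,d) / (p(w,*) p(*,d)) )
from collections import defaultdict
from math import log2

# Toy counts N(w,d): (word, disjunct) -> observation count (invented numbers).
N = {
    ("saw", "S- & O+"): 8,
    ("saw", "A+"): 1,
    ("dog", "D- & S+"): 6,
}

N_total = sum(N.values())          # N(*,*), the wild-card sum over everything
N_w = defaultdict(int)             # N(w,*), marginal count for each word
N_d = defaultdict(int)             # N(*,d), marginal count for each disjunct
for (w, d), n in N.items():
    N_w[w] += n
    N_d[d] += n

def mi(w, d):
    """MI(w,d) = log2( p(w,d) / (p(w,*) * p(*,d)) )."""
    p_wd = N[(w, d)] / N_total
    p_w = N_w[w] / N_total
    p_d = N_d[d] / N_total
    return log2(p_wd / (p_w * p_d))
```

With these toy counts, mi("dog", "D- & S+") = log2(15/6), about 1.32 bits.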
The pipeline code already computes this; I'm not sure if you use it or not.
It's in the `(use-modules (opencog matrix))` module; it computes MI for
pairs-of-anythings in the atomspace. It's generic, in that one can set up
thing-a to be some/any collection of atoms in the atomspace, and thing-b
can be any other collection of atoms, and it will start with the counts
N(thing-a, thing-b) and compute probabilities, marginal probabilities,
conditional probabilities, MI, entropies -- "the whole enchilada" of
statistics you can do on pairs of things. It's called "matrix" because
"pairs of things" looks like an ordinary matrix [M]_ij.
Sounds boring, but here's the kicker: `(opencog matrix)` is designed to
work with extremely sparse matrices, which every other package (e.g. scipy)
will choke on. For example: if thing-a = thing-b = words, and there are 100K
words, then M_ij potentially has 100K x 100K = 10 giga-entries, which will
blow up RAM if you try to store the whole matrix. In practice, 99.99% of
them are zero (the observation count N(left-word, right-word) is zero
for almost all word pairs). So the atomspace is being used as storage for
hyper-sparse matrices, and you can layer the matrix onto the atomspace any
way that you want. It's like a linear cross-section through the atomspace:
linear, vector, etc.
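The core idea can be sketched in a few lines of Python (my own toy
illustration of hyper-sparse count storage, not the atomspace API): store
only the observed pairs, and let everything absent be an implicit zero.

```python
# Dict-of-keys sparse count "matrix" with wild-card marginals, so that
# memory scales with the number of *observed* pairs, not with rows x cols.
from collections import defaultdict

class SparseCounts:
    def __init__(self):
        self.n = defaultdict(int)      # N(a, b): only non-zero entries stored
        self.left = defaultdict(int)   # N(a, *): row marginal
        self.right = defaultdict(int)  # N(*, b): column marginal
        self.total = 0                 # N(*, *): grand total

    def observe(self, a, b, count=1):
        self.n[(a, b)] += count
        self.left[a] += count
        self.right[b] += count
        self.total += count

m = SparseCounts()
m.observe("the", "dog")
m.observe("the", "cat")
m.observe("a", "dog")
# Only 3 cells are stored, versus the 10^10 cells a dense 100K x 100K
# matrix would need.
```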
OK, so... the existing language pipeline computes MI(w,d) already, and
given a word, and a disjunct on that word, you can just look it up. ...
but if you are clustering words, then the current code does not
recompute MI(g,d) for a word-group ("grammatical class") g. Or maybe it
does recompute, but it might be incomplete or untested, or different,
because maybe your code is different. For the moment, let me
ignore clustering...
So, for link-grammar, just take -MI(w,d) and make that the link-grammar
"cost". Minus sign because larger-MI==better.
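As a sketch (the MI values here are invented for illustration; real ones
come from the pipeline counts):

```python
# Turn MI(w,d) into a link-grammar-style cost. LG prefers *lower* cost,
# and larger MI is better, hence the minus sign.
mi_scores = {
    "S- & O+": 3.1,    # a frequently-confirmed disjunct: high MI
    "MV+ & O-": 1.7,
    "A+": -0.4,        # a rarely-seen, probably-spurious disjunct: low MI
}

costs = {d: -mi for d, mi in mi_scores.items()}

# The parser should prefer the lowest-cost (highest-MI) disjunct.
best = min(costs, key=costs.get)
```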
How well will that work? I dunno. This is new territory for me. Ben has long
insisted on "surprisingness" as a better number to work with. I have not
implemented surprisingness in the matrix code; nothing computes it yet.
Besides using MI, one can invent other things. I strongly believe that MI
is the correct choice, but do not have any concrete proof.
If you do have grammatical clusters g, then perhaps one should use
MI(w,g)+MI(g,d) or maybe just use MI(g,d) by itself. Likewise, if the
disjunct 'd' is the result of collapsing-together a bunch of single-word
disjuncts, maybe you should add MI(disjunct-class, single-disjunct) to the
cost. I dunno. I was half-way through these experiments when Ben
re-assigned me, so this is all new territory.
-- Linas
> -Anton
>
>
> 24.04.2019 4:13, Linas Vepstas wrote:
>
>
>
> On Tue, Apr 23, 2019 at 5:00 AM Ben Goertzel <[email protected]> wrote:
>
>> > On Mon, Apr 22, 2019 at 11:18 PM Anton Kolonin @ Gmail <
>> [email protected]> wrote:
>> >>
>> >>
>> >> We are going to repeat the same experiment with MST-Parses during this
>> week.
>> >
>> >
>> > The much more interesting experiment is to see what happens when you
>> give it a known percentage of intentionally-bad unlabelled parses. I claim
>> that this step provides natural error-reduction, error-correction, but I
>> don't know how much.
>>
>>
>> If we assume roughly that "insufficient data" has a similar effect to
>> "noisy data", then the effect of adding intentionally-bad parses may
>> be similar to the effect of having insufficient examples of the words
>> involved... which we already know from Anton's experiments. Accuracy
>> degrades smoothly but steeply as number of examples decreases below
>> adequacy.
>>
>
> These are effects that operate at different scales. In my experience, a
> word has to be seen at least five times before it gets linked
> mostly/usually accurately. The reason for this is simple: if a word is seen
> only once, it has equal co-occurrence with all of its nearby neighbors:
> any neighbor is equally likely to be the right link (so for N neighbors, a
> 1/N chance of guessing correctly). When a word is seen five times, the
> collection of nearby neighbors has grown into the several-dozens, and of
> those several dozen, only 1 or 2 or 3 will have been seen repeatedly. The
> correct link is to one of the repeats. And so, "from first principles", I
> can guess that 5 is the minimum number of observations to arrive at an MST
> parse that is better than random-chance. This effect is operating at the
> word-pair level, and determines the accuracy of MST.
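To make the argument above concrete, here is a toy simulation (the
geometry -- one true collocate plus nine random distractor neighbors per
sighting, over a 1000-word distractor vocabulary -- is invented for
illustration):

```python
# Guess the link for a word as its most frequently co-occurring neighbor,
# ties broken at random. With one sighting this is a ~1-in-10 shot; by
# five sightings the true collocate has repeated while the distractors
# remain scattered.
import random
from collections import Counter

def guess_accuracy(n_sightings, trials=2000, vocab=1000, neighbors=9, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        counts = Counter()
        for _ in range(n_sightings):
            counts["TRUE"] += 1               # the true collocate recurs
            for _ in range(neighbors):        # distractors are random draws
                counts[rng.randrange(vocab)] += 1
        top = max(counts.values())
        tied = [w for w, c in counts.items() if c == top]
        hits += rng.choice(tied) == "TRUE"
    return hits / trials
```

With these numbers, one sighting scores close to random chance (about 0.1)
and five sightings are nearly always right.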
>
> The other effect is operating at the disjunct level. Consider a single
> word, and 10 sentences containing that word. Assume each sentence has an
> unlabelled parse, which might be wrong. Assume that word is linked
> correctly 7 times, and incorrectly 3 times. Of those 3 times, only some of
> the links will be incorrect (typically, a word has more than one link going
> to it). When building disjuncts, this leads to 7 correct disjuncts, and 3
> that are (partly) wrong.
>
> Consider an 11th "test sentence" containing that word. If you weight each
> disjunct equally, then you have a 7/10 chance of using good disjuncts and a
> 3/10 chance of using bad ones. Solution: do not weight them equally! But
> how to do this? Short answer: the MI mechanism, w/ clustering, means that
> on average, the 7 correct disjuncts will have a high MI score, the 3 bad
> ones will have a low MI score, and thus, on the test sentence, it will be
> far more likely that the correct disjuncts get used. The final accuracy
> should be better than 7/10.
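A toy version of that arithmetic (the MI values are assumed, not computed;
the point is only the shape of the effect):

```python
# A word with 10 observed disjuncts: 7 from correct parses, 3 from bad
# ones. Equal weighting picks a good one with probability 7/10. Weighting
# each disjunct by 2**MI (good disjuncts recur and so score high; bad
# ones stay scattered and score low) pushes the choice strongly toward
# the good ones.
good_mi = 2.5          # assumed average MI of the 7 correct disjuncts
bad_mi = -1.0          # assumed average MI of the 3 bad ones

disjuncts = [("good", good_mi)] * 7 + [("bad", bad_mi)] * 3

p_equal = 7 / 10       # chance of a good disjunct under equal weighting

weights = [2 ** mi for _, mi in disjuncts]
p_weighted = sum(w for (kind, _), w in zip(disjuncts, weights)
                 if kind == "good") / sum(weights)
# p_weighted is about 0.96 -- well above the unweighted 0.70.
```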
>
> This depends on a key step: correctly weighting disjuncts, so that this
> discrimination kicks in. Without discrimination, the resulting LG
> dictionary will have accuracy that is no better than MST (and maybe a bit
> worse, due to other effects).
>
>
>
>> ***
>> My claim is that this mechanism acts as an "amplifier" and a "noise
>> filter" -- that it can take low-quality MST parses as input, and
>> still generate high-quality results. In fact, I make an even
>> stronger claim: you can throw *really low quality data* at it --
>> something even worse than MST, and it will still return high-quality
>> grammars.
>>
>> This can be explicitly tested now: take the 100% perfect unlabelled
>> parses, and artificially introduce 1%, 5%, 10%, 20%, 30%, 40% and 50%
>> random errors into it. What is the accuracy of the learned grammar? I
>> claim that you can introduce 30% errors, and still learn a grammar
>> with greater than 80% accuracy. I claim this, I think it is a very
>> important point -- a key point - but I cannot prove it.
>> ***
>>
>> Hmmm. So I am pretty sure you are right given enough data.
>>
>> However, whether this is true given the magnitudes of data we are now
>> looking at (Gutenberg Childrens Corpus for example) is less clear to
>> me
>>
>
> It's a fairly large corpus -- what, 750K sentences? and 50K unique words? (of
> which only 5K or 8K were seen more than five times!!) So I expect accuracy
> to depend on word frequency: if the test sentences only contain words
> from that 5K vocabulary, they will have (much) higher accuracy than
> sentences that contain words that were seen 1-2 times.
>
> I also expect the disjuncts on the most frequent 1K words to be of
> much higher accuracy than those on the next 4K -- so, for test sentences
> containing only words from the top 1K, I expect high accuracy. For
> longer sentences containing infrequent words, I expect most of it to be
> linked correctly, except for the portion near the infrequent word, where
> the error rate goes up.
>
> One of the primary reasons to perform clustering is to "amplify frequency"
> - by grouping together words that are similar, the grand-total counts go
> up, the probably-correct disjunct counts shoot way up, while the
> maybe-wrong disjunct counts stay scattered and low, never coalescing.
>
>
>> Also the current MST parses are much worse than "30% errors" compared
>> to correct parses.
>
>
> Did Deniz Yuret falsify his thesis data? He got better than 80% accuracy;
> we should too.
>
>
>> So even if what you say is correct, it doesn't
>> remove the need to improve the MST parses...
>>
>
> Actually, one of my proposals from the previous block of emails was to
> make MST worse! I'm so sick of hearing about MST that I proposed getting
> rid of it, replacing it with something of lower quality, and focusing on
> the clustering and disjunct-weighting schemes to improve accuracy.
>
> I'm fairly certain that replacing MST with something lower-quality will
> still work well. If that is not the case, then that means that the
> disjunct-processing stages are somehow being done wrong. The final result
> should not depend very much on the accuracy of MST. And this does not
> require a huge corpus, either. If there is a strong dependence on MST,
> something is seriously wrong, seriously broken in the disjunct-processing
> stages. We need to spend energy on fixing that brokenness and not on
> making MST better.
>
> (And I would not be surprised if the disjunct-processing stages are
> broken, mostly because I have not seen any detailed description of how they
> are being performed. The details there really matter, they really affect
> outcomes, but those details are not being discussed.)
>
> To repeat myself-- these later stages are where all the action is -- if
> these later stages are weak, nothing can be built on them.
>
> --linas
>
>
>> But you are right -- this will be an interesting and important set of
>> experiments to run. Anton, I suggest you add it to the to-do list...
>>
>> -- Ben
>>
>
>
> --
> cassette tapes - analog TV - film cameras - you
>
> --
> -Anton Kolonin
> skype: akolonin
> cell:
> [email protected]https://aigents.comhttps://www.youtube.com/aigentshttps://www.facebook.com/aigentshttps://medium.com/@aigentshttps://steemit.com/@aigentshttps://golos.blog/@aigentshttps://vk.com/aigents
>
>
--
cassette tapes - analog TV - film cameras - you
--
You received this message because you are subscribed to the Google Groups
"opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/opencog.
To view this discussion on the web visit
https://groups.google.com/d/msgid/opencog/CAHrUA378CCcTB1DDx0OkaU9-2TiEOtHA%3DKPdMidRXMTXnm5p%3DA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.