On Tue, Apr 2, 2019 at 12:57 AM Anton Kolonin @ Gmail <[email protected]>
wrote:

> Hi Linas,
>
> Are you saying that "while the ULL team has found a strong linear correlation
> between A) quality (F1) of the input parses and B) quality (F1) of the output
> parses based on the grammar learned from the input parses, this phenomenon
> is due to the fact that they test on the entire input corpus, so this
> phenomenon should go away once they test on a gold-standard corpus consisting
> only of sentences with high-frequency words"?
>

I am saying that I have not seen any evidence at all that you actually
constructed or counted disjuncts, or that you clustered disjuncts, or that
you controlled or managed counting in any way.

So -- you did something ... but I don't understand what that "something"
is, and, based on these conversations, that "something" does not match up
with what I had hoped that you would be doing.

It's not just high-frequency words.  It's also about how you perform
clustering.  Are you using MI for that, or cosines?  Are you handling
word-sense disambiguation, or not? How did you handle WSD? Through
orthogonalization of cosines? Through maximization of MI? By computing a
Markov vector? Some other way?  Did you perform any data cuts during
orthogonalization/maximization? What kind of cuts were they? How do the
cuts affect the F1-score?
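For concreteness, here is a toy sketch of the two pair-scoring measures named above, pointwise MI and cosine similarity between count vectors. The function names and all the numbers are illustrative assumptions, not the pipeline's actual code.

```python
import math

def pmi(pair_count, left_count, right_count, total):
    """Pointwise mutual information: log2( p(l,r) / (p(l)*p(r)) )."""
    return math.log2((pair_count * total) / (left_count * right_count))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

# Toy numbers: a pair seen 10 times in 10000, words seen 100 and 50 times.
print(pmi(10, 100, 50, 10000))                     # log2(20), about 4.32
print(cosine({"a": 1, "b": 2}, {"a": 2, "b": 4}))  # parallel vectors -> 1.0
```

Whether one clusters on MI or on cosines (and what cuts one applies first) changes the resulting word-classes, which is exactly why the question matters for interpreting any F1-score.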

All of these things deserve "instrumental verification". Without it, I
don't know how to assign any meaning to F1-scores (or ROC curves, which you
haven't shown -- and even if you had shown them, I would not know what they
mean until the above questions are resolved).

So I've got this ball of questions, and I'm getting unclear, confusing
answers to them.

-- Linas

Best regards,
>
> -Anton
>
>
> 02.04.2019 5:38, Linas Vepstas пишет:
>
> OK, there's clearly a lot of work happening in linguistics these days
> that I have fallen behind on reading.
>
> The nature of the conversations here has been frustrating, because so far,
> it sounds like an attempt to evade the  "central limit theorem" --
> https://en.wikipedia.org/wiki/Central_limit_theorem
>
> There are two related ideas I'm trying to get across: one is that if you
> make enough observations of a phenomenon, eventually, the central-limit
> theorem kicks in, and smooths over random variations.  Specifically, I
> claim that, despite MST being imperfect, a large number of observations
> should smooth over the imperfections. I believe this to be true (but I
> could be wrong).
>
> The other idea is that the golden test corpus must avoid accidentally
> testing disjuncts far away from the central limit -- to avoid, as it were,
> making statements analogous to "Well, I flipped the coin three times, and I
> did not get 50-50 odds, therefore the theory doesn't work". You have to
> flip the coin at least N times, for some large N.  Here, for MST, we don't
> know how big N has to be, and we don't have a good plan for determining N.
> It's worse, because everything is Zipfian, a.k.a. 1/f noise. It is possible that
> BERT or other approaches allow smaller values of N to work, but this is
> also not clear.
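The coin-flip point above can be made concrete with a toy simulation (the seed and sample sizes here are arbitrary illustrative choices): the empirical deviation from 50-50 shrinks roughly like 1/sqrt(N), so a test built on N=3 observations of a word says nothing.

```python
import random

random.seed(1)  # arbitrary seed, for reproducibility of the toy run

def observed_bias(n):
    """|empirical frequency of heads - 0.5| after n fair coin flips."""
    heads = sum(random.random() < 0.5 for _ in range(n))
    return abs(heads / n - 0.5)

# Small n gives noisy estimates; large n lets the central limit kick in.
for n in (3, 30, 300, 30000):
    print(n, observed_bias(n))
```

For Zipfian data the catch is that most words are rare, so most words never reach a large enough N no matter how big the corpus is.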
>
> It's also not clear that BERT would converge to a different limit than MST
> -- the central-limit theorem says there is only one limit, not two. But
> perhaps I'm misapplying it, perhaps I'm neglecting some important effect.
> Without measurements, it's hard to guess what that effect is (if it even
> exists).
>
> Anyway, I have a backlog of half-a-dozen important unread papers, so I'll
> try to get around to that "real soon now".
>
> --linas
>
>
>
> On Mon, Apr 1, 2019 at 12:15 AM Ben Goertzel <[email protected]> wrote:
>
>> "Replacing MST by DNN/BERT" is a strange way to put it...
>>
>> DNN/BERT builds a pretty complex and comprehensive language model,
>> much beyond what is done by calculation of MI values and similar
>>
>> The extraction of a parse dag satisfying syntactic constraints (no
>> links cross, covering all words in the sentence, connected graph) is a
>> conceptually simple step, and nobody is spending much time on this
>> step indeed...
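The "conceptually simple step" can be sketched as a greedy pass over pair scores: keep the highest-scoring links that neither cross an accepted link nor close a cycle, yielding a planar spanning tree. This is a toy approximation for illustration only, not the project's MST code; the word indices and scores are made up.

```python
def mst_parse(n, scores):
    """n = word count; scores = {(i, j): weight} with i < j.
    Greedily accept links by descending score, rejecting crossings
    and cycles, so the result is a planar (link-crossing-free) tree."""
    parent = list(range(n))  # union-find forest for cycle detection

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    links = []
    for (i, j), _ in sorted(scores.items(), key=lambda kv: -kv[1]):
        # reject a link that would cross an already-accepted link
        if any(a < i < b < j or i < a < j < b for a, b in links):
            continue
        # reject a link that would form a cycle
        ri, rj = find(i), find(j)
        if ri == rj:
            continue
        parent[ri] = rj
        links.append((i, j))
    return sorted(links)

# "I saw wood": 0=I, 1=saw, 2=wood, with made-up pair scores
print(mst_parse(3, {(0, 1): 2.0, (1, 2): 1.5, (0, 2): 0.3}))  # [(0, 1), (1, 2)]
```

The scores fed in could come from MI counts, from BERT, or from anything else; the extraction step itself stays the same, which is Ben's point.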
>>
>> The question of how to assign a quantitative weight to the relation
>> btw two word-instances in a sentence, taking into account the specific
>> context in that sentence, but also the history of co-utilization of
>> those words (or other similar words), is less conceptually simple and
>> this is one place I think DNN language models can help
>>
>> Using MST or similar parsing based on numbers exported from DNN
>> language models is one way of extracting symbolic-ish structured
>> knowledge from these big messy subsymbolic probabilistic language
>> models...
>>
>> The DNNs in use now like BERT do not really satisfy me on a
>> theoretical or conceptual level, but they have been tuned to work
>> pretty nicely and they have been implemented pretty efficiently on
>> multi-GPU hardware -- so, given this and given the quality of the
>> recent practical results obtained with them -- I consider it well
>> worth exploring how to use them as tools in our pursuits for grammar
>> and semantics learning
>>
>> -- Ben
>>
>> On Mon, Apr 1, 2019 at 2:07 PM Linas Vepstas <[email protected]>
>> wrote:
>> >
>> >
>> >
>> > On Sun, Mar 31, 2019 at 10:51 PM Anton Kolonin @ Gmail <
>> [email protected]> wrote:
>> >>
>> >> Hi Linas, I like this thread more and more :-)
>> >
>> > I don't. I use a lot of CAPITALIZED WORDS below.  There is a deep and
>> dark fundamental misunderstanding, and I am sometimes at wits' end trying to
>> figure out why, and how to explain things in an understandable fashion.
>> >>
>> >> >But somehow, I suspect... Isn't this why OpenCog has "unified rule
>> engine" (URE) instead of link grammar at its core,
>> >>
>> >> Linas, approaching the "extraction of phrasemes" goal was discussed
>> exactly in terms of MST->GL->URE last fall in the Hong Kong discussion:
>> https://docs.google.com/document/d/13YyqtGud0GAbVaFcc94kAd2LhGf7jTr5XDYgiuC294c/edit
>> >>
>> >> That is:
>> >>
>> >> 1) Do MST-parsing to get word links (proto-disjuncts)
>> >>
>> >> 2) Do Grammar Learning to cluster and conclude word categories and
>> rules with disjuncts
>> >>
>> >> 3) Do URE-kind-of-thing to build the rules into "phrasemes" or
>> "sections" or "patterns".
>> >
>> > Yes.
>> >>
>> >> However, your current discourse and our current results just show that
>> "no one is able to do reasonable MST-parsing", so the above is just a waste
>> of time, correct?
>> >
>> > No. Very much no.  I'm saying the opposite of that. You can replace MST
>> by almost *ANYTHING* else, and the quality of your results WILL NOT CHANGE!
>> >
>> > If the quality of your results depends on the quality of MST, you are
>> DOING SOMETHING WRONG!
>> >
>> > I'm utterly flabbergasted. I don't know how many more times I can say
>> this: stop wasting time on this unimportant step!
>> >>
>> >> As we speak, Ben, Alexey, Sergey and Asuares are trying to
>> use DNN/BERT magic to do trick 1.
>> >
>> > I want to call this "a complete waste of time". It will almost surely
>> not improve the quality of the results!  I don't understand why four smart
>> people think that replacing MST by BERT will make any difference at all!
>> It should not matter!  Nothing depends on this step! Anything at all,
>> anything with a probability better than random chance, is sufficient!  Why
>> isn't this obvious?
>> >
>> > If Ben is reading this: I recall talking to Ben about this in an
>> ice-cream shop in Berlin, for an AGI conference, and he seemed to
>> understand back then.  I have no idea why he changed his mind.  I really do
>> not understand why everyone spends so much time obsessing about MST. Is
>> this a "color of the bike shed" problem?
>> https://en.wikipedia.org/wiki/Law_of_triviality
>> >
>> > MST-vs.-BERT==color-of-bike-shed
>> >
>> > Just use MST. It's simple. It works. It gives good results.  Stop
>> trying to improve it.  The interesting problems are elsewhere!  Just use
>> MST, and move on to the good stuff!
>> >>
>> >> To my mind, that may become possible only if the DNN/BERT magic does the
>> trick with steps 2 and 3 done under the hood. In that case, we don't need
>> to do 2 and 3 after we have the DNN/BERT-based model, because we can simply
>> "milk" the grammar rules out of the DNN/BERT mycelium. And we don't need
>> ULL either, by the way, because we just need DNN/BERT and rows of different
>> sorts of milking machines around it.
>> >
>> > So why are you bothering to work on ULL?
>> >>
>> >> So, instead of solving the problem of constructing the pipeline for
>> learning grammar from raw text, we need to solve the problem of milking the
>> grammar out of a DNN/BERT model trained on these texts, right?
>> >
>> > Because I don't think that you know how to milk lexical functions out
>> of DNN/BERT -- we've wasted more than a year talking about MST.  Instead of
>> endlessly talking about MST, you could have JUST USED IT, WITHOUT ANY
>> MODIFICATIONS, gotten good results, and spent the year working on something
>> interesting!
>> >
>> > Again: replacing MST by DNN/BERT or by anything else will NOT IMPROVE
>> the accuracy!  You'll have exactly the same accuracy as before, and if your
>> accuracy improves, it is because you are doing something wrong!
>> >
>> >> However, either way, we need to understand the algorithmic machinery of
>> how links assemble into disjuncts and disjuncts assemble into sections,
>> through the universe-scale combinatorial explosion.
>> >
>> > No. That is the OPPOSITE of what ACTUALLY HAPPENS!!!!
>> >>
>> >> And I agree that clustering and categorizing words and links (and then
>> disjuncts and sections, right?) is part of the process -- explicitly in the
>> ULL pipeline or implicitly deep in the DNN/BERT darkness.
>> >
>> > It is NOT DEEP AND DARK.  I wrote not one but TWO PAPERS on this,
>> CASTING LIGHT ON THAT DARKNESS.
>> >
>> > I'm frustrated to the 43rd degree on why I cannot seem to have a
>> reasonable conversation with any other human being about any of this.
>> >
>> > -- Linas
>> >
>> >> Cheers,
>> >>
>> >> -Anton
>> >>
>> >>
>> >> 01.04.2019 9:17, Linas Vepstas:
>> >>
>> >>
>> >>
>> >> On Thu, Mar 28, 2019 at 10:22 AM Ivan V. <[email protected]> wrote:
>> >>>
>> >>> Linas Vepstas wrote:
>> >>>
>> >>> >... knowledge extraction can be done generically, and not just on
>> language.
>> >>>
>> >>> If link grammar would be Turing complete, this might be possible
>> right away.
>> >>
>> >>
>> >> In my experience, thinking about Turing completeness is unproductive
>> and a distraction.
>> >>
>> >>> But somehow, I suspect... Isn't this why OpenCog has "unified rule
>> engine" (URE) instead of link grammar at its core,
>> >>
>> >>
>> >> No. It has the rule-engine because back then, I did not understand
>> sheaves.  I'm starting to think that the rule engine is a strategic
>> mistake. The original idea is that rule-application is the main conceptual
>> abstraction of term-rewriting.  One rewrites, or proves theorems by
>> applying sequences of rules.  It turns out that discovering the right
>> sequence is hard. Finding correct long sequences is hard - a combinatorial
>> explosion.
>> >>
>> >> The openpsi system addresses some of these issues. Unfortunately, its
>> current implementation is a tangle of rule-selection mechanisms and
>> theories of human psychology. It's probably better than the URE, but is
>> currently not as powerful.
>> >>
>> >> I'm trying to put in place a theory of sheaves as a replacement for the
>> URE, and as the natural generalization of openpsi, but I've successfully
>> sabotaged my own efforts.
>> >>
>> >>>
>> >>> and with URE things get much more complicated. I'm sorry, but that is
>> still a Gordian knot to me, considering all of my modest knowledge.
>> >>
>> >>
>> >> We all have modest knowledge. That is the nature of the human
>> condition.
>> >>
>> >>>
>> >>> On the other hand, if someone really smart would provide automatic
>> grammar extraction by means of unrestricted grammar, I believe that would
>> be it.
>> >>
>> >>
>> >> Yes, that is the goal of the language-learning project.  However, as
>> noted in my last email (on the link-grammar list) it is not enough to just
>> learn a semi-Thue system, declare victory, and go home.  The example I gave
>> there:
>> >>
>> >>   "I think that you should give that car a second look"
>> >>   "you should really give that song a second listen"
>> >>   "maybe you should give Sue a second chance".
>> >>
>> >> Learning to parse these "set phrases" or phrasemes is equivalent to
>> learning a semi-Thue system; however, that is not enough to realize that all
>> three are forms of advice-giving, having "conserved" or "fixed" regions "x
>> YOU SHOULD y GIVE z SECOND w" where z is highly variable, with
>> millions of variations, and w has only a few dozen allowed variations.
>> Note that the words "fixed", "conserved", "variable" are the words used in
>> genetics and proteomics and antibody structure. It's the same idea.
>> >>
>> >> The goal of learning lexical functions (LF's) is to learn that all
>> three are advice-giving forms, and also to learn what is, and what can be
>> plugged in for x,y,z,w.   So, although a super-whiz-bang grammar learner
>> capable of learning context-sensitive languages should be able to learn "x
>> YOU SHOULD y GIVE z SECOND w", it still will not know the *meaning* of this
>> phrase.  To know the *meaning*, you have to know the acceptable ranges (as
>> fuzzy-sets) of x,y,z,w.
>> >>
>> >> To conclude, thinking about Turing-completeness is a waste of time,
>> because Turing completeness only tells you that "x YOU SHOULD y GIVE z
>> SECOND w" is recursively enumerable; it does not tell you what it actually
>> means.
>> >>
>> >> Put another way:  having a universal Turing machine is not the same as
>> knowing how some particular program works. Automagically learning a
>> context-sensitive grammar is not enough to know what that grammar is
>> "saying/doing".
>> >>
>> >> -- Linas
>> >>
>> >>>
>> >>>
>> >>> Thank you,
>> >>> Ivan V.
>> >>>
>> >>>
>> >>> čet, 28. ožu 2019. u 07:58 Anton Kolonin @ Gmail <[email protected]>
>> napisao je:
>> >>>>
>> >>>> Ben, Linas,
>> >>>>
>> >>>> >But we know that MST parsing is shit.  Stop wasting time on MST or
>> trying to "improve" it.
>> >>>>
>> >>>> I think that sounds like a kind of support for the concept of "dumb
>> explosive parsing" advocated 1+ years ago:
>> >>>>
>> >>>>
>> https://docs.google.com/document/d/14MpKLH5_5eVI39PRZuWLZHa1aUS73pJZNZzgigCWwWg/edit#heading=h.aqo9bumb3doy
>> >>>>
>> >>>> I also agree with the rest of Linas's reasoning in this thread. I would
>> consider giving it a try starting next month if we don't have a
>> breakthrough with DNN-MI-milking-based MST-parsing by that time.
>> >>>>
>> >>>> > can be done generically, and not just on language
>> >>>>
>> >>>> I think everyone in bio-informatics dreams of extracting secrets of
>> "dark side of the genome" with something like that ;-)
>> >>>>
>> >>>> Cheers,
>> >>>>
>> >>>> -Anton
>> >>>>
>> >>>>
>> >>>> 28.03.2019 1:24, Linas Vepstas пишет:
>> >>>>
>> >>>> Hi Anton,
>> >>>>
>> >>>> I've cc'ed the link-grammar mailing list, because I describe below
>> some concepts for word-sense disambiguation. I'm also cc'ing the opencog
>> mailing list and ivan vodisek, because after studying hilbert systems, I
>> think he's ready to think about how knowledge extraction can be done
>> generically, and not just on language.
>> >>>>
>> >>>> -- Linas
>> >>>>
>> >>>> On Mon, Mar 25, 2019 at 1:39 AM Anton Kolonin @ Gmail <
>> [email protected]> wrote:
>> >>>>>
>> >>>>> Hi Linas,
>> >>>>>
>> >>>>> >I'd call it "interesting", but maybe not "golden"
>> >>>>>
>> >>>>> These are randomly selected sentences from "Gutenberg Children"
>> corpus:
>> >>>>>
>> >>>>>
>> http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/lower_LGEng_token/
>> >>>>>
>> >>>>> "Gutenberg Children silver standard" is LG-English parses:
>> >>>>>
>> >>>>>
>> http://langlearn.singularitynet.io/data/parses/English/Gutenberg-Children-Books/test/GCB-LG-English-clean.ull
>> >>>>>
>> >>>>> "Gutenberg Children gold standard" is subset of "silver standard"
>> with semi-random selection of sentences skipping direct speech and doing
>> manual verification of the links.
>> >>>>>
>> >>>>> So as long as we are training on "Gutenberg Children" corpus,
>> having the test on the same "Gutenberg Children" seems reasonable, right?
>> >>>>
>> >>>>
>> >>>> Yes. You still need to verify that each word in the "golden" corpus
>> occurs at least N=10 or 20 times in the training corpus. The dependency of
>> accuracy on N is not generally known, but it is very clear that if a word
>> occurs only N=3 times in the training corpus, then whatever is learned
>> about it will be very low quality.
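The verification step suggested here is mechanical to script: flag every word of the "golden" test corpus that is seen fewer than N times in the training corpus. A hedged sketch, in which the corpora and the N threshold are placeholders:

```python
from collections import Counter

def rare_test_words(train_tokens, test_tokens, n_min=10):
    """Return test-corpus words seen fewer than n_min times in training."""
    counts = Counter(train_tokens)
    return sorted({w for w in test_tokens if counts[w] < n_min})

# Toy corpora, not the Gutenberg Children data
train = "the dog saw the wood the dog saw some wood".split()
test = "the llama saw some wood".split()
print(rare_test_words(train, test, n_min=2))  # -> ['llama', 'some']
```

Running something like this against the golden corpus would quantify how much of the F1-score rests on words with too few observations to be learnable.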
>> >>>>
>> >>>>>
>> >>>>> But thanks, we may have put mire effort in removal of ancient
>> constructions and words even if these are present in the corpus.
>> >>>>
>> >>>> If you consistently train on 19th century literature, and then
>> evaluate 19th-century literature comprehension, that's fine.  Just don't
>> expect it to work for 21st century blog posts.
>> >>>>
>> >>>> The strongest effect will be the N=number of observations effect.
>> >>>>
>> >>>>>
>> >>>>> >Anyway -- you only indicate pair-wise word-links. Is the omission
>> of disjuncts intentional?
>> >>>>>
>> >>>>> If you have all links in the sentence, you can construct all of the
>> disjuncts with no ambiguity, correct?
>> >>>>
>> >>>> No, but only because you did not indicate the link-type.  The whole
>> point of a clustering step is to obtain a link-type; if you discard it, you
>> will never get better-than-MST results. The link-type is critical for
>> obtaining the word-classes.  The whole point of learning is to learn the
>> word-classes; you've learned very little, if you know only word-pairs.
>> >>>>
>> >>>> Consider this example:
>> >>>>
>> >>>> I saw wood
>> >>>> I saw some wood
>> >>>>
>> >>>> A solution that would be "almost perfect" (or "golden") would be
>> this:
>> >>>>
>> >>>> saw: {performer-of-actions}- & {sculptable-mass}+;
>> >>>> saw: {observer}-  & {viewable-thing}+;
>> >>>>
>> >>>> These disambiguate the two different senses of the word "saw".  It's
>> impossible to have word-sense disambiguation without actually having these
>> disjuncts.  The word-pairs alone are not sufficient to report the link-type
>> connecting the words.  Clustering gives the other dictionary entries:
>> >>>>
>> >>>> I: {performer-of-actions}+ or {observer}+;
>> >>>> wood: {sculptable-mass}- or ({quantity-determiner}- &
>> {viewable-thing}-);
>> >>>> some: {quantity-determiner}+;
>> >>>>
>> >>>> Thus, the pronoun "I" also belongs to two different word-sense
>> categories: performers and observers.  Compare to:
>> >>>>
>> >>>> "The chainsaw saws wood"  -- a "chainsaw" can be  a "performer of
>> actions" but cannot be an "observer".
>> >>>> "The dog saw some wood" -- dogs can be observers. They can perform
>> some actions; like run, jump, but they cannot saw, hammer, cut, stab.
>> >>>>
>> >>>> The link-type is absolutely crucial to understanding a word.  The
>> language-learning project is all about learning the link-types. Without
>> correct link-type assignments, you cannot have correct parses.
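Given links that do carry types, assembling a per-word disjunct is mechanical; it is the types themselves that must be learned, via clustering. A toy sketch, reusing the illustrative link-type names from the example above (the connector-ordering convention here is simplified, not Link Grammar's exact one):

```python
def disjuncts(n_words, typed_links):
    """typed_links: tuples (i, j, type) with i < j.  Returns, per word, a
    list of connectors: 'T-' = link to the left, 'T+' = link to the right."""
    d = [[] for _ in range(n_words)]
    for i, j, t in sorted(typed_links):
        d[i].append(t + "+")  # left word connects rightward
        d[j].append(t + "-")  # right word connects leftward
    return d

# "I saw wood": links I<->saw (observer), saw<->wood (viewable-thing)
links = [(0, 1, "observer"), (1, 2, "viewable-thing")]
print(disjuncts(3, links))
# -> [['observer+'], ['observer-', 'viewable-thing+'], ['viewable-thing-']]
```

Strip the types and every word's disjunct collapses to bare left/right slots, which is why untyped word pairs alone cannot disambiguate the two senses of "saw".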
>> >>>>
>> >>>> ... which is 100% of the problem with MST.  The problem with MST is
>> not so much that "it's not accurate" -- sure, it is not terribly accurate. But
>> even if MST or some MST-replacement were 100% accurate, it would still be
>> "wrong" because it fails to indicate the link-type.  If you want to
>> understand a sentence, you MUST know the link-types!
>> >>>>
>> >>>> Otherwise, you just have "green ideas sleep furiously", which
>> parses, but only because the link types have been erased, or made stupid.
>> Here's a stupid grammar:
>> >>>>
>> >>>> ideas:  {adjective}- & {verb}+;
>> >>>> green: {adjective}+;
>> >>>>
>> >>>> which allows "green ideas" to parse.  But of course, this is wrong;
>> it should have been:
>> >>>>
>> >>>> ideas: {noospheric-modifier}- & {concept-manipulating-verb}+;
>> >>>> green: {physical-object-modifier}+;
>> >>>>
>> >>>> and now it is clear that "green ideas" cannot parse, because the
>> link-types clash.
>> >>>>
>> >>>> * If you cluster down to 5 or 6 clusters (adjective, verb, noun ...)
>> you will get very low quality grammars.
>> >>>>
>> >>>> * If you cluster to 200 or 300 clusters, you get sort-of-OK
>> grammars. This is what deep-learning/neural-nets do: this is why the
>> deep-learning systems seem to give nice results: 200 or 300 features is
>> enough to start having adequate functional distinctions (e.g. the famous
>> "king - male+female=queen" example, or "paris-france+germany=berlin"
>> example)
>> >>>>
>> >>>> * If you cluster to 3K to 8K clusters, you start having a quite
>> decent model of language
>> >>>>
>> >>>> * Note that wordnet has 117K "synsets".
>> >>>>
>> >>>> Note that in the above example:
>> >>>> wood: {sculptable-mass}- or ({quantity-determiner}- &
>> {viewable-thing}-);
>> >>>>
>> >>>> the things in the curly-braces are effectively "synsets".
>> >>>>
>> >>>> The next set of goal-posts is to have disjuncts, of maybe low-medium
>> quality, and use these to extract ontologies.  e.g.
>> >>>> {sculptable-mass} is-a {mass} is-a {physical-thing} is-a {thing}
>> >>>>
>> >>>> You can try to do this by clustering but there are probably better
>> ways of discovering ontology.
>> >>>>
>> >>>>
>> >>>>>
>> >>>>> >Also -- no hint of any word-classes or part-of-speech tagging?
>> This is surely important to evaluate as well, or is this to be done in some
>> other way?  i.e. to evaluate if "Pivi" was correctly clustered with other
>> given names?  Or that lama/llama was clustered with other four-legged
>> animals?
>> >>>>>
>> >>>>> We don't have that in MST-Parsing, right? We need this corpus to
>> assess the quality of the MST-Parsing so we don't need part-of-speech
>> information for that.
>> >>>>
>> >>>> But we know that MST parsing is shit.  Stop wasting time on MST or
>> trying to "improve" it. We already know that it is close to a high-entropy
>> path to structure; trying to squeeze a few more percent of entropy is not
>> worth the effort, not at this time.  Focus on finding a high-entropy
>> structure extraction algorithm, don't waste time on MST.
>> >>>>
>> >>>> You should be focusing on extracting disjuncts, word-classes,
>> word-senses, and trying to improve the quality of those.  If you obtain a
>> high-entropy path to these structures, the quality of your parses will
>> automatically improve.  Focus on the entropy numbers. Try to maximize that.
>> >>>>
>> >>>>> The clustering is able to do that anyway -- see the graphs at the
>> end of last year's report:
>> >>>>>
>> >>>>>
>> https://docs.google.com/document/d/1gxl-hIqPQCYPb9NNkyA3sBYUyfwvJFvT1hZ5ZpXsaPc/edit#heading=h.twoiv52o0tou
>> >>>>>
>> >>>>> >Also -- I can't tell -- is it free of loops, or are loops
>> allowed?  Allowing loops tends to provide stronger, more accurate parses.
>> Loops act as constraints.
>> >>>>>
>> >>>>> Loops and crossing links are not allowed in the MST-Parser now.
>> If we allow them in the test corpus, how would that make the assessment of
>> MST-parses better?
>> >>>>>
>> >>>>> Note that we ARE working with MST-parses now, according to Ben's
>> directions.
>> >>>>
>> >>>>
>> >>>> Not to say bad things about Ben, but I'm certain he has not actually
>> thought about this problem very much. He is very very busy doing other
>> things; he is not thinking about this stuff.  I have repeatedly tried to
>> explain the issues to him, and it's quite clear that he is far away from
>> understanding them, from working at the level that I would like to have you
>> and your team work at.
>> >>>>
>> >>>> I'm trying to have you make small, quantified baby-steps, to verify
>> the accuracy of your methods and data.  What I'm seeing is that you are
>> attempting to make giant-steps, without verification, and then getting
>> low-quality results, without understanding the root causes for them.  You
>> can't dig yourself out of a ditch, and digging harder and more furiously
>> won't raise the accuracy of the parse results.
>> >>>>
>> >>>> --linas
>> >>>>
>> >>>>> We have your MST-Parser-less idea on the map but we are NOT trying
>> it now:
>> >>>>>
>> >>>>> https://github.com/singnet/language-learning/issues/170
>> >>>>>
>> >>>>> We may try it after we explore accounting for costs
>> >>>>>
>> >>>>> https://github.com/singnet/language-learning/issues/183
>> >>>>>
>> >>>>> Thanks,
>> >>>>>
>> >>>>> -Anton
>> >>>>>
>> >>>>> 24.03.2019 9:24, Linas Vepstas пишет:
>> >>>>>
>> >>>>> Also, BTW, link-grammar cannot parse "I just stood there, my hand
>> on the knob, trembling like a leaf." correctly. It is one of a class of
>> sentences it does not know about.  Which is maybe OK, because ideally, the
>> learned grammar will be able to do this. But today, LG cannot.
>> >>>>>
>> >>>>> --linas
>> >>>>>
>> >>>>> On Sat, Mar 23, 2019 at 9:12 PM Linas Vepstas <
>> [email protected]> wrote:
>> >>>>>>
>> >>>>>> Anton,
>> >>>>>>
>> >>>>>> It's certainly an unusual corpus, and it might give you rather low
>> scores. I'd call it "interesting", but maybe not "golden". Although I
>> suppose it depends on your training corpus.  Here are some problems that
>> pop out:
>> >>>>>>
>> >>>>>> First sentence --
>> >>>>>> "the old beast was whinnying on his shoulder" -- the word
>> "whinnying" is a fairly rare English verb -- you could read half-a-million
>> wikipedia articles, and not see it once. You could read lots of
>> 19th-century or early-20th-century cowboy/adventure novels (like what
>> you'd find on Project Gutenberg) and maybe see it a fair amount. Even
>> then -- to "whinny on a shoulder" seems bizarre. I guess he's hugging the
>> horse? How often does that happen, in any cowboy novel? "to whinny on
>> something" is an extremely rare construction.  It will work only if you've
>> correctly categorized "whinny" as a verb that can take a preposition.  Are
>> your clustering algos that good, yet, to correctly cluster rare words into
>> appropriate verb categories?
>> >>>>>>
>> >>>>>> Second sentence .. "Jims" is a very uncommon name. Frankly, I've
>> never heard of it as a name before.  Your training data is going to be
>> extremely slim on this. And lack of training data means poor statistics,
>> which means low scores.  Unless -- again, your clustering code is good
>> enough to place "Jims" in a "proper name" cluster...
>> >>>>>>
>> >>>>>> "the lama snuffed blandly" -- "snuffed" is a very uncommon, almost
>> archaic verb. These days, everyone spells llama with two l's, not one.
>> Unless you're talking about Buddhist monks, it's a typo.
>> >>>>>>
>> >>>>>> "you understand?"  is .. awkward. Common in speech, uncommon in
>> writing. Unlikely that you'll have enough training data for this.
>> >>>>>>
>> >>>>>> "Willard" is an uncommon name. Does your training corpus have a
>> sufficient number of mentions of Willard? Do you have clustering working
>> well enough to stick "Willard" into a cluster with other names?
>> >>>>>>
>> >>>>>> "it is so with Sammy Jay" is clearly archaic English.
>> >>>>>>
>> >>>>>> "he hasn't any relations here" is clearly archaic, an
>> olde-fashioned construction.
>> >>>>>>
>> >>>>>> "Pivi said not one word" - again, a clearly old-fashioned
>> construction. Does the training set contain enough examples of "Pivi" to
>> recognize it as a name? Are names clustering correctly?
>> >>>>>>
>> >>>>>> Any sentence with an inversion is going to sound old-fashioned.
>> All of the sentences in that corpus sound old-fashioned. Which maybe is OK
>> if you are training on 19th-century Gutenberg texts, but it's certainly
>> not modern English.  Even when I was a child, reading those old
>> crumbly-yellow paper adventure books, part of the fun was that no one
>> actually talked that way -- not at school, not at home, not on TV. It was
>> clearly from a different time and place -- an adventure.
>> >>>>>>
>> >>>>>> Anyway -- you only indicate pair-wise word-links. Is the omission
>> of disjuncts intentional? Also -- no hint of any word-classes or
>> part-of-speech tagging? This is surely important to evaluate as well, or is
>> this to be done in some other way?  i.e. to evaluate if "Pivi" was
>> correctly clustered with other given names?  Or that lama/llama was
>> clustered with other four-legged animals?
>> >>>>>>
>> >>>>>> Also -- I can't tell -- is it free of loops, or are loops
>> allowed?  Allowing loops tends to provide stronger, more accurate parses.
>> Loops act as constraints.
>> >>>>>>
>> >>>>>> -- Linas
>> >>>>>>
>> >>>>>> On Thu, Mar 21, 2019 at 11:09 PM Anton Kolonin @ Gmail <
>> [email protected]> wrote:
>> >>>>>>>
>> >>>>>>> Hi Linas, Andes and whoever understands LG and English well
>> enough both.
>> >>>>>>>
>> >>>>>>> Attached are the first 100 sentences for the GC "gold standard",
>> manually checked based on LG parses.
>> >>>>>>>
>> >>>>>>> We are expecting more to come in the next two weeks.
>> >>>>>>>
>> >>>>>>> To enable that, please have a cursory review of the corpus and let
>> us know if corrections are still needed, so your corrections can be
>> used as a reference to fix the rest and keep going.
>> >>>>>>>
>> >>>>>>> Thank you,
>> >>>>>>>
>> >>>>>>> -Anton
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> --
>> >>>>>>> You received this message because you are subscribed to the
>> Google Groups "lang-learn" group.
>> >>>>>>> To unsubscribe from this group and stop receiving emails from it,
>> send an email to [email protected].
>> >>>>>>> To post to this group, send email to [email protected].
>> >>>>>>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/lang-learn/bde76364-a578-4ab8-8ac5-2f49f794072b%40gmail.com
>> .
>> >>>>>>> For more options, visit https://groups.google.com/d/optout.
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> --
>> >>>>>> cassette tapes - analog TV - film cameras - you
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> --
>> >>>>> -Anton Kolonin
>> >>>>> skype: akolonin
>> >>>>> cell: +79139250058
>> >>>>> [email protected]
>> >>>>> https://aigents.com
>> >>>>> https://www.youtube.com/aigents
>> >>>>> https://www.facebook.com/aigents
>> >>>>> https://medium.com/@aigents
>> >>>>> https://steemit.com/@aigents
>> >>>>> https://golos.blog/@aigents
>> >>>>> https://vk.com/aigents
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> -Anton Kolonin
>> >>>> skype: akolonin
>> >>>> cell: +79139250058
>> >>>> [email protected]
>> >>>> https://aigents.com
>> >>>> https://www.youtube.com/aigents
>> >>>> https://www.facebook.com/aigents
>> >>>> https://medium.com/@aigents
>> >>>> https://steemit.com/@aigents
>> >>>> https://golos.blog/@aigents
>> >>>> https://vk.com/aigents
>> >>
>> >>
>> >>
>> >
>> >
>> >
>>
>>
>>
>> --
>> Ben Goertzel, PhD
>> http://goertzel.org
>>
>> "Listen: This world is the lunatic's sphere,  /  Don't always agree
>> it's real.  /  Even with my feet upon it / And the postman knowing my
>> door / My address is somewhere else." -- Hafiz
>>
>
>
>
>
>

