[opencog-dev] Re: 100 sentences for GC

Linas Vepstas Tue, 02 Apr 2019 10:35:18 -0700

Anton, that paper does not address, answer, talk about or mention any of
the questions I have posed to you. I do not understand why we are feuding
about this all the time. --linas


On Tue, Apr 2, 2019 at 12:32 PM Anton Kolonin @ Gmail <[email protected]>
wrote:

> >I don't understand what that "something"
>
> Hi Linas, last year's paper is here
>
> http://langlearn.singularitynet.io/data/docs/
>
> this year's paper draft is attached.
>
> Cheers,
>
> -Anton
> 03.04.2019 0:09, Linas Vepstas wrote:
>
>
>
> On Tue, Apr 2, 2019 at 12:57 AM Anton Kolonin @ Gmail <[email protected]>
> wrote:
>
>> Hi Linas,
>>
>> Are you saying that "while the ULL team has found a strong linear correlation
>> between A) quality (F1) on input parses and B) quality (F1) of the output
>> parses based on the grammar learned from the input parses, this phenomenon
>> is due to the fact that they test on the entire input corpus, so this
>> phenomenon should go away once they test on a gold standard corpus consisting
>> only of sentences with high-frequency words"?
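For reference, a minimal sketch of how a link-level F1 score of the kind discussed above could be computed. It assumes each parse is given as a collection of (left-index, right-index) word links with no link-types; the name `link_f1` and this representation are illustrative assumptions, not the ULL team's actual evaluation code.

```python
def link_f1(predicted, gold):
    """Return (precision, recall, F1) for two collections of undirected word links."""
    # Normalize each link so that (2, 0) and (0, 2) count as the same link.
    pred = {tuple(sorted(link)) for link in predicted}
    ref = {tuple(sorted(link)) for link in gold}
    if not pred or not ref:
        return 0.0, 0.0, 0.0
    hits = len(pred & ref)
    precision = hits / len(pred)
    recall = hits / len(ref)
    if hits == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy example: a 4-word sentence, one of three links differs from the reference.
gold_links = [(0, 1), (1, 3), (2, 3)]
parsed_links = [(0, 1), (1, 2), (2, 3)]
precision, recall, f1 = link_f1(parsed_links, gold_links)
```

Because the links are compared as bare index pairs, a score computed this way says nothing about whether link-types or disjuncts were learned correctly.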
>>
>
> I am saying that I have not seen any evidence at all that you actually
> constructed or counted disjuncts, or that you clustered disjuncts, or that
> you controlled or managed counting in any way.
>
> So -- you did something ... but I don't understand what that "something"
> is, and, based on these conversations, that "something" does not match up
> with what I had hoped that you would be doing.
>
> It's not just high-frequency words.  It's also how you perform clustering.
> Are you using MI for that? or cosines for that?  Are you handling
> word-sense disambiguation, or not? How did you handle WSD? Through
> orthogonalization of cosines? Through maximization of MI? By computing a
> Markov vector? Some other way?  Did you perform any data cuts during
> orthogonalization/maximization? What kind of cuts were they? How do the
> cuts affect the F1-score?
>
> All of these things deserve "instrumental verification". Without them, I
> don't know how to assign any meaning to F1-scores (or ROC curves, which you
> haven't shown - and even if you did show them, I would not know what they
> mean, until the above questions are resolved.)
>
> So I've got this ball of questions, and I'm getting unclear, confusing
> answers to them.
>
> -- Linas
>
> Best regards,
>>
>> -Anton
>>
>>
>> 02.04.2019 5:38, Linas Vepstas wrote:
>>
>> OK, There's clearly a lot of work happening in linguistics these days,
>> that I have fallen behind on reading.
>>
>> The nature of the conversations here has been frustrating, because so
>> far, it sounds like an attempt to evade the  "central limit theorem" --
>> https://en.wikipedia.org/wiki/Central_limit_theorem
>>
>> There are two related ideas I'm trying to get across: one is that if you
>> make enough observations of a phenomenon, eventually, the central-limit
>> theorem kicks in, and smooths over random variations.  Specifically, I
>> claim that, despite MST being imperfect, a large number of observations
>> should smooth over the imperfections. I believe this to be true, (but I
>> could be wrong).
>>
>> The other idea is that the golden test corpus must avoid accidentally
>> testing disjuncts far away from the central limit -- to avoid, as it were,
>> making statements analogous to "Well, I flipped the coin three times, and I
>> did not get 50-50 odds, therefore the theory doesn't work". You have to
>> flip the coin at least N times, for some large N.  Here, for MST, we don't
>> know how big N has to be, and we don't have a good plan for determining N.
>> It's worse, because everything is Zipfian, aka 1/f noise. It is possible that
>> BERT or other approaches allow smaller values of N to work, but this is
>> also not clear.
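To put a number on the coin-flip analogy, here is a small sketch. It assumes independent flips of a fair coin, which is the textbook case and not a model of MST link counts; the function name `standard_error` is illustrative. The spread of the observed fraction of heads shrinks like 1/sqrt(N), so three flips tell you almost nothing.

```python
import math


def standard_error(n_flips, p_heads=0.5):
    """Standard deviation of the observed fraction of heads after n_flips."""
    return math.sqrt(p_heads * (1.0 - p_heads) / n_flips)


for n in (3, 100, 10_000):
    spread = standard_error(n)
    print(f"N={n:>6}: observed fraction of heads is 0.5 +/- about {spread:.4f}")
```

With a Zipfian word distribution, most words in a corpus sit at the small-N end of this scale, which is why the required N is hard to pin down.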
>>
>> It's also not clear that BERT would converge to a different limit than MST
>> - the central-limit theorem says there is only one limit -- not two. But
>> perhaps I'm misapplying it, perhaps I'm neglecting some important effect.
>> Without measurements, it's hard to guess what that effect is (if it even
>> exists).
>>
>> Anyway, I have a backlog of half-a-dozen important unread papers, so I'll
>> try to get around to that "real soon now".
>>
>> --linas
>>
>>
>>
>> On Mon, Apr 1, 2019 at 12:15 AM Ben Goertzel <[email protected]> wrote:
>>
>>> "Replacing MST by DNN/BERT" is a strange way to put it...
>>>
>>> DNN/BERT builds a pretty complex and comprehensive language model,
>>> much beyond what is done by calculation of MI values and similar
>>>
>>> The extraction of a parse dag satisfying syntactic constraints (no
>>> links cross, covering all words in the sentence, connected graph) is a
>>> conceptually simple step, and nobody is spending much time on this
>>> step indeed...
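A minimal sketch of the three constraints just named (no crossing links, all words covered, connected graph). It assumes a parse is a list of (i, j) index pairs over word positions 0..n_words-1, all in range; `is_valid_parse` is an illustrative name, not a function from any of the projects discussed here.

```python
def is_valid_parse(n_words, links):
    """Check that links over words 0..n_words-1 are non-crossing and connect every word."""
    if n_words == 0:
        return False
    edges = {tuple(sorted(link)) for link in links}
    # Constraint 1: no two links may cross when drawn above the sentence.
    for a, b in edges:
        for c, d in edges:
            if a < c < b < d:
                return False
    # Constraints 2 and 3: every word must be reachable from word 0,
    # which means all words are covered and the graph is connected.
    neighbours = {i: set() for i in range(n_words)}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, stack = {0}, [0]
    while stack:
        for nxt in neighbours[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return len(seen) == n_words
```

For example, `is_valid_parse(4, [(0, 1), (1, 3), (2, 3)])` holds, while replacing the first link with `(0, 2)` makes it cross `(1, 3)` and the check fails.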
>>>
>>> The question of how to assign a quantitative weight to the relation
>>> between two word-instances in a sentence, taking into account the specific
>>> context in that sentence, but also the history of co-utilization of
>>> those words (or other similar words), is less conceptually simple and
>>> this is one place I think DNN language models can help
>>>
>>> Using MST or similar parsing based on numbers exported from DNN
>>> language models is one way of extracting symbolic-ish structured
>>> knowledge from these big messy subsymbolic probabilistic language
>>> models...
>>>
>>> The DNNs in use now like BERT do not really satisfy me on a
>>> theoretical or conceptual level, but they have been tuned to work
>>> pretty nicely and they have been implemented pretty efficiently on
>>> multi-GPU hardware -- so, given this and given the quality of the
>>> recent practical results obtained with them -- I consider it well
>>> worth exploring how to use them as tools in our pursuits for grammar
>>> and semantics learning
>>>
>>> -- Ben
>>>
>>> On Mon, Apr 1, 2019 at 2:07 PM Linas Vepstas <[email protected]>
>>> wrote:
>>> >
>>> >
>>> >
>>> > On Sun, Mar 31, 2019 at 10:51 PM Anton Kolonin @ Gmail <
>>> [email protected]> wrote:
>>> >>
>>> >> Hi Linas, I like this thread more and more :-)
>>> >
>>> > I don't. I use a lot of CAPITALIZED WORDS below.  There is a deep and
>>> dark fundamental misunderstanding, and I am sometimes at wit's end trying to
>>> figure out why, and how to explain things in an understandable fashion.
>>> >>
>>> >> >But somehow, I suspect... Isn't this why OpenCog has "unified rule
>>> engine" (URE) instead of link grammar at its core,
>>> >>
>>> >> Linas, the approach to the "extraction of phrasemes" goal was
>>> discussed exactly in terms of MST->GL->URE in last fall's Hong Kong
>>> discussion:
>>> https://docs.google.com/document/d/13YyqtGud0GAbVaFcc94kAd2LhGf7jTr5XDYgiuC294c/edit
>>> >>
>>> >> That is:
>>> >>
>>> >> 1) Do MST-parsing to get word links proto-disjuncts
>>> >>
>>> >> 2) Do Grammar Learning to cluster and conclude word categories and
>>> rules with disjuncts
>>> >>
>>> >> 3) Do URE-kind-of-thing to build the rules into "phrasemes" or
>>> "sections" or "patterns".
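Step 1 of the list above can be sketched as a maximum spanning tree over pairwise word scores. This is a simplified illustration using Prim's algorithm: the scores are made-up numbers standing in for something like mutual information, the function name is hypothetical, and the sketch does not enforce the no-crossing-links constraint that the actual MST-Parser applies.

```python
def maximum_spanning_tree(n_words, score):
    """Prim's algorithm: return links (i, j), i < j, maximizing the total pair score."""
    in_tree = {0}
    links = []
    while len(in_tree) < n_words:
        # Pick the best-scoring link from a word inside the tree to a word outside it.
        _, left, right = max(
            (score(min(i, j), max(i, j)), min(i, j), max(i, j))
            for i in in_tree
            for j in range(n_words)
            if j not in in_tree
        )
        links.append((left, right))
        in_tree.update((left, right))
    return sorted(links)


# Hypothetical pairwise scores for the sentence "I saw some wood".
words = ["I", "saw", "some", "wood"]
pair_score = {
    (0, 1): 3.0, (1, 3): 2.5, (2, 3): 2.0,
    (0, 3): 0.5, (1, 2): 0.4, (0, 2): 0.1,
}
tree = maximum_spanning_tree(len(words), lambda i, j: pair_score[(i, j)])
```

The result is only a set of untyped word pairs; steps 2 and 3 are where categories, disjuncts and larger patterns would come from.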
>>> >
>>> > Yes.
>>> >>
>>> >> However, your current discourse and our current results just show
>>> that "no one is able to do reasonable MST-parsing" so the above is just
>>> a waste of time, correct?
>>> >
>>> > No. Very much no.  I'm saying the opposite of that. You can replace
>>> MST by almost *ANYTHING* else, and the quality of your results WILL NOT
>>> CHANGE!
>>> >
>>> > If the quality of your results depends on the quality of MST, you are
>>> DOING SOMETHING WRONG!
>>> >
>>> > I'm utterly flabbergasted. I don't know how many more times I can say
>>> this: stop wasting time on this unimportant step!
>>> >>
>>> >> At the time we speak, Ben, Alexely, Sergey and Asuares are trying to
>>> use DNN/BERT magic to do the trick 1.
>>> >
>>> > I want to call this "a complete waste of time". It will almost surely
>>> not improve the quality of the results!  I don't understand why four smart
>>> people think that replacing MST by BERT will make any difference at all!
>>> It should not matter!  Nothing depends on this step! Anything at all,
>>> anything with a probability better than random chance, is sufficient!  Why
>>> isn't this obvious?
>>> >
>>> > If Ben is reading this: I recall talking to Ben about this in an
>>> ice-cream shop in Berlin, for an AGI conference, and he seemed to
>>> understand back then.  I have no idea why he changed his mind.  I really do
>>> not understand why everyone spends so much time obsessing about MST. Is
>>> this a "color of the bike shed" problem?
>>> https://en.wikipedia.org/wiki/Law_of_triviality
>>> >
>>> > MST-vs.-BERT==color-of-bike-shed
>>> >
>>> > Just use MST. It's simple. It works. It gives good results.  Stop
>>> trying to improve it.  The interesting problems are elsewhere!  Just use
>>> MST, and move on to the good stuff!
>>> >>
>>> >> To my mind, that becomes possible only if the DNN/BERT magic does the
>>> trick with steps 2 and 3 done under the hood. If this is done, in
>>> such case, we don't need to do 2 and 3 after we have the DNN/BERT-based
>>> model, because we can simply "milk" the grammar rules out of the DNN/BERT
>>> mycelium for that. And we don't need the ULL as well by the way, because we
>>> just need DNN/BERT and rows of different sorts of milk machines around it.
>>> >
>>> > So why are you bothering to work on ULL?
>>> >>
>>> >> So, instead of solving the problem of constructing the pipeline for
>>> learning grammar from raw text we need to solve the problem of milking the
>>> grammar out of DNN/BERT model trained on these texts, right?
>>> >
>>> > Because I don't think that you know how to milk lexical functions out
>>> of DNN/BERT -- We've wasted more than a year talking about MST.  Instead of
>>> endlessly talking about MST, you could have  JUST USED IT, WITHOUT ANY
>>> MODIFICATIONS, gotten good results, and spent the year working on something
>>> interesting!
>>> >
>>> > Again: replacing MST by DNN/BERT or by something else will NOT IMPROVE
>>> the accuracy!  You'll have exactly the same accuracy as before, and if your
>>> accuracy improves, it is because you are doing something wrong!
>>> >
>>> >> However, either way, we need to understand algorithmic machinery of
>>> how the links assemble in disjuncts and disjuncts assemble into sections,
>>> through the universe-scale combinatorial explosion.
>>> >
>>> > No. That is the OPPOSITE of what ACTUALLY HAPPENS!!!!
>>> >>
>>> >> And I agree that clustering and categorizing words and links (and then
>>> disjuncts and sections, right) is part of the process - explicitly in ULL
>>> pipeline or implicitly deep in DNN/BERT darkness.
>>> >
>>> > It is NOT DEEP AND DARK.  I wrote not one but TWO PAPERS on this,
>>> CASTING LIGHT ON THAT DARKNESS
>>> >
>>> > I'm frustrated to the 43rd degree on why I cannot seem to have a
>>> reasonable conversation with any other human being about any of this.
>>> >
>>> > -- Linas
>>> >
>>> >> Cheers,
>>> >>
>>> >> -Anton
>>> >>
>>> >>
>>> >> 01.04.2019 9:17, Linas Vepstas:
>>> >>
>>> >>
>>> >>
>>> >> On Thu, Mar 28, 2019 at 10:22 AM Ivan V. <[email protected]>
>>> wrote:
>>> >>>
>>> >>> Linas Vepstas wrote:
>>> >>>
>>> >>> >... knowledge extraction can be done generically, and not just on
>>> language.
>>> >>>
>>> >>> If link grammar would be Turing complete, this might be possible
>>> right away.
>>> >>
>>> >>
>>> >> In my experience, thinking about Turing completeness is unproductive
>>> and a distraction.
>>> >>
>>> >>> But somehow, I suspect... Isn't this why OpenCog has "unified rule
>>> engine" (URE) instead of link grammar at its core,
>>> >>
>>> >>
>>> >> No. It has the rule-engine because back then, I did not understand
>>> sheaves.  I'm starting to think that the rule engine is a strategic
>>> mistake. The original idea is that rule-application is the main conceptual
>>> abstraction of term-rewriting.  One rewrites, or proves theorems by
>>> applying sequences of rules.  It turns out that discovering the right
>>> sequence is hard. Finding correct long sequences is hard - a combinatorial
>>> explosion.
>>> >>
>>> >> The openpsi system addresses some of these issues. Unfortunately,
>>> its current implementation is a tangle of rule-selection mechanisms, and
>>> theories of human psychology. It's probably better than the URE, but is
>>> currently not as powerful.
>>> >>
>>> >> I'm trying to place a theory of sheaves as a replacement for URE, and
>>> as the natural generalization of openpsi, but I've successfully
>>> self-sabotaged myself in these efforts.
>>> >>
>>> >>>
>>> >>> and with URE things get much more complicated. I'm sorry, but that
>>> is still a Gordian knot to me, considering all of my modest knowledge.
>>> >>
>>> >>
>>> >> We all have modest knowledge. That is the nature of the human
>>> condition.
>>> >>
>>> >>>
>>> >>> On the other hand, if someone really smart would provide automatic
>>> grammar extraction by means of unrestricted grammar, I believe that would
>>> be it.
>>> >>
>>> >>
>>> >> Yes, that is the goal of the language-learning project.  However, as
>>> noted in my last email (on the link-grammar list) it is not enough to just
>>> learn a semi-Thue system, declare victory, and go home.  The example I gave
>>> there:
>>> >>
>>> >>   "I think that you should give that car a second look"
>>> >>   "you should really give that song a second listen"
>>> >>   "maybe you should give Sue a second chance".
>>> >>
>>> >> Learning to parse these "set phrases" or phrasemes is equivalent to
>>> learning a semi-Thue system; however, it's not enough to realize that all
>>> three are forms of advice-giving, having "conserved" or "fixed" regions "x
>>> YOU SHOULD y GIVE z SECOND w" where z is very highly variable having
>>> millions of variations, and w only has a few dozen allowed variations.
>>> Note that the words "fixed", "conserved", "variable" are words used in
>>> genetics and proteomics and antibody structure. It's the same idea.
>>> >>
>>> >> The goal of learning lexical functions (LF's) is to learn that all
>>> three are advice-giving forms, and also to learn what is, and what can be
>>> plugged in for x,y,z,w.   So, although a super-whiz-bang grammar learner
>>> capable of learning context-sensitive languages should be able to learn "x
>>> YOU SHOULD y GIVE z SECOND w", it still will not know the *meaning* of this
>>> phrase.  To know the *meaning*, you have to know the acceptable ranges (as
>>> fuzzy-sets) of x,y,z,w.
>>> >>
>>> >> To conclude, thinking about Turing-completeness is a waste of time,
>>> because Turing completeness only tells you that "x YOU SHOULD y GIVE z
>>> SECOND w" is recursively enumerable; it does not tell you what it actually
>>> means.
>>> >>
>>> >> Put another way:  having a universal Turing machine is not the same
>>> as knowing how some particular program works. Automagically learning a
>>> context-sensitive grammar is not enough to know what that grammar is
>>> "saying/doing".
>>> >>
>>> >> -- Linas
>>> >>
>>> >>>
>>> >>>
>>> >>> Thank you,
>>> >>> Ivan V.
>>> >>>
>>> >>>
>>> >>> Thu, 28 Mar 2019 at 07:58 Anton Kolonin @ Gmail <[email protected]>
>>> wrote:
>>> >>>>
>>> >>>> Ben, Linas,
>>> >>>>
>>> >>>> >But we know that MST parsing is shit.  Stop wasting time on MST or
>>> trying to "improve" it.
>>> >>>>
>>> >>>> I think that sounds like a kind of support for the concept of "dumb
>>> explosive parsing" that was advocated 1+ year ago:
>>> >>>>
>>> >>>>
>>> https://docs.google.com/document/d/14MpKLH5_5eVI39PRZuWLZHa1aUS73pJZNZzgigCWwWg/edit#heading=h.aqo9bumb3doy
>>> >>>>
>>> >>>> I also agree with Linas's other reasoning in this thread. I would
>>> consider giving it a try starting next month if we don't have a
>>> breakthrough with DNN-MI-milking-based-MST-Parsing by that time.
>>> >>>>
>>> >>>> > can be done generically, and not just on language
>>> >>>>
>>> >>>> I think everyone in bio-informatics dreams of extracting secrets of
>>> "dark side of the genome" with something like that ;-)
>>> >>>>
>>> >>>> Cheers,
>>> >>>>
>>> >>>> -Anton
>>> >>>>
>>> >>>>
>>> >>>> 28.03.2019 1:24, Linas Vepstas wrote:
>>> >>>>
>>> >>>> Hi Anton,
>>> >>>>
>>> >>>> I've cc'ed the link-grammar mailing list, because I describe below
>>> some concepts for word-sense disambiguation. I'm also cc'ing the opencog
>>> mailing list and Ivan Vodisek, because after studying Hilbert systems, I
>>> think he's ready to think about how knowledge extraction can be done
>>> generically, and not just on language.
>>> >>>>
>>> >>>> -- Linas
>>> >>>>
>>> >>>> On Mon, Mar 25, 2019 at 1:39 AM Anton Kolonin @ Gmail <
>>> [email protected]> wrote:
>>> >>>>>
>>> >>>>> Hi Linas,
>>> >>>>>
>>> >>>>> >I'd call it "interesting", but maybe not "golden"
>>> >>>>>
>>> >>>>> These are randomly selected sentences from "Gutenberg Children"
>>> corpus:
>>> >>>>>
>>> >>>>>
>>> http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/lower_LGEng_token/
>>> >>>>>
>>> >>>>> "Gutenberg Children silver standard" is LG-English parses:
>>> >>>>>
>>> >>>>>
>>> http://langlearn.singularitynet.io/data/parses/English/Gutenberg-Children-Books/test/GCB-LG-English-clean.ull
>>> >>>>>
>>> >>>>> "Gutenberg Children gold standard" is subset of "silver standard"
>>> with semi-random selection of sentences skipping direct speech and doing
>>> manual verification of the links.
>>> >>>>>
>>> >>>>> So as long as we are training on "Gutenberg Children" corpus,
>>> having the test on the same "Gutenberg Children" seems reasonable, right?
>>> >>>>
>>> >>>>
>>> >>>> Yes. You still need to verify that each word in the "golden" corpus
>>> occurs at least N=10 or 20 times in the training corpus. The dependency of
>>> accuracy on N is not generally known, but it is very clear that if a word
>>> occurs only N=3 times in the training corpus, then whatever is learned
>>> about it will be very low quality.
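A minimal sketch of the check described above: keep a test sentence only if every one of its words was seen at least N times in training. It assumes lower-cased, whitespace-tokenized sentences (as in the cleaned corpus); the function name and the toy data are illustrative only.

```python
from collections import Counter


def sentences_with_enough_support(training_sentences, test_sentences, min_count=10):
    """Split test sentences into those fully supported by training counts and the rest."""
    counts = Counter(word for sentence in training_sentences for word in sentence.split())
    kept, rejected = [], []
    for sentence in test_sentences:
        # Words seen fewer than min_count times in training (including never).
        rare = [w for w in sentence.split() if counts[w] < min_count]
        if rare:
            rejected.append((sentence, rare))
        else:
            kept.append(sentence)
    return kept, rejected


# Toy data: "lama" and "snuffed" occur only once in training.
training = ["the dog saw some wood"] * 10 + ["the lama snuffed blandly"]
kept, rejected = sentences_with_enough_support(
    training, ["the dog saw wood", "the lama snuffed"], min_count=10
)
```

The rejected list records which words were too rare, which makes it easy to see how the size of the usable test set changes as N is raised from 3 to 10 or 20.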
>>> >>>>
>>> >>>>>
>>> >>>>> But thanks, we may have put more effort into removal of ancient
>>> constructions and words even if these are present in the corpus.
>>> >>>>
>>> >>>> If you consistently train on 19th century literature, and then
>>> evaluate 19th-century literature comprehension, that's fine.  Just don't
>>> expect it to work for 21st century blog posts.
>>> >>>>
>>> >>>> The strongest effect will be the N=number of observations effect.
>>> >>>>
>>> >>>>>
>>> >>>>> >Anyway -- you only indicate pair-wise word-links. Is the omission
>>> of disjuncts intentional?
>>> >>>>>
>>> >>>>> If you have all links in the sentence, you can construct all of
>>> the disjuncts with no ambiguity, correct?
>>> >>>>
>>> >>>> No, but only because you did not indicate the link-type.  The whole
>>> point of a clustering step is to obtain a link-type; if you discard it, you
>>> will never get  better-than-MST results. The link-type is critical for
>>> obtaining the word-classes.  The whole point of learning is to learn the
>>> word-classes; you've learned very little, if you know only word-pairs.
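To make the point concrete, here is a sketch of what can be built from bare word-pair links. It assumes links are (i, j) index pairs and names each connector after the word at the other end, since no link-type is available; the function name is hypothetical and the connector ordering is not meant to follow Link Grammar's conventions.

```python
def untyped_disjuncts(words, links):
    """For each word, join its connectors: 'other-' points left, 'other+' points right."""
    left = {i: [] for i in range(len(words))}
    right = {i: [] for i in range(len(words))}
    for a, b in sorted(tuple(sorted(link)) for link in links):
        right[a].append(words[b] + "+")
        left[b].append(words[a] + "-")
    return [(words[i], " & ".join(left[i] + right[i])) for i in range(len(words))]


disjuncts = untyped_disjuncts(["I", "saw", "some", "wood"], [(0, 1), (1, 3), (2, 3)])
for word, disjunct in disjuncts:
    print(f"{word}: {disjunct};")
```

Every connector here is just another word, so nothing in the output says which sense of "saw" was meant; that information would have to come from clustering the connectors into link-types.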
>>> >>>>
>>> >>>> Consider this example:
>>> >>>>
>>> >>>> I saw wood
>>> >>>> I saw some wood
>>> >>>>
>>> >>>> A solution that would be "almost perfect" (or "golden") would be
>>> this:
>>> >>>>
>>> >>>> saw: {performer-of-actions}- & {sculptable-mass}+;
>>> >>>> saw: {observer}-  & {viewable-thing}+;
>>> >>>>
>>> >>>> These disambiguate the two different senses of the word "saw".
>>> It's impossible to have word-sense disambiguation without actually having
>>> these disjuncts.  The word-pairs alone are not sufficient to report the
>>> link-type connecting the words.  Clustering gives the other dictionary
>>> entries:
>>> >>>>
>>> >>>> I: {performer-of-actions}+ or {observer}+;
>>> >>>> wood: {sculptable-mass}- or ({quantity-determiner}- &
>>> {viewable-thing}-);
>>> >>>> some: {quantity-determiner}+;
>>> >>>>
>>> >>>> Thus, the pronoun "I" also belongs to two different word-sense
>>> categories: performers and observers.  Compare to:
>>> >>>>
>>> >>>> "The chainsaw saws wood"  -- a "chainsaw" can be  a "performer of
>>> actions" but cannot be an "observer".
>>> >>>> "The dog saw some wood" -- dogs can be observers. They can perform
>>> some actions; like run, jump, but they cannot saw, hammer, cut, stab.
>>> >>>>
>>> >>>> The link-type is absolutely crucial to understanding a word.  The
>>> language-learning project is all about learning the link-types. Without
>>> correct link-type assignments, you cannot have correct parses.
>>> >>>>
>>> >>>> ... which is 100% of the problem with MST.  The problem with MST is
>>> not so much that "it's not accurate" - sure, it is not terribly accurate. But
>>> even if MST or some MST-replacement was 100% accurate, it would still be
>>> "wrong" because it fails to indicate the link-type.  If you want to
>>> understand a sentence, you MUST know the link-types!
>>> >>>>
>>> >>>> Otherwise, you just have "green ideas sleep furiously", which
>>> parses, but only because the link types have been erased, or made stupid.
>>> Here's a stupid grammar:
>>> >>>>
>>> >>>> ideas:  {adjective}- & {verb}+;
>>> >>>> green: {adjective}+;
>>> >>>>
>>> >>>> which allows "green ideas" to parse.  But of course, this is wrong;
>>> it should have been:
>>> >>>>
>>> >>>> ideas: {noospheric-modifier}- & {concept-manipulating-verb}+;
>>> >>>> green: {physical-object-modifier}+;
>>> >>>>
>>> >>>> and now it is clear that "green ideas" cannot parse, because the
>>> link-types clash.
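The two toy grammars above can be compared with a very small matching check. This is a deliberate simplification: connectors are plain strings that must match exactly, braces are dropped, and none of Link Grammar's real matching rules (subscripts, wildcards, optional connectors) are modeled. The name `can_link` is illustrative.

```python
def can_link(left_word_connectors, right_word_connectors):
    """True if some 'X+' on the left word matches an 'X-' on the right word."""
    wants_right = {c[:-1] for c in left_word_connectors if c.endswith("+")}
    wants_left = {c[:-1] for c in right_word_connectors if c.endswith("-")}
    return bool(wants_right & wants_left)


# The "stupid" grammar with coarse link-types.
stupid = {"green": ["adjective+"], "ideas": ["adjective-", "verb+"]}
# The finer-grained grammar with link-types that carry meaning.
typed = {
    "green": ["physical-object-modifier+"],
    "ideas": ["noospheric-modifier-", "concept-manipulating-verb+"],
}

print(can_link(stupid["green"], stupid["ideas"]))  # coarse types: the pair links
print(can_link(typed["green"], typed["ideas"]))    # fine types: the pair clashes
```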
>>> >>>>
>>> >>>> * If you cluster down to 5 or 6 clusters (adjective, verb, noun
>>> ...) you will get very low quality grammars.
>>> >>>>
>>> >>>> * If you cluster to 200 or 300 clusters, you get sort-of-OK
>>> grammars. This is what deep-learning/neural-nets do: this is why the
>>> deep-learning systems seem to give nice results: 200 or 300 features is
>>> enough to start having adequate functional distinctions (e.g. the famous
>>> "king - male+female=queen" example, or "paris-france+germany=berlin"
>>> example)
>>> >>>>
>>> >>>> * If you cluster to 3K to 8K clusters, you start having a quite
>>> decent model of language
>>> >>>>
>>> >>>> * Note that wordnet has 117K "synsets".
>>> >>>>
>>> >>>> Note that in the above example:
>>> >>>> wood: {sculptable-mass}- or ({quantity-determiner}- &
>>> {viewable-thing}-);
>>> >>>>
>>> >>>> the things in the curly-braces are effectively "synsets".
>>> >>>>
>>> >>>> The next set of goal-posts is to have disjuncts, of maybe
>>> low-medium quality, and use these to extract ontologies.  e.g.
>>> >>>> {sculptable-mass} is-a {mass} is-a {physical-thing} is-a {thing}
>>> >>>>
>>> >>>> You can try to do this by clustering but there are probably better
>>> ways of discovering ontology.
>>> >>>>
>>> >>>>
>>> >>>>>
>>> >>>>> >Also -- no hint of any word-classes or part-of-speech tagging?
>>> This is surely important to evaluate as well, or is this to be done in some
>>> other way?  i.e. to evaluate if "Pivi" was correctly clustered with other
>>> given names?  Or that lama/llama was clustered with other four-legged
>>> animals?
>>> >>>>>
>>> >>>>> We don't have that in MST-Parsing, right? We need this corpus to
>>> assess the quality of the MST-Parsing so we don't need part-of-speech
>>> information for that.
>>> >>>>
>>> >>>> But we know that MST parsing is shit.  Stop wasting time on MST or
>>> trying to "improve" it. We already know that it is close to a high-entropy
>>> path to structure; trying to squeeze a few more percent of entropy is not
>>> worth the effort, not at this time.  Focus on finding a high-entropy
>>> structure extraction algorithm, don't waste time on MST.
>>> >>>>
>>> >>>> You should be focusing on extracting disjuncts, word-classes,
>>> word-senses, and trying to improve the quality of those.  If you obtain a
>>> high-entropy path to these structures, the quality of your parses will
>>> automatically improve.  Focus on the entropy numbers. Try to maximize that.
>>> >>>>
>>> >>>>> The clustering is able to do that anyway - see the graphs in the
>>> end of the last year report:
>>> >>>>>
>>> >>>>>
>>> https://docs.google.com/document/d/1gxl-hIqPQCYPb9NNkyA3sBYUyfwvJFvT1hZ5ZpXsaPc/edit#heading=h.twoiv52o0tou
>>> >>>>>
>>> >>>>> >Also -- I can't tell -- is it free of loops, or are loops
>>> allowed?  Allowing loops tends to provide stronger, more accurate parses.
>>> Loops act as constraints.
>>> >>>>>
>>> >>>>> The loops and crossing links are not allowed in the MST-Parser
>>> now. If we allow them in the test corpus, how could it make assessment of
>>> MST-Parses better?
>>> >>>>>
>>> >>>>> Note that we ARE working with MST-Parses now, according to Ben's
>>> directions.
>>> >>>>
>>> >>>>
>>> >>>> Not to say bad things about Ben, but I'm certain he has not
>>> actually thought about this problem very much. He is very very busy doing
>>> other things; he is not thinking about this stuff.  I have repeatedly tried
>>> to explain the issues to him, and it's quite clear that he is far away from
>>> understanding them, from working at the level that I would like to have you
>>> and your team work at.
>>> >>>>
>>> >>>> I'm trying to have you make small, quantified baby-steps, to verify
>>> the accuracy of your methods and data.  What I'm seeing is that you are
>>> attempting to make giant-steps, without verification, and then getting
>>> low-quality results, without understanding the root causes for them.  You
>>> can't dig yourself out of a ditch, and digging harder and more furiously
>>> won't raise the accuracy of the parse results.
>>> >>>>
>>> >>>> --linas
>>> >>>>
>>> >>>>> We have your MST-Parser-less idea on the map but we are NOT trying
>>> it now:
>>> >>>>>
>>> >>>>> https://github.com/singnet/language-learning/issues/170
>>> >>>>>
>>> >>>>> We may try it after we explore the account for costs
>>> >>>>>
>>> >>>>> https://github.com/singnet/language-learning/issues/183
>>> >>>>>
>>> >>>>> Thanks,
>>> >>>>>
>>> >>>>> -Anton
>>> >>>>>
>>> >>>>> 24.03.2019 9:24, Linas Vepstas wrote:
>>> >>>>>
>>> >>>>> Also, BTW, link-grammar cannot parse "I just stood there, my hand
>>> on the knob, trembling like a leaf." correctly. It is one of a class of
>>> sentences it does not know about.  Which is maybe OK, because ideally, the
>>> learned grammar will be able to do this. But today, LG cannot.
>>> >>>>>
>>> >>>>> --linas
>>> >>>>>
>>> >>>>> On Sat, Mar 23, 2019 at 9:12 PM Linas Vepstas <
>>> [email protected]> wrote:
>>> >>>>>>
>>> >>>>>> Anton,
>>> >>>>>>
>>> >>>>>> It's certainly an unusual corpus, and it might give you rather
>>> low scores. I'd call it "interesting", but maybe not "golden". Although I
>>> suppose it depends on your training corpus.  Here are some problems that
>>> pop out:
>>> >>>>>>
>>> >>>>>> First sentence --
>>> >>>>>> "the old beast was whinnying on his shoulder" -- the word
>>> "whinnying" is a fairly rare English verb -- you could read half-a-million
>>> wikipedia articles, and not see it once. You could read lots of
>>> 19th-century or early-20th century cowboy/adventure novels, (like what
>>> you'd find on Project Gutenberg) and maybe see it some fair amount. Even
>>> then -- to "whinny on a shoulder" seems bizarre.. I guess he's hugging the
>>> horse? How often does that happen, in any cowboy novel? "to whinny on
>>> something" is an extremely rare construction.  It will work only if you've
>>> correctly categorized "whinny" as a verb that can take a preposition.  Are
>>> your clustering algos that good, yet, to correctly cluster rare words into
>>> appropriate verb categories?
>>> >>>>>>
>>> >>>>>> Second sentence .. "Jims" is a very uncommon name. Frankly, I've
>>> never heard of it as a name before.  Your training data is going to be
>>> extremely slim on this. And lack of training data means poor statistics,
>>> which means low scores.  Unless -- again, your clustering code is good
>>> enough to place "Jims" in a "proper name" cluster...
>>> >>>>>>
>>> >>>>>> "the lama snuffed blandly" -- "snuffed" is a very uncommon,
>>> almost archaic verb. These days, everyone spells llama with two l's, not
>>> one. Unless you're talking about Buddhist monks, it's a typo.
>>> >>>>>>
>>> >>>>>> "you understand?"  is .. awkward. Common in speech, uncommon in
>>> writing. Unlikely that you'll have enough training data for this.
>>> >>>>>>
>>> >>>>>> "Willard" is an uncommon name. Does your training corpus have a
>>> sufficient number of mentions of Willard? Do you have clustering working
>>> well enough to stick "Willard" into a cluster with other names?
>>> >>>>>>
>>> >>>>>> "it is so with Sammy Jay" is clearly archaic English.
>>> >>>>>>
>>> >>>>>> "he hasn't any relations here" is clearly archaic, an
>>> olde-fashioned construction.
>>> >>>>>>
>>> >>>>>> "Pivi said not one word" - again, a clearly old-fashioned
>>> construction. Does the training set contain enough examples of "Pivi" to
>>> recognize it as a name? Are names clustering correctly?
>>> >>>>>>
>>> >>>>>> Any sentence with an inversion is going to sound old-fashioned.
>>> All of the sentences in that corpus sound old-fashioned. Which maybe is OK
>>> if you are training on 19th century Gutenberg texts .. but it's certainly
>>> not modern English.  Even when I was a child, and I read those old
>>> crumbly-yellow paper adventure books, part of the fun was that no one
>>> actually talked that way -- not at school, not at home, not on TV. It was
>>> clearly from a different time and place -- an adventure.
>>> >>>>>>
>>> >>>>>> Anyway -- you only indicate pair-wise word-links. Is the omission
>>> of disjuncts intentional? Also -- no hint of any word-classes or
>>> part-of-speech tagging? This is surely important to evaluate as well, or is
>>> this to be done in some other way?  i.e. to evaluate if "Pivi" was
>>> correctly clustered with other given names?  Or that lama/llama was
>>> clustered with other four-legged animals?
>>> >>>>>>
>>> >>>>>> Also -- I can't tell -- is it free of loops, or are loops
>>> allowed?  Allowing loops tends to provide stronger, more accurate parses.
>>> Loops act as constraints.
>>> >>>>>>
>>> >>>>>> -- Linas
>>> >>>>>>
>>> >>>>>> On Thu, Mar 21, 2019 at 11:09 PM Anton Kolonin @ Gmail <
>>> [email protected]> wrote:
>>> >>>>>>>
>>> >>>>>>> Hi Linas, Andes and whoever understands LG and English well
>>> enough both.
>>> >>>>>>>
>>> >>>>>>> Attached are first 100 sentences for GC "gold standard" -
>>> manually checked based on LG parses.
>>> >>>>>>>
>>> >>>>>>> We are expecting more to come in the next two weeks.
>>> >>>>>>>
>>> >>>>>>> To enable that, please have a cursory review of the corpus and let
>>> us know if there are corrections still needed so your corrections will be
>>> used as a reference to fix the rest and keep going further.
>>> >>>>>>>
>>> >>>>>>> Thank you,
>>> >>>>>>>
>>> >>>>>>> -Anton
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> --
>>> >>>>>>> You received this message because you are subscribed to the
>>> Google Groups "lang-learn" group.
>>> >>>>>>> To unsubscribe from this group and stop receiving emails from
>>> it, send an email to [email protected].
>>> >>>>>>> To post to this group, send email to [email protected]
>>> .
>>> >>>>>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/lang-learn/bde76364-a578-4ab8-8ac5-2f49f794072b%40gmail.com
>>> .
>>> >>>>>>> For more options, visit https://groups.google.com/d/optout.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> --
>>> >>>>>> cassette tapes - analog TV - film cameras - you
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> --
>>> >>>>> -Anton Kolonin
>>> >>>>> skype: akolonin
>>> >>>>> cell: +79139250058
>>> >>>>> [email protected]
>>> >>>>> https://aigents.com
>>> >>>>> https://www.youtube.com/aigents
>>> >>>>> https://www.facebook.com/aigents
>>> >>>>> https://medium.com/@aigents
>>> >>>>> https://steemit.com/@aigents
>>> >>>>> https://golos.blog/@aigents
>>> >>>>> https://vk.com/aigents
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> --
>>> >>>> cassette tapes - analog TV - film cameras - you
>>> >>>>
>>> >>>> --
>>> >>>> -Anton Kolonin
>>> >>>> skype: akolonin
>>> >>>> cell: +79139250058
>>> >>>> [email protected]
>>> >>>> https://aigents.com
>>> >>>> https://www.youtube.com/aigents
>>> >>>> https://www.facebook.com/aigents
>>> >>>> https://medium.com/@aigents
>>> >>>> https://steemit.com/@aigents
>>> >>>> https://golos.blog/@aigents
>>> >>>> https://vk.com/aigents
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> cassette tapes - analog TV - film cameras - you
>>> >>
>>> >> --
>>> >> -Anton Kolonin
>>> >> skype: akolonin
>>> >> cell: +79139250058
>>> >> [email protected]
>>> >> https://aigents.com
>>> >> https://www.youtube.com/aigents
>>> >> https://www.facebook.com/aigents
>>> >> https://medium.com/@aigents
>>> >> https://steemit.com/@aigents
>>> >> https://golos.blog/@aigents
>>> >> https://vk.com/aigents
>>> >
>>> >
>>> >
>>> > --
>>> > cassette tapes - analog TV - film cameras - you
>>>
>>>
>>>
>>> --
>>> Ben Goertzel, PhD
>>> http://goertzel.org
>>>
>>> "Listen: This world is the lunatic's sphere,  /  Don't always agree
>>> it's real.  /  Even with my feet upon it / And the postman knowing my
>>> door / My address is somewhere else." -- Hafiz
>>>
>>
>>
>> --
>> cassette tapes - analog TV - film cameras - you
>>
>> --
>> -Anton Kolonin
>> skype: akolonin
>> cell: 
>> [email protected]
>> https://aigents.com
>> https://www.youtube.com/aigents
>> https://www.facebook.com/aigents
>> https://medium.com/@aigents
>> https://steemit.com/@aigents
>> https://golos.blog/@aigents
>> https://vk.com/aigents
>>
>>
>
> --
> cassette tapes - analog TV - film cameras - you
>
> --
> -Anton Kolonin
> skype: akolonin
> cell: 
> [email protected]
> https://aigents.com
> https://www.youtube.com/aigents
> https://www.facebook.com/aigents
> https://medium.com/@aigents
> https://steemit.com/@aigents
> https://golos.blog/@aigents
> https://vk.com/aigents
>
>

-- 
cassette tapes - analog TV - film cameras - you

-- 
You received this message because you are subscribed to the Google Groups 
"opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/opencog.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/opencog/CAHrUA36hBoqJ9jrhLnxnVbf-PApzCbvaYtK4JWiqQv96X6UFYQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
