Anton, that paper does not address, answer, talk about or mention any of the questions I have posed to you. I do not understand why we are feuding about this all the time. --linas
On Tue, Apr 2, 2019 at 12:32 PM Anton Kolonin @ Gmail <[email protected]> wrote: > >I don't understand what that "something" > > Hi Linas, last year's paper is here > > http://langlearn.singularitynet.io/data/docs/ > > this year's paper draft is attached. > > Cheers, > > -Anton > 03.04.2019 0:09, Linas Vepstas wrote: > > > > On Tue, Apr 2, 2019 at 12:57 AM Anton Kolonin @ Gmail <[email protected]> > wrote: > >> Hi Linas, >> >> Are you saying that "while ULL team has found strong linear correlation >> between A) quality (F1) on input parses and B) quality (F1) of the output >> parses based on the grammar learned from the input parses, this phenomenon >> is due to the fact that they test on the entire input corpus so this >> phenomenon should go away once they test on gold standard corpus consisting >> only of sentences with high-frequency words"? >> > > I am saying that I have not seen any evidence at all that you actually > constructed or counted disjuncts, or that you clustered disjuncts, or that > you controlled or managed counting in any way. > > So -- you did something ... but I don't understand what that "something" > is, and, based on these conversations, that "something" does not match up > with what I had hoped that you would be doing. > > It's not just high-frequency words. It's also how you perform clustering. > Are you using MI for that? or cosines for that? Are you handling > word-sense disambiguation, or not? How did you handle WSD? Through > orthogonalization of cosines? Through maximization of MI? By computing a > Markov vector? Some other way? Did you perform any data cuts during > orthogonalization/maximization? What kind of cuts were they? How do the > cuts affect the F1-score? > > All of these things deserve "instrumental verification". Without them, I > don't know how to assign any meaning to F1-scores (or ROC curves, which you > haven't shown - and even if you did show them, I would not know what they > mean, until the above questions are resolved.)
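For readers following the F1 discussion above, here is a minimal sketch of how an F1-score over parse links is commonly computed. It is not the ULL team's scoring code: the function name `link_f1` and the representation of a parse as a set of untyped word-index pairs are assumptions made for this illustration only.

```python
def link_f1(gold_links, test_links):
    """F1 of a test parse against a gold parse.

    Assumption for this sketch: each parse is a collection of untyped
    links, each link being a pair of word positions in the sentence.
    """
    # Treat links as unordered pairs so (1, 2) and (2, 1) match.
    gold = {tuple(sorted(link)) for link in gold_links}
    test = {tuple(sorted(link)) for link in test_links}
    true_positives = len(gold & test)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(test)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = [(0, 1), (1, 2), (2, 3)]
    test = [(0, 1), (1, 3), (2, 3)]
    print(f"F1 = {link_f1(gold, test):.3f}")
```

Note that a score computed this way compares only which word pairs are linked. It carries no information about disjuncts, link types, clustering, or data cuts, which is why the questions above cannot be answered from the F1 number alone.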
> > So I've got this ball of questions, and I'm getting unclear, confusing > answers to them. > > -- Linas > > Best regards, >> >> -Anton >> >> >> 02.04.2019 5:38, Linas Vepstas wrote: >> >> OK, there's clearly a lot of work happening in linguistics these days, >> that I have fallen behind on reading. >> >> The nature of the conversations here has been frustrating, because so >> far, it sounds like an attempt to evade the "central limit theorem" -- >> https://en.wikipedia.org/wiki/Central_limit_theorem >> >> There are two related ideas I'm trying to get across: one is that if you >> make enough observations of a phenomenon, eventually, the central-limit >> theorem kicks in, and smooths over random variations. Specifically, I >> claim that, despite MST being imperfect, a large number of observations >> should smooth over the imperfections. I believe this to be true, (but I >> could be wrong). >> >> The other idea is that the golden test corpus must avoid accidentally >> testing disjuncts far away from the central limit -- to avoid, as it were, >> making statements analogous to "Well, I flipped the coin three times, and I >> did not get 50-50 odds, therefore the theory doesn't work". You have to >> flip the coin at least N times, for some large N. Here, for MST, we don't >> know how big N has to be, we don't have a good plan for determining N. >> It's worse, because everything is Zipfian aka 1/f noise. It is possible that >> BERT or other approaches allow smaller values of N to work, but this is >> also not clear. >> >> It's also not clear that BERT would converge to a different limit than MST >> - the central-limit theorem says there is only one limit -- not two. But >> perhaps I'm misapplying it, perhaps I'm neglecting some important effect. >> Without measurements, it's hard to guess what that effect is (if it even >> exists). >> >> Anyway, I have a backlog of half-a-dozen important unread papers, so I'll >> try to get around to that "real soon now".
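The coin-flip analogy above can be made concrete with a small simulation. This is only an illustrative sketch of the central-limit argument, not anything taken from the ULL pipeline; the function name `estimate_spread`, the number of trials, and the seed are assumptions chosen for the example. It measures how widely the observed fraction of heads scatters around 0.5 for different numbers of flips N.

```python
import random
import statistics


def estimate_spread(n_flips, n_trials=500, seed=0):
    """Standard deviation of the observed heads fraction over many trials.

    Hypothetical helper for illustration: each trial flips a fair coin
    n_flips times and records the fraction of heads that came up.
    """
    rng = random.Random(seed)
    fractions = []
    for _ in range(n_trials):
        heads = sum(1 for _ in range(n_flips) if rng.random() < 0.5)
        fractions.append(heads / n_flips)
    return statistics.pstdev(fractions)


if __name__ == "__main__":
    for n in (3, 30, 300, 3000):
        print(f"N={n:5d}  spread of heads fraction ~ {estimate_spread(n):.4f}")
```

For independent fair flips the spread shrinks roughly as 0.5/sqrt(N), so three flips say almost nothing. As the message notes, word statistics are Zipfian, so the N needed for MST-based counts is not known and may be much larger than this toy case suggests.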
>> >> --linas >> >> >> >> On Mon, Apr 1, 2019 at 12:15 AM Ben Goertzel <[email protected]> wrote: >> >>> "Replacing MST by DNN/BERT" is a strange way to put it... >>> >>> DNN/BERT builds a pretty complex and comprehensive language model, >>> much beyond what is done by calculation of MI values and similar >>> >>> The extraction of a parse dag satisfying syntactic constraints (no >>> links cross, covering all words in the sentence, connected graph) is a >>> conceptually simple step, and nobody is spending much time on this >>> step indeed... >>> >>> The question of how to assign a quantitative weight to the relation >>> btw two word-instances in a sentence, taking into account the specific >>> context in that sentence, but also the history of co-utilization of >>> those words (or other similar words), is less conceptually simple and >>> this is one place I think DNN language models can help >>> >>> Using MST or similar parsing based on numbers exported from DNN >>> language models is one way of extracting symbolic-ish structured >>> knowledge from these big messy subsymbolic probabilistic language >>> models... >>> >>> The DNNs in use now like BERT do not really satisfy me on a >>> theoretical or conceptual level, but they have been tuned to work >>> pretty nicely and they have been implemented pretty efficiently on >>> multi-GPU hardware -- so, given this and given the quality of the >>> recent practical results obtained with them -- I consider it well >>> worth exploring how to use them as tools in our pursuits for grammar >>> and semantics learning >>> >>> -- Ben >>> >>> On Mon, Apr 1, 2019 at 2:07 PM Linas Vepstas <[email protected]> >>> wrote: >>> > >>> > >>> > >>> > On Sun, Mar 31, 2019 at 10:51 PM Anton Kolonin @ Gmail < >>> [email protected]> wrote: >>> >> >>> >> Hi Linas, I like this thread more and more :-) >>> > >>> > I don't. I use a lot of CAPITALIZED WORDS below. 
There is a deep and >>> dark fundamental misunderstanding, and I am sometimes at wits end trying to >>> figure out why, and how to explain things in an understandable fashion. >>> >> >>> >> >But somehow, I suspect... Isn't this why OpenCog has "unified rule >>> engine" (URE) instead of link grammar at its core, >>> >> >>> >> Linas, the "extraction of phrasemes" goal approaching has been >>> discussed exactly in terms of MST->GL->URL on the last fall in Hong Kong >>> discussion: >>> https://docs.google.com/document/d/13YyqtGud0GAbVaFcc94kAd2LhGf7jTr5XDYgiuC294c/edit >>> >> >>> >> That is: >>> >> >>> >> 1) Do MST-parsing to get word links proto-disjuncts >>> >> >>> >> 2) Do Grammar Learning to cluster and conclude word categories and >>> rules with disjuncts >>> >> >>> >> 3) Do URE-kind-of-thing to build the rules into "phrasemes" or >>> "sections" or "patterns". >>> > >>> > Yes. >>> >> >>> >> However, your current discourse and our current results just show >>> that "no one is be able to do reasonable MST-parsing" so the above is just >>> waste of time, correct? >>> > >>> > No. Very much no. I'm saying the opposite of that. You can replace >>> MST by almost *ANYTHING* else, and the quality of your results WILL NOT >>> CHANGE! >>> > >>> > If the quality of your results depends on the quality of MST, you are >>> DOING SOMETHING WRONG! >>> > >>> > I'm utterly flabbergasted. I don't know how many more times I can say >>> this: stop wasting time on this unimportant step! >>> >> >>> >> At the time we speak, Ben, Alexely, Sergey and Asuares are trying to >>> use DNN/BERT magic to do the trick 1. >>> > >>> > I want to call this "a complete waste of time". It will almost surely >>> not improve the quality of the results! I don't understand why four smart >>> people think that replacing MST by BERT will make any difference at all! >>> It should not matter! Nothing depends on this step! Anything at all, >>> anything with a probability better than random chance, is sufficient! 
Why >>> isn't this obvious? >>> > >>> > If Ben is reading this: I recall talking to Ben about this in an >>> ice-cream shop in Berlin, for an AGI conference, and he seemed to >>> understand back then. I have no idea why he changed his mind. I really do >>> not understand why everyone spends so much time obsessing about MST. Is >>> this a "color of the bike shed" problem? >>> https://en.wikipedia.org/wiki/Law_of_triviality >>> > >>> > MST-vs.-BERT==color-of-bike-shed >>> > >>> > Just use MST. It's simple. It works. It gives good results. Stop >>> trying to improve it. The interesting problems are elsewhere! Just use >>> MST, and move on to the good stuff! >>> >> >>> >> To my mind, that may get possible only if the DNN/BERT magic do the >>> trick having the steps 2 and 3 done under the hood. If this is done, in >>> such case, we don't need to do 2 and 3 after we have the DNN/BERT-based >>> model, because we can simply "milk-out" the grammar rules out of DNN/BERT >>> micelium for that. And we don't need the ULL as well by the way, because we >>> just need DNN/BERT and rows of different sorts of milk machines around it. >>> > >>> > So why are you bothering to work on ULL? >>> >> >>> >> So, instead of solving the problem of constructing the pipeline for >>> learning grammar from raw text we need to solve the problem of milking the >>> grammar out of DNN/BERT model trained on these texts, right? >>> > >>> > Because I don't think that you know how to milk lexical functions out >>> of DNN/BERT -- We've wasted more than a year talking about MST. Instead of >>> endlessly talking about MST, you could have JUST USED IT, WITHOUT ANY >>> MODIFICATIONS, gotten good results, and spent the year working on something >>> interesting! >>> > >>> > Again: replacing MST by DNN/BERT with something else will NOT IMPROVE >>> the accuracy! You'll have exactly the same accuracy as before, and if your >>> accuracy improves, it is because you are doing something wrong! 
>>> > >>> >> However, either way, we need to understand algorithmic machinery of >>> how the links assemble in disjuncts and disjuncts assemble into sections, >>> through the universe-scale combinatorial explosion. >>> > >>> > No. That is the OPPOSITE of what ACTUALLY HAPPENS!!!! >>> >> >>> >> And I agree that clustering and categorizing word and links (and then >>> disjuncts and sections, right) is part of the process - explicitly in ULL >>> pipeline or implicitly deep in DNN/BERT darkness. >>> > >>> > It is NOT DEEP AND DARK. I wrote not one but TWO PAPERS on this, >>> CASTING LIGHT ON THAT DARKNESS >>> > >>> > I'm frustrated to the 43rd degree on why I cannot seem to have a >>> reasonable conversation with any other human being about any of this. >>> > >>> > -- Linas >>> > >>> >> Cheers, >>> >> >>> >> -Anton >>> >> >>> >> >>> >> 01.04.2019 9:17, Linas Vepstas: >>> >> >>> >> >>> >> >>> >> On Thu, Mar 28, 2019 at 10:22 AM Ivan V. <[email protected]> >>> wrote: >>> >>> >>> >>> Linas Vepstas wrote: >>> >>> >>> >>> >... knowledge extraction can be done generically, and not just on >>> language. >>> >>> >>> >>> If link grammar would be Turing complete, this might be possible >>> right away. >>> >> >>> >> >>> >> In my experience, thinking about Turing completeness is unproductive >>> and a distraction. >>> >> >>> >>> But somehow, I suspect... Isn't this why OpenCog has "unified rule >>> engine" (URE) instead of link grammar at its core, >>> >> >>> >> >>> >> No. It has the rule-engine because back then, I did not understand >>> sheaves. I'm starting to think that the rule engine is a strategic >>> mistake. The original idea is that rule-application is the main conceptual >>> abstraction of term-rewriting. One rewrites, or proves theorems by >>> applying sequences of rules. It turns out that discovering the right >>> sequence is hard. Finding correct long sequences is hard - a combinatorial >>> explosion. >>> >> >>> >> The openpsi system addresses some of these issues. 
Unfortunately, >>> its current implementation is a tangle of rule-selection mechanisms, and >>> theories of human psychology. It's probably better than the URE, but is >>> currently not as powerful. >>> >> >>> >> I'm trying to place a theory of sheaves as a replacement for URE, and >>> as the natural generalization of openpsi, but I've successfully >>> self-sabotaged myself in these efforts. >>> >> >>> >>> >>> >>> and with URE things get much more complicated. I'm sorry, but that >>> is still a Gordian knot to me, considering all of my modest knowledge. >>> >> >>> >> >>> >> We all have modest knowledge. That is the nature of the human >>> condition. >>> >> >>> >>> >>> >>> On the other hand, if someone really smart would provide automatic >>> grammar extraction by means of unrestricted grammar, I believe that would >>> be it. >>> >> >>> >> >>> >> Yes, that is the goal of the language-learning project. However, as >>> noted in my last email (on the link-grammar list) it is not enough to just >>> learn a semi-Thue system, declare victory, and go home. The example I gave >>> there: >>> >> >>> >> "I think that you should give that car a second look" >>> >> "you should really give that song a second listen" >>> >> "maybe you should give Sue a second chance". >>> >> >>> >> Learning to parse these "set phrases" or phrasemes is equivalent to >>> learning a semi-Thue system; however, it's not enough to realize that all >>> three are forms of advice-giving, having "conserved" or "fixed" regions "x >>> YOU SHOULD y GIVE z SECOND w" where z is very highly variable having >>> millions of variations, and w only has a few dozen allowed variations. >>> Note that the words "fixed", "conserved", "variable" are words used in >>> genetics and proteomics and antibody structure. It's the same idea. >>> >> >>> >> The goal of learning lexical functions (LF's) is to learn that all >>> three are advice-giving forms, and also to learn what is, and what can be >>> plugged in for x,y,z,w.
So, although a super-whiz-bang grammar learner >>> capable of learning context-sensitive languages should be able to learn "x >>> YOU SHOULD y GIVE z SECOND w", it still will not know the *meaning* of this >>> phrase. To know the *meaning*, you have to know the acceptable ranges (as >>> fuzzy-sets) of x,y,z,w. >>> >> >>> >> To conclude, thinking about Turing-completeness is a waste of time, >>> because Turing completeness only tells you that "x YOU SHOULD y GIVE z >>> SECOND w" is recursively enumerable; it does not tell you what it actually >>> means. >>> >> >>> >> Put another way: having a universal Turing machine is not the same >>> as knowing how some particular program works. Automagically learning a >>> context-sensitive grammar is not enough to know what that grammar is >>> "saying/doing". >>> >> >>> >> -- Linas >>> >> >>> >>> >>> >>> >>> >>> Thank you, >>> >>> Ivan V. >>> >>> >>> >>> >>> >>> On Thu, Mar 28, 2019 at 07:58 Anton Kolonin @ Gmail <[email protected]> >>> wrote: >>> >>>> >>> >>>> Ben, Linas, >>> >>>> >>> >>>> >But we know that MST parsing is shit. Stop wasting time on MST or >>> trying to "improve" it. >>> >>>> >>> >>>> I think that sounds like a kind of support for the concept of "dumb >>> explosive parsing" that was being advocated 1+ year ago: >>> >>>> >>> >>>> >>> https://docs.google.com/document/d/14MpKLH5_5eVI39PRZuWLZHa1aUS73pJZNZzgigCWwWg/edit#heading=h.aqo9bumb3doy >>> >>>> >>> >>>> I also agree with Linas's other reasoning in this thread. I would >>> consider giving it a try starting next month if we don't have a >>> breakthrough with DNN-MI-milking-based-MST-Parsing by that time.
>>> >>>> >>> >>>> > can be done generically, and not just on language >>> >>>> >>> >>>> I think everyone in bio-informatics dreams of extracting secrets of >>> "dark side of the genome" with something like that ;-) >>> >>>> >>> >>>> Cheers, >>> >>>> >>> >>>> -Anton >>> >>>> >>> >>>> >>> >>>> 28.03.2019 1:24, Linas Vepstas wrote: >>> >>>> >>> >>>> Hi Anton, >>> >>>> >>> >>>> I've cc'ed the link-grammar mailing list, because I describe below >>> some concepts for word-sense disambiguation. I'm also cc'ing the opencog >>> mailing list and Ivan Vodisek, because after studying Hilbert systems, I >>> think he's ready to think about how knowledge extraction can be done >>> generically, and not just on language. >>> >>>> >>> >>>> -- Linas >>> >>>> >>> >>>> On Mon, Mar 25, 2019 at 1:39 AM Anton Kolonin @ Gmail < >>> [email protected]> wrote: >>> >>>>> >>> >>>>> Hi Linas, >>> >>>>> >>> >>>>> >I'd call it "interesting", but maybe not "golden" >>> >>>>> >>> >>>>> These are randomly selected sentences from the "Gutenberg Children" >>> corpus: >>> >>>>> >>> >>>>> >>> http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/lower_LGEng_token/ >>> >>>>> >>> >>>>> "Gutenberg Children silver standard" is LG-English parses: >>> >>>>> >>> >>>>> >>> http://langlearn.singularitynet.io/data/parses/English/Gutenberg-Children-Books/test/GCB-LG-English-clean.ull >>> >>>>> >>> >>>>> "Gutenberg Children gold standard" is a subset of "silver standard" >>> with semi-random selection of sentences skipping direct speech and doing >>> manual verification of the links. >>> >>>>> >>> >>>>> So as long as we are training on the "Gutenberg Children" corpus, >>> having the test on the same "Gutenberg Children" seems reasonable, right? >>> >>>> >>> >>>> >>> >>>> Yes. You still need to verify that each word in the "golden" corpus >>> occurs at least N=10 or 20 times in the training corpus.
The dependency of >>> accuracy on N is not generally known, but it is very clear that if a word >>> occurs only N=3 times in the training corpus, then whatever is learned >>> about it will be of very low quality. >>> >>>> >>> >>>>> >>> >>>>> But thanks, we may have put more effort in removal of ancient >>> constructions and words even if these are present in the corpus. >>> >>>> >>> >>>> If you consistently train on 19th century literature, and then >>> evaluate 19th-century literature comprehension, that's fine. Just don't >>> expect it to work for 21st century blog posts. >>> >>>> >>> >>>> The strongest effect will be the N=number of observations effect. >>> >>>> >>> >>>>> >>> >>>>> >Anyway -- you only indicate pair-wise word-links. Is the omission >>> of disjuncts intentional? >>> >>>>> >>> >>>>> If you have all links in the sentence, you can construct all of >>> the disjuncts with no ambiguity, correct? >>> >>>> >>> >>>> No, but only because you did not indicate the link-type. The whole >>> point of a clustering step is to obtain a link-type; if you discard it, you >>> will never get better-than-MST results. The link-type is critical for >>> obtaining the word-classes. The whole point of learning is to learn the >>> word-classes; you've learned very little, if you know only word-pairs. >>> >>>> >>> >>>> Consider this example: >>> >>>> >>> >>>> I saw wood >>> >>>> I saw some wood >>> >>>> >>> >>>> A solution that would be "almost perfect" (or "golden") would be >>> this: >>> >>>> >>> >>>> saw: {performer-of-actions}- & {sculptable-mass}+; >>> >>>> saw: {observer}- & {viewable-thing}+; >>> >>>> >>> >>>> These disambiguate the two different senses of the word "saw". >>> It's impossible to have word-sense disambiguation without actually having >>> these disjuncts. The word-pairs alone are not sufficient to report the >>> link-type connecting the words.
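The point about link-types can be illustrated with a small sketch that assembles a disjunct for each word from the links of a parse. This is not Link Grammar's or the ULL pipeline's code: the function name `disjuncts`, the `(left_index, right_index, link_type)` tuple format, the nearest-first connector ordering, and the placeholder type `ANY` for untyped links are all assumptions made for this example. The link-type names are the ones used in the message.

```python
def disjuncts(words, links):
    """Return (word, disjunct) pairs built from typed links.

    Assumed input format: links is a list of
    (left_index, right_index, link_type) tuples over positions in words.
    A connector to the left is written 'type-', to the right 'type+'.
    """
    left = {i: [] for i in range(len(words))}
    right = {i: [] for i in range(len(words))}
    for li, ri, link_type in links:
        right[li].append((ri, link_type))  # word li links rightwards
        left[ri].append((li, link_type))   # word ri links leftwards
    result = []
    for i, word in enumerate(words):
        # Nearest word first in both directions (a choice for this sketch).
        conns = [f"{t}-" for _, t in sorted(left[i], reverse=True)]
        conns += [f"{t}+" for _, t in sorted(right[i])]
        result.append((word, " & ".join(conns)))
    return result


if __name__ == "__main__":
    cut = disjuncts(["I", "saw", "wood"],
                    [(0, 1, "performer-of-actions"), (1, 2, "sculptable-mass")])
    see = disjuncts(["I", "saw", "some", "wood"],
                    [(0, 1, "observer"), (1, 3, "viewable-thing"),
                     (2, 3, "quantity-determiner")])
    untyped = disjuncts(["I", "saw", "wood"], [(0, 1, "ANY"), (1, 2, "ANY")])
    for name, parse in (("cut", cut), ("see", see), ("untyped", untyped)):
        print(name, dict(parse)["saw"])
```

With typed links the two senses of "saw" come out as different disjuncts; with untyped word-pairs both collapse to the same `ANY- & ANY+`, so nothing is left to distinguish them.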
Clustering gives the other dictionary >>> entries: >>> >>>> >>> >>>> I: {performer-of-actions}+ or {observer}+; >>> >>>> wood: {sculptable-mass}- or ({quantity-determiner}- & >>> {viewable-thing}-); >>> >>>> some: {quantity-determiner}+; >>> >>>> >>> >>>> Thus, the pronoun "I" also belong to two different word-sense >>> categories: performers and observers. Compare to: >>> >>>> >>> >>>> "The chainsaw saws wood" -- a "chainsaw" can be a "performer of >>> actions" but cannot be an "observer". >>> >>>> "The dog saw some wood" -- dogs can be observers. They can perform >>> some actions; like run, jump, but they cannot saw, hammer, cut, stab. >>> >>>> >>> >>>> The link-type is absolutely crucial to understanding a word. The >>> language-learning project is all about learning the link-types. Without >>> correct link-type assignments, you cannot have correct parses. >>> >>>> >>> >>>> ... which is 100% of the problem with MST. The problem with MST is >>> not so much that "its not accurate" -sure, it is not terribly accurate. But >>> even if MST or some MST-replacement was 100% accurate, it would still be >>> "wrong" because it fails to indicate the link-type. If you want to >>> understand a sentence, you MUST know the link-types! >>> >>>> >>> >>>> Otherwise, you just have "green ideas sleep furiously", which >>> parses, but only because the link types have been erased, or made stupid. >>> Here's a stupid grammar: >>> >>>> >>> >>>> ideas: {adjective}- & {verb}+; >>> >>>> green: {adjective}+; >>> >>>> >>> >>>> which allows "green ideas" to parse. But of course, this is wrong; >>> it should have been: >>> >>>> >>> >>>> ideas: {noospheric-modifier}- & {concept-manipulating-verb}+; >>> >>>> green: {physical-object-modifier}+; >>> >>>> >>> >>>> and now it is clear that "green ideas" cannot parse, because the >>> link-types clash. >>> >>>> >>> >>>> * If you cluster down to 5 or 6 clusters (adjective, verb, noun >>> ...) you will get very low quality grammars. 
>>> >>>> >>> >>>> * If you cluster to 200 or 300 clusters, you get sort-of-OK >>> grammars. This is what deep-learning/neural-nets do: this is why the >>> deep-learning systems seem to give nice results: 200 or 300 features is >>> enough to start having adequate functional distinctions (e.g. the famous >>> "king - male+female=queen" example, or "paris-france+germany=berlin" >>> example) >>> >>>> >>> >>>> * If you cluster to 3K to 8K clusters, you start having a quite >>> decent model of language >>> >>>> >>> >>>> * Note that wordnet has 117K "synsets". >>> >>>> >>> >>>> Note that in the above example: >>> >>>> wood: {sculptable-mass}- or ({quantity-determiner}- & >>> {viewable-thing}-); >>> >>>> >>> >>>> the things in the curly-braces are effectively "synsets". >>> >>>> >>> >>>> The next set of goal-posts is to have disjuncts, of maybe >>> low-medium quality, and use these to extract ontologies. e.g. >>> >>>> {sculptable-mass} is-a {mass} is-a {physical-thing} is-a {thing} >>> >>>> >>> >>>> You can try to do this by clustering but there are probably better >>> ways of discovering ontology. >>> >>>> >>> >>>> >>> >>>>> >>> >>>>> >Also -- no hint of any word-classes or part-of-speech tagging? >>> This is surely important to evaluate as well, or is this to be done in some >>> other way? i.e. to evaluate if "Pivi" was correctly clustered with other >>> given names? Or that lama/llama was clustered with other four-legged >>> animals? >>> >>>>> >>> >>>>> We don't have that in MST-Parsing, right? We need this corpus to >>> assess the quality of the MST-Parsing so we don't need part-of-speech >>> information for that. >>> >>>> >>> >>>> But we know that MST parsing is shit. Stop wasting time on MST or >>> trying to "improve" it. We already know that it is close to a high-entropy >>> path to structure; trying to squeeze a few more percent of entropy is not >>> worth the effort, not at this time. 
Focus on finding a high-entropy >>> structure extraction algorithm, don't waste time on MST. >>> >>>> >>> >>>> You should be focusing on extracting disjuncts, word-classes, >>> word-senses, and trying to improve the quality of those. If you obtain a >>> high-entropy path to these structures, the quality of your parses will >>> automatically improve. Focus on the entropy numbers. Try to maximize that. >>> >>>> >>> >>>>> The clustering is able to do that anyway - see the graphs in the >>> end of the last year report: >>> >>>>> >>> >>>>> >>> https://docs.google.com/document/d/1gxl-hIqPQCYPb9NNkyA3sBYUyfwvJFvT1hZ5ZpXsaPc/edit#heading=h.twoiv52o0tou >>> >>>>> >>> >>>>> >Also -- I can't tell -- is it free of loops, or are loops >>> allowed? Allowing loops tends to provide stronger, more accurate parses. >>> Loops act as constraints. >>> >>>>> >>> >>>>> The loops and crossing links are not allowed in the MST-Parser >>> now. If we allow them in the test corpus, how could it make assessment of >>> MST-Parses better? >>> >>>>> >>> >>>>> Note, that we ARE working we MST-Parses now - accordingly to Ben's >>> directions. >>> >>>> >>> >>>> >>> >>>> Not to say bad things about Ben, but I'm certain he has not >>> actually thought about this problem very much. He is very very busy doing >>> other things; he is not thinking about this stuff. I have repeatedly tried >>> to explain the issues to him, and its quite clear that he is far away from >>> understanding them, from working at the level that I would like to have you >>> and your team work at. >>> >>>> >>> >>>> I'm trying to have you make small, quantified baby-steps, to verify >>> the accuracy of your methods and data. What I'm seeing is that you are >>> attempting to make giant-steps, without verification, and then getting >>> low-quality results, without understanding the root causes for them. You >>> can't dig yourself out of a ditch, and digging harder and more furiously >>> won't raise the accuracy of the parse results. 
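One of the small, verifiable steps mentioned earlier in this message, checking that every word of the "golden" corpus occurs at least N times in the training corpus, can be sketched as follows. This is a minimal illustration, not existing project code: the function name `rare_gold_words`, the whitespace tokenization, and the lower-casing are assumptions, and the default threshold of 10 is simply the lower of the two values suggested above.

```python
from collections import Counter


def rare_gold_words(training_sentences, gold_sentences, min_count=10):
    """Map each gold-corpus word seen fewer than min_count times in the
    training corpus to its training count.

    Assumption for this sketch: sentences are plain strings that can be
    tokenized by lower-casing and splitting on whitespace.
    """
    counts = Counter(
        word for sentence in training_sentences
        for word in sentence.lower().split()
    )
    gold_vocab = {
        word for sentence in gold_sentences
        for word in sentence.lower().split()
    }
    # Counter returns 0 for words never seen in training.
    return {w: counts[w] for w in sorted(gold_vocab) if counts[w] < min_count}


if __name__ == "__main__":
    training = ["the horse ran"] * 12 + ["the old beast was whinnying"]
    gold = ["the old horse was whinnying"]
    for word, n in rare_gold_words(training, gold).items():
        print(f"{word!r} occurs only {n} time(s) in training")
```

Gold sentences containing any word reported here could be set aside or scored separately, so that low scores caused by too few observations are not confused with problems in the learning method.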
>>> >>>> >>> >>>> --linas >>> >>>> >>> >>>>> We have your MST-Parser-less idea on the map but we are NOT trying >>> it now: >>> >>>>> >>> >>>>> https://github.com/singnet/language-learning/issues/170 >>> >>>>> >>> >>>>> We may try it after we explore the account for costs >>> >>>>> >>> >>>>> https://github.com/singnet/language-learning/issues/183 >>> >>>>> >>> >>>>> Thanks, >>> >>>>> >>> >>>>> -Anton >>> >>>>> >>> >>>>> 24.03.2019 9:24, Linas Vepstas пишет: >>> >>>>> >>> >>>>> Also, BTW, link-grammar cannot parse "I just stood there, my hand >>> on the knob, trembling like a leaf." correctly. It is one of a class of >>> sentences it does not know about. Which is maybe OK, because ideally, the >>> learned grammar will be able to do this. But today, LG cannot. >>> >>>>> >>> >>>>> --linas >>> >>>>> >>> >>>>> On Sat, Mar 23, 2019 at 9:12 PM Linas Vepstas < >>> [email protected]> wrote: >>> >>>>>> >>> >>>>>> Anton, >>> >>>>>> >>> >>>>>> It's certainly an unusual corpus, and it might give you rather >>> low scores. I'd call it "interesting", but maybe not "golden". Although I >>> suppose it depends on your training corpus. Here are some problems that >>> pop out: >>> >>>>>> >>> >>>>>> First sentence -- >>> >>>>>> "the old beast was whinnying on his shoulder" -- the word >>> "whinnying" is a fairly rare English verb -- you could read half-a-million >>> wikipedia articles, and not see it once. You could read lots of >>> 19th-century or early-20th century cowboy/adventure novels, (like what >>> you'd find on Project Gutenberg) and maybe see it some fair amount. Even >>> then -- to "whinny on a shoulder" seems bizarre.. I guess he's hugging the >>> horse? How often does that happen, in any cowboy novel? "to whinny on >>> something" is an extremely rare construction. It will work only if you've >>> correctly categorized "whinny" as a verb that can take a preposition. 
Are >>> your clustering algos that good, yet, to correctly cluster rare words into >>> appropriate verb categories? >>> >>>>>> >>> >>>>>> Second sentence .. "Jims" is a very uncommon name. Frankly, I've >>> never heard of it as a name before. Your training data is going to be >>> extremely slim on this. And lack of training data means poor statistics, >>> which means low scores. Unless -- again, your clustering code is good >>> enough to place "Jims" in a "proper name" cluster... >>> >>>>>> >>> >>>>>> "the lama snuffed blandly" -- "snuffed" is a very uncommon, >>> almost archaic verb. These days, everyone spells llama with two l's not >>> one. Unless you're talking about Buddhist monks, it's a typo. >>> >>>>>> >>> >>>>>> "you understand?" is .. awkward. Common in speech, uncommon in >>> writing. Unlikely that you'll have enough training data for this. >>> >>>>>> >>> >>>>>> "Willard" is an uncommon name. Does your training corpus have a >>> sufficient number of mentions of Willard? Do you have clustering working >>> well enough to stick "Willard" into a cluster with other names? >>> >>>>>> >>> >>>>>> "it is so with Sammy Jay" is clearly archaic English. >>> >>>>>> >>> >>>>>> "he hasn't any relations here" is clearly archaic, an >>> olde-fashioned construction. >>> >>>>>> >>> >>>>>> "Pivi said not one word" - again, a clearly old-fashioned >>> construction. Does the training set contain enough examples of "Pivi" to >>> recognize it as a name? Are names clustering correctly? >>> >>>>>> >>> >>>>>> Any sentence with an inversion is going to sound old-fashioned. >>> All of the sentences in that corpus sound old-fashioned. Which maybe is OK >>> if you are training on 19th century Gutenberg texts .. but it's certainly >>> not modern English. Even when I was a child, and I read those old >>> crumbly-yellow paper adventure books, part of the fun was that no one >>> actually talked that way -- not at school, not at home, not on TV.
It was >>> clearly from a different time and place -- an adventure. >>> >>>>>> >>> >>>>>> Anyway -- you only indicate pair-wise word-links. Is the omission >>> of disjuncts intentional? Also -- no hint of any word-classes or >>> part-of-speech tagging? This is surely important to evaluate as well, or is >>> this to be done in some other way? i.e. to evaluate if "Pivi" was >>> correctly clustered with other given names? Or that lama/llama was >>> clustered with other four-legged animals? >>> >>>>>> >>> >>>>>> Also -- I can't tell -- is it free of loops, or are loops >>> allowed? Allowing loops tends to provide stronger, more accurate parses. >>> Loops act as constraints. >>> >>>>>> >>> >>>>>> -- Linas >>> >>>>>> >>> >>>>>> On Thu, Mar 21, 2019 at 11:09 PM Anton Kolonin @ Gmail < >>> [email protected]> wrote: >>> >>>>>>> >>> >>>>>>> Hi Linas, Andes and whoever understands LG and English well >>> enough both. >>> >>>>>>> >>> >>>>>>> Attached are first 100 sentences for GC "gold standard" - >>> manually checked based on LG parses. >>> >>>>>>> >>> >>>>>>> We are expecting more to come in the next two weeks. >>> >>>>>>> >>> >>>>>>> To enable that, please have cursory review of the corpus and let >>> us know if there are corrections still needed so your corrections will be >>> used as a reference to fix the rest and keep going further. >>> >>>>>>> >>> >>>>>>> Thank you, >>> >>>>>>> >>> >>>>>>> -Anton >>> >>>>>>> >>> >>>>>>> >>> >>>>>>> -- >>> >>>>>>> You received this message because you are subscribed to the >>> Google Groups "lang-learn" group. >>> >>>>>>> To unsubscribe from this group and stop receiving emails from >>> it, send an email to [email protected]. >>> >>>>>>> To post to this group, send email to [email protected] >>> . >>> >>>>>>> To view this discussion on the web visit >>> https://groups.google.com/d/msgid/lang-learn/bde76364-a578-4ab8-8ac5-2f49f794072b%40gmail.com >>> . >>> >>>>>>> For more options, visit https://groups.google.com/d/optout. 
>>> >>>>>> -- >>> >>>>>> cassette tapes - analog TV - film cameras - you >>> >>>>> >>> >>>>> -- >>> >>>>> -Anton Kolonin >>> >>>>> skype: akolonin >>> >>>>> cell: +79139250058 >>> >>>>> [email protected] >>> >>>>> https://aigents.com >>> >>>>> https://www.youtube.com/aigents >>> >>>>> https://www.facebook.com/aigents >>> >>>>> https://medium.com/@aigents >>> >>>>> https://steemit.com/@aigents >>> >>>>> https://golos.blog/@aigents >>> >>>>> https://vk.com/aigents
>>> --
>>> Ben Goertzel, PhD
>>> http://goertzel.org
>>>
>>> "Listen: This world is the lunatic's sphere, / Don't always agree it's real. / Even with my feet upon it / And the postman knowing my door / My address is somewhere else." -- Hafiz
--
cassette tapes - analog TV - film cameras - you

--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/opencog.
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/CAHrUA36hBoqJ9jrhLnxnVbf-PApzCbvaYtK4JWiqQv96X6UFYQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
