On Tue, Apr 2, 2019 at 12:57 AM Anton Kolonin @ Gmail <[email protected]> wrote:
> Hi Linas,
>
> Are you saying that "while ULL team has found strong linear correlation
> between A) quality (F1) on input parses and B) quality (F1) of the output
> parses based on the grammar learned from the input parses, this phenomenon
> is due to the fact that they test on the entire input corpus so this
> phenomenon should go away once they test on gold standard corpus consisting
> only of sentences with high-frequency words"?

I am saying that I have not seen any evidence at all that you actually
constructed or counted disjuncts, or that you clustered disjuncts, or that
you controlled or managed counting in any way. So -- you did something ...
but I don't understand what that "something" is, and, based on these
conversations, that "something" does not match up with what I had hoped
that you would be doing.

It's not just high-frequency words. It's also how you perform clustering.
Are you using MI for that? Or cosines? Are you handling word-sense
disambiguation, or not? How did you handle WSD? Through orthogonalization
of cosines? Through maximization of MI? By computing a Markov vector? Some
other way? Did you perform any data cuts during
orthogonalization/maximization? What kind of cuts were they? How do the
cuts affect the F1-score?

All of these things deserve "instrumental verification". Without them, I
don't know how to assign any meaning to F1-scores (or ROC curves, which you
haven't shown - and even if you did show them, I would not know what they
mean until the above questions are resolved).

So I've got this ball of questions, and I'm getting unclear, confusing
answers to them.

-- Linas

> Best regards,
>
> -Anton
>
>
> 02.04.2019 5:38, Linas Vepstas wrote:
>
> OK, there's clearly a lot of work happening in linguistics these days
> that I have fallen behind on reading.
>
> The nature of the conversations here has been frustrating, because so far,
> it sounds like an attempt to evade the "central limit theorem" --
> https://en.wikipedia.org/wiki/Central_limit_theorem
>
> There are two related ideas I'm trying to get across: one is that if you
> make enough observations of a phenomenon, eventually, the central-limit
> theorem kicks in, and smooths over random variations. Specifically, I
> claim that, despite MST being imperfect, a large number of observations
> should smooth over the imperfections. I believe this to be true (but I
> could be wrong).
>
> The other idea is that the golden test corpus must avoid accidentally
> testing disjuncts far away from the central limit -- to avoid, as it were,
> making statements analogous to "Well, I flipped the coin three times, and I
> did not get 50-50 odds, therefore the theory doesn't work". You have to
> flip the coin at least N times, for some large N. Here, for MST, we don't
> know how big N has to be, and we don't have a good plan for determining N.
> It's worse, because everything is Zipfian, aka 1/f noise. It is possible
> that BERT or other approaches allow smaller values of N to work, but this
> is also not clear.
>
> It's also not clear that BERT would converge to a different limit than MST
> - the central-limit theorem says there is only one limit -- not two. But
> perhaps I'm misapplying it, perhaps I'm neglecting some important effect.
> Without measurements, it's hard to guess what that effect is (if it even
> exists).
>
> Anyway, I have a backlog of half-a-dozen important unread papers, so I'll
> try to get around to that "real soon now".
>
> --linas
>
>
> On Mon, Apr 1, 2019 at 12:15 AM Ben Goertzel <[email protected]> wrote:
>
>> "Replacing MST by DNN/BERT" is a strange way to put it...
>> >> DNN/BERT builds a pretty complex and comprehensive language model, >> much beyond what is done by calculation of MI values and similar >> >> The extraction of a parse dag satisfying syntactic constraints (no >> links cross, covering all words in the sentence, connected graph) is a >> conceptually simple step, and nobody is spending much time on this >> step indeed... >> >> The question of how to assign a quantitative weight to the relation >> btw two word-instances in a sentence, taking into account the specific >> context in that sentence, but also the history of co-utilization of >> those words (or other similar words), is less conceptually simple and >> this is one place I think DNN language models can help >> >> Using MST or similar parsing based on numbers exported from DNN >> language models is one way of extracting symbolic-ish structured >> knowledge from these big messy subsymbolic probabilistic language >> models... >> >> The DNNs in use now like BERT do not really satisfy me on a >> theoretical or conceptual level, but they have been tuned to work >> pretty nicely and they have been implemented pretty efficiently on >> multi-GPU hardware -- so, given this and given the quality of the >> recent practical results obtained with them -- I consider it well >> worth exploring how to use them as tools in our pursuits for grammar >> and semantics learning >> >> -- Ben >> >> On Mon, Apr 1, 2019 at 2:07 PM Linas Vepstas <[email protected]> >> wrote: >> > >> > >> > >> > On Sun, Mar 31, 2019 at 10:51 PM Anton Kolonin @ Gmail < >> [email protected]> wrote: >> >> >> >> Hi Linas, I like this thread more and more :-) >> > >> > I don't. I use a lot of CAPITALIZED WORDS below. There is a deep and >> dark fundamental misunderstanding, and I am sometimes at wits end trying to >> figure out why, and how to explain things in an understandable fashion. >> >> >> >> >But somehow, I suspect... 
Isn't this why OpenCog has "unified rule >> engine" (URE) instead of link grammar at its core, >> >> >> >> Linas, the "extraction of phrasemes" goal approaching has been >> discussed exactly in terms of MST->GL->URL on the last fall in Hong Kong >> discussion: >> https://docs.google.com/document/d/13YyqtGud0GAbVaFcc94kAd2LhGf7jTr5XDYgiuC294c/edit >> >> >> >> That is: >> >> >> >> 1) Do MST-parsing to get word links proto-disjuncts >> >> >> >> 2) Do Grammar Learning to cluster and conclude word categories and >> rules with disjuncts >> >> >> >> 3) Do URE-kind-of-thing to build the rules into "phrasemes" or >> "sections" or "patterns". >> > >> > Yes. >> >> >> >> However, your current discourse and our current results just show that >> "no one is be able to do reasonable MST-parsing" so the above is just waste >> of time, correct? >> > >> > No. Very much no. I'm saying the opposite of that. You can replace MST >> by almost *ANYTHING* else, and the quality of your results WILL NOT CHANGE! >> > >> > If the quality of your results depends on the quality of MST, you are >> DOING SOMETHING WRONG! >> > >> > I'm utterly flabbergasted. I don't know how many more times I can say >> this: stop wasting time on this unimportant step! >> >> >> >> At the time we speak, Ben, Alexely, Sergey and Asuares are trying to >> use DNN/BERT magic to do the trick 1. >> > >> > I want to call this "a complete waste of time". It will almost surely >> not improve the quality of the results! I don't understand why four smart >> people think that replacing MST by BERT will make any difference at all! >> It should not matter! Nothing depends on this step! Anything at all, >> anything with a probability better than random chance, is sufficient! Why >> isn't this obvious? >> > >> > If Ben is reading this: I recall talking to Ben about this in an >> ice-cream shop in Berlin, for an AGI conference, and he seemed to >> understand back then. I have no idea why he changed his mind. 
I really do >> not understand why everyone spends so much time obsessing about MST. Is >> this a "color of the bike shed" problem? >> https://en.wikipedia.org/wiki/Law_of_triviality >> > >> > MST-vs.-BERT==color-of-bike-shed >> > >> > Just use MST. It's simple. It works. It gives good results. Stop >> trying to improve it. The interesting problems are elsewhere! Just use >> MST, and move on to the good stuff! >> >> >> >> To my mind, that may get possible only if the DNN/BERT magic do the >> trick having the steps 2 and 3 done under the hood. If this is done, in >> such case, we don't need to do 2 and 3 after we have the DNN/BERT-based >> model, because we can simply "milk-out" the grammar rules out of DNN/BERT >> micelium for that. And we don't need the ULL as well by the way, because we >> just need DNN/BERT and rows of different sorts of milk machines around it. >> > >> > So why are you bothering to work on ULL? >> >> >> >> So, instead of solving the problem of constructing the pipeline for >> learning grammar from raw text we need to solve the problem of milking the >> grammar out of DNN/BERT model trained on these texts, right? >> > >> > Because I don't think that you know how to milk lexical functions out >> of DNN/BERT -- We've wasted more than a year talking about MST. Instead of >> endlessly talking about MST, you could have JUST USED IT, WITHOUT ANY >> MODIFICATIONS, gotten good results, and spent the year working on something >> interesting! >> > >> > Again: replacing MST by DNN/BERT with something else will NOT IMPROVE >> the accuracy! You'll have exactly the same accuracy as before, and if your >> accuracy improves, it is because you are doing something wrong! >> > >> >> However, either way, we need to understand algorithmic machinery of >> how the links assemble in disjuncts and disjuncts assemble into sections, >> through the universe-scale combinatorial explosion. >> > >> > No. That is the OPPOSITE of what ACTUALLY HAPPENS!!!! 
>> >> >> >> And I agree that clustering and categorizing word and links (and then >> disjuncts and sections, right) is part of the process - explicitly in ULL >> pipeline or implicitly deep in DNN/BERT darkness. >> > >> > It is NOT DEEP AND DARK. I wrote not one but TWO PAPERS on this, >> CASTING LIGHT ON THAT DARKNESS >> > >> > I'm frustrated to the 43rd degree on why I cannot seem to have a >> reasonable conversation with any other human being about any of this. >> > >> > -- Linas >> > >> >> Cheers, >> >> >> >> -Anton >> >> >> >> >> >> 01.04.2019 9:17, Linas Vepstas: >> >> >> >> >> >> >> >> On Thu, Mar 28, 2019 at 10:22 AM Ivan V. <[email protected]> wrote: >> >>> >> >>> Linas Vepstas wrote: >> >>> >> >>> >... knowledge extraction can be done generically, and not just on >> language. >> >>> >> >>> If link grammar would be Turing complete, this might be possible >> right away. >> >> >> >> >> >> In my experience, thinking about Turing completeness is unproductive >> and a distraction. >> >> >> >>> But somehow, I suspect... Isn't this why OpenCog has "unified rule >> engine" (URE) instead of link grammar at its core, >> >> >> >> >> >> No. It has the rule-engine because back then, I did not understand >> sheaves. I'm starting to think that the rule engine is a strategic >> mistake. The original idea is that rule-application is the main conceptual >> abstraction of term-rewriting. One rewrites, or proves theorems by >> applying sequences of rules. It turns out that discovering the right >> sequence is hard. Finding correct long sequences is hard - a combinatorial >> explosion. >> >> >> >> The openpsi system addresses some of these issues. Unfortunately, it's >> current implementation is a tangle of rule-selection mechanisms, and >> theories of human psychology. It's probably better than the URE, but is >> currently not as powerful. 
>> >> >> >> I'm trying to place a theory of sheaves as a replacement for URE, and >> as the natural generalization of openpsi, but I've successfully >> self-sabotaged myself in these efforts. >> >> >> >>> >> >>> and with URE things get much more complicated. I'm sorry, but that is >> still a Gordian knot to me, considering all of my modest knowledge. >> >> >> >> >> >> We all have modest knowledge. That is the nature of the human >> condition. >> >> >> >>> >> >>> On the other hand, if someone really smart would provide automatic >> grammar extraction by means of unrestricted grammar, I believe that would >> be it. >> >> >> >> >> >> Yes, that is the goal of the language-learning project. However, as >> noted in my last email (on the link-grammar list) it is not enough to just >> learn a semi-Thue system, declare victory, and go home. The example I gave >> there: >> >> >> >> "I think that you should give that car a second look" >> >> "you should really give that song a second listen" >> >> "maybe you should give Sue a second chance". >> >> >> >> Learning to parse these "set phrases" or phrasemes is equivalent to >> learning a semi-Thue system; however, its not enough to realize that all >> three are forms of advice-giving, having "conserved" or "fixed" regions "x >> YOU SHOULD y GIVE z SECOND w" where z is very highly variable having >> millions of variations, and w only has a few dozen allowed variations. >> Note that the words "fixed", "conserved", "variable" are words used in >> genetics and proteomics and antibody structure. Its the same idea. >> >> >> >> The goal of learning lexical functions (LF's) is to learn that all >> three are advice-giving forms, and also to learn what is, and what can be >> plugged in for x,y,z,w. So, although a super-whiz-bang grammar learner >> capable of learning context-sensitive languages should be able to learn "x >> YOU SHOULD y GIVE z SECOND w", it still will not know the *meaning* of this >> phrase. 
To know the *meaning*, you have to know the acceptable ranges (as >> fuzzy-sets) of x,y,z,w. >> >> >> >> To conclude, thinking about Turing-completeness is a waste of time, >> because Turing completeness only tells you that "x YOU SHOULD y GIVE z >> SECOND w" is recursively enumerable; it does not tell you what it actually >> means. >> >> >> >> Put another way: having a universal Turing machine is not the same as >> knowing how some particular program works. Automagically learning a >> context-sensitive grammar is not enough to know what that grammar is >> "saying/doing". >> >> >> >> -- Linas >> >> >> >>> >> >>> >> >>> Thank you, >> >>> Ivan V. >> >>> >> >>> >> >>> čet, 28. ožu 2019. u 07:58 Anton Kolonin @ Gmail <[email protected]> >> napisao je: >> >>>> >> >>>> Ben, Linas, >> >>>> >> >>>> >But we know that MST parsing is shit. Stop wasting time on MST or >> trying to "improve" it. >> >>>> >> >>>> I think that sounds like kind of support for the concept of "dumb >> explosive parsing" being advocated for 1+ year ago: >> >>>> >> >>>> >> https://docs.google.com/document/d/14MpKLH5_5eVI39PRZuWLZHa1aUS73pJZNZzgigCWwWg/edit#heading=h.aqo9bumb3doy >> >>>> >> >>>> I also agree we other Linas'es reasoning in this thread. I would >> consider giving it a try starting next month if we don't have a >> breakthrough with DNN-MI-milking-based-MST-Parsing by that time. >> >>>> >> >>>> > can be done generically, and not just on language >> >>>> >> >>>> I think everyone in bio-informatics dreams of extracting secrets of >> "dark side of the genome" with something like that ;-) >> >>>> >> >>>> Cheers, >> >>>> >> >>>> -Anton >> >>>> >> >>>> >> >>>> 28.03.2019 1:24, Linas Vepstas пишет: >> >>>> >> >>>> Hi Anton, >> >>>> >> >>>> I've cc'ed the link-grammar mailing list, because I describe below >> some concepts for word-sense disambiguation. 
I'm also cc'ing the opencog >> mailing list and ivan vodisek, because after studying hilbert systems, I >> think he's ready to think about how knowledge extraction can be done >> generically, and not just on language. >> >>>> >> >>>> -- Linas >> >>>> >> >>>> On Mon, Mar 25, 2019 at 1:39 AM Anton Kolonin @ Gmail < >> [email protected]> wrote: >> >>>>> >> >>>>> Hi Linas, >> >>>>> >> >>>>> >I'd call it "interesting", but maybe not "golden" >> >>>>> >> >>>>> These are randomly selected sentences from "Gutenberg Children" >> corpus: >> >>>>> >> >>>>> >> http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/lower_LGEng_token/ >> >>>>> >> >>>>> "Gutenberg Children silver standard" is LG-English parses: >> >>>>> >> >>>>> >> http://langlearn.singularitynet.io/data/parses/English/Gutenberg-Children-Books/test/GCB-LG-English-clean.ull >> >>>>> >> >>>>> "Gutenberg Children gold standard" is subset of "silver standard" >> with semi-random selection of sentences skipping direct speech and doing >> manual verification of the links. >> >>>>> >> >>>>> So as long as we are training on "Gutenberg Children" corpus, >> having the test on the same "Gutenberg Children" seems reasonable, right? >> >>>> >> >>>> >> >>>> Yes. You still need to verify that each word in the "golden" corpus >> occurs at least N=10 or 20 times in the training corpus. The dependency of >> accuracy on N is not generally known, but it is very clear that if a word >> occurs only N=3 times in the training corpus, then whatever is learned >> about it will be very low quality. >> >>>> >> >>>>> >> >>>>> But thanks, we may have put mire effort in removal of ancient >> constructions and words even if these are present in the corpus. >> >>>> >> >>>> If you consistently train on 19th century literature, and then >> evaluate 19th-century literature comprehension, that's fine. Just don't >> expect it to work for 21st century blog posts. 
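The check asked for above (every word in the "golden" corpus occurring at least N times in the training corpus) can be sketched roughly as follows. This is only an illustration under stated assumptions: whitespace tokenization of lower-cased text, a threshold named `min_count`, and toy sentences invented for the example. The function names are hypothetical and are not part of the ULL pipeline.

```python
from collections import Counter


def count_words(sentences):
    """Count lower-cased, whitespace-separated tokens over an iterable of sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    return counts


def split_gold_by_support(gold_sentences, train_counts, min_count=10):
    """Partition gold sentences by whether every word has enough training support.

    Returns (kept, rejected); each rejected entry is (sentence, rare_words),
    where rare_words are the words seen fewer than min_count times in training.
    """
    kept, rejected = [], []
    for sentence in gold_sentences:
        words = sentence.lower().split()
        rare = sorted({w for w in words if train_counts[w] < min_count})
        if rare:
            rejected.append((sentence, rare))
        else:
            kept.append(sentence)
    return kept, rejected


# Toy data, invented for illustration only (not taken from the real corpus files).
train = ["the dog saw some wood"] * 10 + ["the lama snuffed blandly"]
gold = ["the dog saw wood", "the lama snuffed blandly"]

kept, rejected = split_gold_by_support(gold, count_words(train), min_count=10)
print(kept)      # gold sentences whose words all have >= min_count observations
print(rejected)  # gold sentences paired with their under-observed words
```

Reporting the rejected sentences together with their rare words, rather than silently dropping them, makes it possible to plot F1 against N later, since the dependency of accuracy on N is exactly what is not yet known.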
>> >>>> >> >>>> The strongest effect will be the N=number of observations effect. >> >>>> >> >>>>> >> >>>>> >Anyway -- you only indicate pair-wise word-links. Is the omission >> of disjuncts intentional? >> >>>>> >> >>>>> If you have all links in the sentence, you can construct all of the >> disjuncts with o ambiguity, correct? >> >>>> >> >>>> No, but only because you did not indicate the link-type. The whole >> point of a clustering step is to obtain a link-type; if you discard it, you >> will never get better-than-MST results. The link-type is critical for >> obtaining the word-classes. The whole point of learning is to learn the >> word-classes; you've learned very little, if you know only word-pairs. >> >>>> >> >>>> Consider this example: >> >>>> >> >>>> I saw wood >> >>>> I saw some wood >> >>>> >> >>>> A solution that would be "almost perfect" (or "golden") would be >> this: >> >>>> >> >>>> saw: {performer-of-actions}- & {sculptable-mass}+; >> >>>> saw: {observer}- & {viewable-thing}+; >> >>>> >> >>>> These disambiguate the two different senses of the word "saw". It's >> impossible to have word-sense disambiguation without actually having these >> disjuncts. The word-pairs alone are not sufficient to report the link-type >> connecting the words. Clustering gives the other dictionary entries: >> >>>> >> >>>> I: {performer-of-actions}+ or {observer}+; >> >>>> wood: {sculptable-mass}- or ({quantity-determiner}- & >> {viewable-thing}-); >> >>>> some: {quantity-determiner}+; >> >>>> >> >>>> Thus, the pronoun "I" also belong to two different word-sense >> categories: performers and observers. Compare to: >> >>>> >> >>>> "The chainsaw saws wood" -- a "chainsaw" can be a "performer of >> actions" but cannot be an "observer". >> >>>> "The dog saw some wood" -- dogs can be observers. They can perform >> some actions; like run, jump, but they cannot saw, hammer, cut, stab. >> >>>> >> >>>> The link-type is absolutely crucial to understanding a word. 
The >> language-learning project is all about learning the link-types. Without >> correct link-type assignments, you cannot have correct parses. >> >>>> >> >>>> ... which is 100% of the problem with MST. The problem with MST is >> not so much that "its not accurate" -sure, it is not terribly accurate. But >> even if MST or some MST-replacement was 100% accurate, it would still be >> "wrong" because it fails to indicate the link-type. If you want to >> understand a sentence, you MUST know the link-types! >> >>>> >> >>>> Otherwise, you just have "green ideas sleep furiously", which >> parses, but only because the link types have been erased, or made stupid. >> Here's a stupid grammar: >> >>>> >> >>>> ideas: {adjective}- & {verb}+; >> >>>> green: {adjective}+; >> >>>> >> >>>> which allows "green ideas" to parse. But of course, this is wrong; >> it should have been: >> >>>> >> >>>> ideas: {noospheric-modifier}- & {concept-manipulating-verb}+; >> >>>> green: {physical-object-modifier}+; >> >>>> >> >>>> and now it is clear that "green ideas" cannot parse, because the >> link-types clash. >> >>>> >> >>>> * If you cluster down to 5 or 6 clusters (adjective, verb, noun ...) >> you will get very low quality grammars. >> >>>> >> >>>> * If you cluster to 200 or 300 clusters, you get sort-of-OK >> grammars. This is what deep-learning/neural-nets do: this is why the >> deep-learning systems seem to give nice results: 200 or 300 features is >> enough to start having adequate functional distinctions (e.g. the famous >> "king - male+female=queen" example, or "paris-france+germany=berlin" >> example) >> >>>> >> >>>> * If you cluster to 3K to 8K clusters, you start having a quite >> decent model of language >> >>>> >> >>>> * Note that wordnet has 117K "synsets". >> >>>> >> >>>> Note that in the above example: >> >>>> wood: {sculptable-mass}- or ({quantity-determiner}- & >> {viewable-thing}-); >> >>>> >> >>>> the things in the curly-braces are effectively "synsets". 
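To make the "saw" example concrete, here is a small sketch of how disjuncts could be assembled from the links of a parse, and why untyped word-pairs lose the word-sense distinction. The function name, the tuple layout `(left_index, right_index, link_type)`, the placeholder type `ANY`, and the nearest-word-first connector ordering are all assumptions made for this illustration; it is not the Link Grammar or ULL implementation.

```python
def disjuncts_from_links(words, links):
    """Build one disjunct per word from typed links.

    links: iterable of (left_index, right_index, link_type) with left_index < right_index.
    Each link gives the left word a "{type}+" connector and the right word a "{type}-".
    """
    left = {i: [] for i in range(len(words))}   # links arriving from the left
    right = {i: [] for i in range(len(words))}  # links going to the right
    for i, j, link_type in links:
        right[i].append((j, link_type))
        left[j].append((i, link_type))

    disjuncts = []
    for idx, word in enumerate(words):
        # Nearest word first on each side (an ordering assumed for this sketch).
        lefts = [f"{{{t}}}-" for _, t in sorted(left[idx], reverse=True)]
        rights = [f"{{{t}}}+" for _, t in sorted(right[idx])]
        disjuncts.append((word, " & ".join(lefts + rights)))
    return disjuncts


# Typed links for "I saw some wood" (the seeing sense), using the thread's link-type names.
typed = disjuncts_from_links(
    ["I", "saw", "some", "wood"],
    [(0, 1, "observer"), (1, 3, "viewable-thing"), (2, 3, "quantity-determiner")],
)

# The same kind of parse with the link-types erased, for both senses of "saw".
untyped_see = disjuncts_from_links(
    ["I", "saw", "some", "wood"], [(0, 1, "ANY"), (1, 3, "ANY"), (2, 3, "ANY")]
)
untyped_cut = disjuncts_from_links(["I", "saw", "wood"], [(0, 1, "ANY"), (1, 2, "ANY")])

for word, disjunct in typed:
    print(f"{word}: {disjunct};")
print(dict(untyped_see)["saw"], "|", dict(untyped_cut)["saw"])
```

With typed links, "saw" gets `{observer}- & {viewable-thing}+`; with the types erased, both senses collapse to the same `{ANY}- & {ANY}+`, which is the point being made: word-pairs alone cannot carry word-sense information.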
>> >>>> >> >>>> The next set of goal-posts is to have disjuncts, of maybe low-medium >> quality, and use these to extract ontologies. e.g. >> >>>> {sculptable-mass} is-a {mass} is-a {physical-thing} is-a {thing} >> >>>> >> >>>> You can try to do this by clustering but there are probably better >> ways of discovering ontology. >> >>>> >> >>>> >> >>>>> >> >>>>> >Also -- no hint of any word-classes or part-of-speech tagging? >> This is surely important to evaluate as well, or is this to be done in some >> other way? i.e. to evaluate if "Pivi" was correctly clustered with other >> given names? Or that lama/llama was clustered with other four-legged >> animals? >> >>>>> >> >>>>> We don't have that in MST-Parsing, right? We need this corpus to >> assess the quality of the MST-Parsing so we don't need part-of-speech >> information for that. >> >>>> >> >>>> But we know that MST parsing is shit. Stop wasting time on MST or >> trying to "improve" it. We already know that it is close to a high-entropy >> path to structure; trying to squeeze a few more percent of entropy is not >> worth the effort, not at this time. Focus on finding a high-entropy >> structure extraction algorithm, don't waste time on MST. >> >>>> >> >>>> You should be focusing on extracting disjuncts, word-classes, >> word-senses, and trying to improve the quality of those. If you obtain a >> high-entropy path to these structures, the quality of your parses will >> automatically improve. Focus on the entropy numbers. Try to maximize that. >> >>>> >> >>>>> The clustering is able to do that anyway - see the graphs in the >> end of the last year report: >> >>>>> >> >>>>> >> https://docs.google.com/document/d/1gxl-hIqPQCYPb9NNkyA3sBYUyfwvJFvT1hZ5ZpXsaPc/edit#heading=h.twoiv52o0tou >> >>>>> >> >>>>> >Also -- I can't tell -- is it free of loops, or are loops >> allowed? Allowing loops tends to provide stronger, more accurate parses. >> Loops act as constraints. 
>> >>>>> >> >>>>> The loops and crossing links are not allowed in the MST-Parser now. >> If we allow them in the test corpus, how could it make assessment of >> MST-Parses better? >> >>>>> >> >>>>> Note, that we ARE working we MST-Parses now - accordingly to Ben's >> directions. >> >>>> >> >>>> >> >>>> Not to say bad things about Ben, but I'm certain he has not actually >> thought about this problem very much. He is very very busy doing other >> things; he is not thinking about this stuff. I have repeatedly tried to >> explain the issues to him, and its quite clear that he is far away from >> understanding them, from working at the level that I would like to have you >> and your team work at. >> >>>> >> >>>> I'm trying to have you make small, quantified baby-steps, to verify >> the accuracy of your methods and data. What I'm seeing is that you are >> attempting to make giant-steps, without verification, and then getting >> low-quality results, without understanding the root causes for them. You >> can't dig yourself out of a ditch, and digging harder and more furiously >> won't raise the accuracy of the parse results. >> >>>> >> >>>> --linas >> >>>> >> >>>>> We have your MST-Parser-less idea on the map but we are NOT trying >> it now: >> >>>>> >> >>>>> https://github.com/singnet/language-learning/issues/170 >> >>>>> >> >>>>> We may try it after we explore the account for costs >> >>>>> >> >>>>> https://github.com/singnet/language-learning/issues/183 >> >>>>> >> >>>>> Thanks, >> >>>>> >> >>>>> -Anton >> >>>>> >> >>>>> 24.03.2019 9:24, Linas Vepstas пишет: >> >>>>> >> >>>>> Also, BTW, link-grammar cannot parse "I just stood there, my hand >> on the knob, trembling like a leaf." correctly. It is one of a class of >> sentences it does not know about. Which is maybe OK, because ideally, the >> learned grammar will be able to do this. But today, LG cannot. 
>> >>>>> >> >>>>> --linas >> >>>>> >> >>>>> On Sat, Mar 23, 2019 at 9:12 PM Linas Vepstas < >> [email protected]> wrote: >> >>>>>> >> >>>>>> Anton, >> >>>>>> >> >>>>>> It's certainly an unusual corpus, and it might give you rather low >> scores. I'd call it "interesting", but maybe not "golden". Although I >> suppose it depends on your training corpus. Here are some problems that >> pop out: >> >>>>>> >> >>>>>> First sentence -- >> >>>>>> "the old beast was whinnying on his shoulder" -- the word >> "whinnying" is a fairly rare English verb -- you could read half-a-million >> wikipedia articles, and not see it once. You could read lots of >> 19th-century or early-20th century cowboy/adventure novels, (like what >> you'd find on Project Gutenberg) and maybe see it some fair amount. Even >> then -- to "whinny on a shoulder" seems bizarre.. I guess he's hugging the >> horse? How often does that happen, in any cowboy novel? "to whinny on >> something" is an extremely rare construction. It will work only if you've >> correctly categorized "whinny" as a verb that can take a preposition. Are >> your clustering algos that good, yet, to correctly cluster rare words into >> appropriate verb categories? >> >>>>>> >> >>>>>> Second sentence .. "Jims" is a very uncommon name. Frankly, I've >> never heard of it as a name before. Your training data is going to be >> extremely slim on this. And lack of training data means poor statistics, >> which means low scores. Unless -- again, your clustering code is good >> enough to place "Jims" in a "proper name" cluster... >> >>>>>> >> >>>>>> "the lama snuffed blandly" -- "snuffed" is a very uncommon, almost >> archaic verb. These days, everyone spells llama with two ll's not one. >> Unless your talking about Buddhist monks, its a typo. >> >>>>>> >> >>>>>> "you understand?" is .. awkward. Common in speech, uncommon in >> writing. Unlikely that you'll have enough training data for this. >> >>>>>> >> >>>>>> "Willard" is an uncommon name. 
Does your training corp[us have a >> sufficient number of mentions of Willard? Do you have clustering working >> well enough to stick "Willard" into a cluster with other names? >> >>>>>> >> >>>>>> "it is so with Sammy Jay" is clearly archaic English. >> >>>>>> >> >>>>>> "he hasn't any relations here" is clearly archaic, an >> olde-fashioned construction. >> >>>>>> >> >>>>>> "Pivi said not one word" - again, a clearly old-fashioned >> construction. Does the training set contain enough examples of "Pivi" to >> recognize it as a name? Are names clustering correctly? >> >>>>>> >> >>>>>> Any sentence with an inversion is going to sound old-fashioned. >> All of the sentences in that corpus sound old-fashioned. Which maybe is OK >> if you are training on 19th century Gutenberg texts .. but its certainly >> not modern English. Even when I was a child, and I read those old >> crumbly-yellow paper adventure books, part of the fun was that no one >> actually talked that way -- not at school, not at home, not on TV. It was >> clearly from a different time and place -- an adventure. >> >>>>>> >> >>>>>> Anyway -- you only indicate pair-wise word-links. Is the omission >> of disjuncts intentional? Also -- no hint of any word-classes or >> part-of-speech tagging? This is surely important to evaluate as well, or is >> this to be done in some other way? i.e. to evaluate if "Pivi" was >> correctly clustered with other given names? Or that lama/llama was >> clustered with other four-legged animals? >> >>>>>> >> >>>>>> Also -- I can't tell -- is it free of loops, or are loops >> allowed? Allowing loops tends to provide stronger, more accurate parses. >> Loops act as constraints. >> >>>>>> >> >>>>>> -- Linas >> >>>>>> >> >>>>>> On Thu, Mar 21, 2019 at 11:09 PM Anton Kolonin @ Gmail < >> [email protected]> wrote: >> >>>>>>> >> >>>>>>> Hi Linas, Andes and whoever understands LG and English well >> enough both. 
>> >>>>>>> >> >>>>>>> Attached are first 100 sentences for GC "gold standard" - >> manually checked based on LG parses. >> >>>>>>> >> >>>>>>> We are expecting more to come in the next two weeks. >> >>>>>>> >> >>>>>>> To enable that, please have cursory review of the corpus and let >> us know if there are corrections still needed so your corrections will be >> used as a reference to fix the rest and keep going further. >> >>>>>>> >> >>>>>>> Thank you, >> >>>>>>> >> >>>>>>> -Anton >> >>>>>>> >> >>>>>>> >> >>>>>>> -- >> >>>>>>> You received this message because you are subscribed to the >> Google Groups "lang-learn" group. >> >>>>>>> To unsubscribe from this group and stop receiving emails from it, >> send an email to [email protected]. >> >>>>>>> To post to this group, send email to [email protected]. >> >>>>>>> To view this discussion on the web visit >> https://groups.google.com/d/msgid/lang-learn/bde76364-a578-4ab8-8ac5-2f49f794072b%40gmail.com >> . >> >>>>>>> For more options, visit https://groups.google.com/d/optout. >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> -- >> >>>>>> cassette tapes - analog TV - film cameras - you >> >>>>> >> >>>>> >> >>>>> >> >>>>> -- >> >>>>> cassette tapes - analog TV - film cameras - you >> >>>>> >> >>>>> -- >> >>>>> -Anton Kolonin >> >>>>> skype: akolonin >> >>>>> cell: +79139250058 >> >>>>> [email protected] >> >>>>> https://aigents.com >> >>>>> https://www.youtube.com/aigents >> >>>>> https://www.facebook.com/aigents >> >>>>> https://medium.com/@aigents >> >>>>> https://steemit.com/@aigents >> >>>>> https://golos.blog/@aigents >> >>>>> https://vk.com/aigents >> >>>> >> >>>> >> >>>> >> >>>> -- >> >>>> cassette tapes - analog TV - film cameras - you >> >>>> -- >> >>>> You received this message because you are subscribed to the Google >> Groups "lang-learn" group. >> >>>> To unsubscribe from this group and stop receiving emails from it, >> send an email to [email protected]. >> >>>> To post to this group, send email to [email protected]. 
>> >>>> To view this discussion on the web visit >> https://groups.google.com/d/msgid/lang-learn/CAHrUA36dE5ihtcCaqPv_q4qgmbEy-yX6kTkUHyLZmjk6d4VfOg%40mail.gmail.com >> . >> >>>> For more options, visit https://groups.google.com/d/optout. >> >>>> >> >>>> -- >> >>>> -Anton Kolonin >> >>>> skype: akolonin >> >>>> cell: +79139250058 >> >>>> [email protected] >> >>>> https://aigents.com >> >>>> https://www.youtube.com/aigents >> >>>> https://www.facebook.com/aigents >> >>>> https://medium.com/@aigents >> >>>> https://steemit.com/@aigents >> >>>> https://golos.blog/@aigents >> >>>> https://vk.com/aigents >> >> >> >> >> >> >> >> -- >> >> cassette tapes - analog TV - film cameras - you >> >> >> >> -- >> >> -Anton Kolonin >> >> skype: akolonin >> >> cell: +79139250058 >> >> [email protected] >> >> https://aigents.com >> >> https://www.youtube.com/aigents >> >> https://www.facebook.com/aigents >> >> https://medium.com/@aigents >> >> https://steemit.com/@aigents >> >> https://golos.blog/@aigents >> >> https://vk.com/aigents >> > >> > >> > >> > -- >> > cassette tapes - analog TV - film cameras - you >> >> >> >> -- >> Ben Goertzel, PhD >> http://goertzel.org >> >> "Listen: This world is the lunatic's sphere, / Don't always agree >> it's real. / Even with my feet upon it / And the postman knowing my >> door / My address is somewhere else." -- Hafiz >> > > > -- > cassette tapes - analog TV - film cameras - you > -- > You received this message because you are subscribed to the Google Groups > "lang-learn" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to [email protected]. > To post to this group, send email to [email protected]. 
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/lang-learn/CAHrUA35rQWNZDg-LmgBVjcLX%3DF6nceWvDXFq%2B-mfc4rJiqqG3g%40mail.gmail.com
> .
> For more options, visit https://groups.google.com/d/optout.

--
cassette tapes - analog TV - film cameras - you

--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/opencog.
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/CAHrUA36nTsYOOcsJf3t%2BxnSQYF2FYNK4yj-bEXgNgOtXL3NrVw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
