On Wed, Jul 1, 2020 at 11:42 AM Amirouche Boubekki <[email protected]> wrote:
> On Tue, Jun 30, 2020 at 20:04, Linas Vepstas <[email protected]> wrote:
>>
>> Hi Amirouche,
>>
>> Such statistical systems might currently do as well as or better than
>> grammar,
>
> They will always fail on new grammar rules, and re-training the
> algorithm over those is painful, I guess. Maybe new grammar rules are
> not a use-case, since nowadays there is a grammar-police robot in
> every browser...

Well, I was trying to be expansive. Some people want to build a
grammar-police robot, and there are various ways of doing that. For
OpenCog, a goal is to understand speech, and if you look at
transcripts of Zoom videos, you quickly appreciate that no one speaks
in a "grammatical" fashion... Written and spoken English are almost
two different languages.

>> since, it turns out, writing a complete grammar is impossible; there
>> is always yet one more, extremely rare exception to every rule.
>> Roughly speaking, grammar rules have a Zipfian distribution.
>
> I mean, probably yes, with the approach taken by link-grammar, where
> only a couple of experts can edit the grammar. My goal is to make it
> also easy for the user to improve the grammar,

I believe that this is a fundamentally impossible task. The proof
would be: if it were easy, some linguist would have done it by now. I
mean, linguists have been working on grammar since before there were
computers, and what have we got?

> hence the single source of truth database

There is no single source of truth for language. Every human being
speaks and understands a language differently. You are not aware of
it, because there is enough overlap that you can communicate with
others. But if you take samples of Irish-American or German-American
from the 1920s, or even Black-American or Hispanic-American from
modern times, you will discover that their grammar is quite
drastically different from New York Times English. And there are more
gradations: the rules that work for Irish-American fail on
British-Irish.
The geographical separation, and the separation of 2-5 decades in
time, means that these two dialects have diverged. They are quite
different. There is no single source of truth.

> that will include the dictionary of words, grammar
> relations between words, and test cases for each relation.

There is not even one unique parse for any given sentence. Poetry is
filled with ambiguous sentences. But if you want a textbook example:
"I saw the man with the telescope", which has two parses that are both
equally plausible. This is surprisingly common. Mild alterations of
wording can have dramatic changes in parses and meaning. Simple
substitutions of "synonymous" words can have dramatic changes in
parses and meaning.

> based on a subset
> of English grammar that is not ambiguous

There is no such subset. It does not exist.

>> "A sat solver and an okvs" -- we've played with SAT. Basically, the
>> SAT solvers are slower. They used to be sometimes-faster, but Amir's
>> work fixed that.
>
> How much slower? 2 or 3 times slower?

Now? Maybe 1.1x to 5x slower. It depends on the sentence. It depends
on the dictionary (say, the Russian vs. the English dict). The
algorithms are different; the performance profiles are different. If
you are starting from scratch, then the SAT solver is "good enough".
In LG, there is no incentive to pursue it; the SAT solver is not as
flexible, and it has trouble with parse ranking.

I mean, there is nothing wrong with minisat as a SAT solver. It's
fine. What's wrong is the idea that SAT is suitable for parsing. It's
not. Basically, if you assign each grammar rule a probability (or a
probability and a confidence), then you would like to rank parses
according to probability. The SAT algos don't know how to avoid
exploring low-probability regions. It's a classic
combinatorial-explosion problem: you have to perform hill-climbing, or
simulated annealing, or whatever, as you parse, and trim the search
space as much as possible.
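That rank-and-trim strategy can be sketched as a simple beam search over per-word alternatives. This is only an illustration: the words, tags, probabilities, and the `beam_parse` helper below are all invented for the example, and this is not the link-grammar algorithm itself.

```python
import math

# Hypothetical toy lexicon: each word lists (tag, probability)
# alternatives. The entries are made up purely for illustration.
LEXICON = {
    "saw":       [("VERB", 0.7), ("NOUN", 0.3)],
    "the":       [("DET", 1.0)],
    "man":       [("NOUN", 0.9), ("VERB", 0.1)],
    "telescope": [("NOUN", 1.0)],
}

def beam_parse(words, beam_width=4):
    """Rank partial analyses by log-probability, keeping only the best
    few at every step -- trimming the search space as we go, instead of
    exhaustively enumerating every combination the way a SAT solver
    would."""
    beam = [([], 0.0)]  # (tags-so-far, log-probability)
    for word in words:
        expanded = [(tags + [tag], logp + math.log(p))
                    for tags, logp in beam
                    for tag, p in LEXICON[word]]
        # Trim: keep only the most probable partial analyses.
        expanded.sort(key=lambda t: t[1], reverse=True)
        beam = expanded[:beam_width]
    return [(tags, math.exp(logp)) for tags, logp in beam]

for tags, prob in beam_parse(["the", "man", "saw", "the", "telescope"]):
    print(tags, round(prob, 3))
```

With a beam width of 4, low-probability combinations are discarded as soon as they fall off the beam, so the work stays linear in sentence length instead of exploding with the number of alternatives per word.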
SAT algos don't do that -- they explore the entire search space, and
return a yes-no answer. You don't want a yes/no answer, you want a
likelihood. SAT solvers choke on any problem that has a combinatorial
explosion.

There is an interesting middle ground, obtained by stealing the key
SAT idea. The key idea behind SAT is: prune away all trees, all DAGs,
and then exhaustively explore the remaining multiply-connected,
combinatorially-explosive core. SAT is "fast" because "most" (!??)
problems, after pruning, have tiny cores. So, likewise, in the "ideal
parser", one would prune away all branches, and then explore the
combinatorially-explosive core with your favorite combinatorial
explorer: hill-climbing, simulated annealing, bayesian nets, neural
nets, genetic algos, or whatever.

I've been trying to get Ben and others interested in this idea for
umpteen years, but no luck yet. I'm even writing grant proposals. But
no one understands what I say 😢 I've been told that it's either
"trivially obvious" or it's "gibberish"... I can't find the middle
ground...

> Thanks for the enlightening conversation. I was imagining such a
> thing but did not have time to look into aspell's intricacies.

There is nothing at all intricate about aspell. You give it a
mis-spelled word, and it offers up alternative suggestions. It's based
on ispell, which is maybe 40 years old now, or more? Pre-internet
software.

> It seems to me learning grammar in
> an unsupervised way is not ready?

It's not ready. I encountered two very different problems. In some of
the learned grammars, a large number of alternatives are produced, and
the existing link-grammar parser is too slow. There are 5- or 10-word
sentences that take a day or two to parse, because the lexis contains
10K alternatives per word. If you have the patience to wait that long,
the resulting parse looks pretty good... This is the combinatorial
explosion I was talking about.
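To make the prune-then-explore idea concrete, here is a sketch of just the tree-stripping half, on a plain undirected graph (the graph and vertex names are invented for the example): repeatedly deleting degree-one vertices removes every tree-like branch, and whatever survives is the multiply-connected core that actually needs combinatorial search.

```python
from collections import defaultdict

def prune_to_core(edges):
    """Repeatedly strip vertices of degree <= 1 (i.e. all tree-like
    branches); whatever survives is the multiply-connected core."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    changed = True
    while changed:
        changed = False
        for v in list(adj):
            if len(adj[v]) <= 1:
                for u in adj[v]:
                    adj[u].discard(v)  # detach v from its neighbor
                del adj[v]
                changed = True
    return {v: sorted(adj[v]) for v in adj}

# A triangle (the core) with a tree hanging off one corner:
edges = [("a", "b"), ("b", "c"), ("c", "a"),   # cycle: survives pruning
         ("c", "d"), ("d", "e"), ("d", "f")]   # tree: pruned away
print(prune_to_core(edges))  # only the triangle a-b-c remains
```

In this toy, the whole d-e-f branch vanishes in a couple of pruning passes and only the triangle is left for the expensive explorer (hill-climbing, annealing, or whatever) to chew on.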
Of course, I don't have to generate those kinds of dictionaries --
there are others that parse very quickly, but with lower accuracy.

The other, more important problem is the lack of a reference standard.
The solution to that is obvious: treat parsing as one half of a
transducer pair, with a generator as the other half. Think of, for
example, a microphone and a loudspeaker. Ideally, you push a sine wave
through a loudspeaker, and get a sine wave back on the microphone. You
cannot measure the quality of the microphone... or the loudspeaker...
without this.

The solution is to generate random grammars, consisting of N words, M
grammatical rules, and a Zipfian or other probability distribution for
the use of words/rules. Then generate a corpus of K "random"
sentences. Now apply the learning tool... can you learn all N words
and all M rules correctly? How does the accuracy depend on the size K
of the corpus? How fast can one learn, as a function of N, M, and the
word distribution?

I've started to build that generator, but I am not done yet. Trying to
use human-annotated corpora gives garbage. It's a beginner mistake.

--linas

--
Verbogeny is one of the pleasurettes of a creatific thinkerizer.
        --Peter da Silva

--
You received this message because you are subscribed to the Google
Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/opencog/CAHrUA37t2DrdJs_WPQBhGzRs31S2m3HrXmnOW0s2AnZC3gUTjw%40mail.gmail.com.
