Hi Amirouche,

There are other, far more sophisticated ways of doing spell checking. The best way is to do context-dependent checking... and link-grammar provides extremely precise context. For example, "I gave him teh hammer" -- the LG spelling guesser offers up "the", "then", "ten" as possible fixes. However, only one of these choices leads to a grammatically correct sentence. Thus, the other spelling guesses can be discarded, because they lead to grammatical nonsense.
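The filtering step described above can be sketched in a few lines of Python. In the real system the link-grammar parser decides which candidates parse; here a hypothetical `is_grammatical` stand-in (a lookup against a known-good sentence) replaces the parser, purely to show the shape of the candidate-filtering logic:

```python
# Toy sketch of context-dependent spell checking: take the spelling
# guesser's candidates and keep only those that yield a grammatical
# sentence. `is_grammatical` is a hypothetical stand-in for a real
# parser call; everything else mirrors the "teh hammer" example.

CANDIDATES = ["the", "then", "ten"]     # guesser's fixes for "teh"
SENTENCE = "I gave him {} hammer"

# Stand-in for "does this sentence parse?" -- a real system would
# invoke the link-grammar parser here instead of a set lookup.
GOOD = {"I gave him the hammer"}

def is_grammatical(sentence):
    return sentence in GOOD

def filter_guesses(template, guesses):
    """Keep only guesses that produce a parseable sentence."""
    return [g for g in guesses if is_grammatical(template.format(g))]

print(filter_guesses(SENTENCE, CANDIDATES))  # -> ['the']
```

Only "the" survives; "then" and "ten" are discarded because the sentences they produce do not parse.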
Because of this, spelling checkers and POS taggers are NOT used in the language pipeline, and, based on practical experience, they actually make things worse. That is, they make suggestions and provide tags that are misleading or wrong, and lower the quality of the results. It turns out that English grammar provides tight constraints on what is possible, and those constraints are much tighter (and have higher accuracy/recall/precision) than single-word taggers.

I assume that large statistical datasets using multi-word phrases are also possible. (I presume that the excruciatingly annoying Grammarly uses such a system.) Such statistical systems might currently do as well as or better than grammar, since, it turns out, writing a complete grammar is impossible; there is always yet one more, extremely rare exception to every rule. Roughly speaking, grammar rules have a Zipfian distribution.

=========

"A sat solver and an okvs" -- we've played with SAT. Basically, the SAT solvers are slower. They used to be sometimes-faster, but Amir's work fixed that. I don't know what an OKVS is.

Please realize that the grammar rules have NOTHING AT ALL to do with English, or any other natural language -- it's a generic structural property that applies to graphs. This includes biology, biochemistry, and 3D conformance structures. For example, zebrafish provide a good way of studying immunoglobulin, and one can compare mutual information to Ising models, to mechanical shape conformance. Although it looks just like link-grammar, there are NO "words" in here! You could maybe use a SAT solver by trying different levels of a "potential" for interaction.

I've started work on a "generic" grammar, one that can handle both human language and also biology, here: https://github.com/opencog/generate It's at version 0.1.0 ... it's in C++, but Anton is toying with the idea of porting it to Java.
(he wants to run it on cell phones)

I had wanted to/hoped I could recycle some of the LG code base for this more-generic parsing/generation system, but I found that too hard, and a clean-room implementation was easier. That said, the algorithms are difficult, and it's version 0.1.0 because there are obvious avenues for improvements/enhancements.

-- Linas

p.s. LG is written in C, and so it can only use those spelling-guessers that have a C API. The current choices are aspell and hunspell; once upon a time, "hunspell" was "better", but it seems to no longer be maintained. Again -- aspell is used to provide suggestions, and parsing determines which of these suggestions is correct (if any).

The dictionaries themselves also have a handful of corrections built in. For example:

   rather_then.#rather_than: [rather_than]bad-spelling;
   there.#their: [[their.p]0.65]bad-spelling;

where the 0.65 is (minus) the log-likelihood. Accents are handled similarly:

   % Bad German accent
   vas.#was-v-d: [[was.v-d]0.05]bad-spelling;
   vas.#what: [[what]0.05]colloquial;
   das.#this-p: [[this.p]0.05]colloquial;
   das.#this-d: [[this.d]0.05]colloquial;

Also:

   % Initial unstressed syllable.
   'Cause.#because 'Fore.#before 'Fraid.#afraid 'Gainst.#against

   % Poetic contractions; Shakespearean contractions
   'r.#our ’r.#our: [[our]0.5]colloquial;
   e'en.#even: [even.e]colloquial;
   e'er.#ever: [ever]colloquial;
   ha'.#have ha’.#have: [have.v]colloquial;
   heav’n.#heaven: [heaven.s]colloquial;
   o'.#of: [of]colloquial;
   o'.#on: [on]colloquial;
   o'er.#over: [over]colloquial;

There are also some interesting games to be played with phonetics... but these remain mostly unexplored. There are also regex rules for the automatic recognition of unknown words, for Latin terms, and for micro-biology/biochemistry terminology. Regex works, because these tend to have a very regular structure.
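As a quick illustration of how those costs behave -- assuming, as stated above, that cost = minus the natural-log likelihood, so lower cost means more likely -- the entries can be inverted back to likelihoods and candidates ranked by cost. The numbers come from the examples above; the ranking code itself is illustrative, not LG's own:

```python
import math

def cost_to_likelihood(cost):
    """Invert cost = -ln(p): a cost of 0.65 is likelihood ~0.52."""
    return math.exp(-cost)

# candidate correction -> dictionary cost (from the entries above)
corrections = {
    "their.p": 0.65,   # there.#their
    "our": 0.5,        # 'r.#our
    "was.v-d": 0.05,   # vas.#was-v-d
}

# Rank candidates: smallest cost (highest likelihood) first.
ranked = sorted(corrections, key=corrections.get)

for word in ranked:
    p = cost_to_likelihood(corrections[word])
    print(f"{word}: cost={corrections[word]}, likelihood~{p:.2f}")
```

Costs compose by addition (since likelihoods multiply), which is why the parser can simply prefer the lowest-total-cost linkage.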
-- Linas

On Tue, Jun 30, 2020 at 6:08 AM Amirouche Boubekki <[email protected]> wrote:
>
> This is not strictly related to opencog but might come in useful if you want
> to use it as part of an NLP / NLU pipeline where you need to spell check
> and link a given text to a knowledge base.
>
> So the idea is that you have a text where there might be spelling mistakes.
> The easiest option would be to use an existing spell checker like hunspell
> / aspell / ispell. The problem with that approach is that anytime you add
> items to the knowledge base, you need to update the spell checker
> dictionary. My idea is to rely on a single source-of-truth database that I
> can drive from Python or Scheme.
>
> It seems the most-used spell checker in Python is fuzzywuzzy. I
> tried to use it, and here are a few results with timings. As far as I
> understand, fuzzywuzzy will not compile or preprocess or index the
> "choices" before guessing a match, which leads to a very big run time:
>
> $ python fw.py data/conceptnet-assertions-5.7.0.english-words-to-concept.tsv 10 resaerch
>
> ('öres', 90)
> ('erc', 90)
> ('e', 90)
> ('rch', 90)
> ('c', 90)
> ('c̄', 90)
> ('sae', 90)
> ('sé', 90)
> ('öre', 90)
> ('re', 90)
>
> 26.097001791000366
>
> In the above query, the e and a are swapped, and fuzzywuzzy fails to find
> anything even remotely similar. Mind that the last line is the run time in
> seconds.
>
> $ python fw.py data/conceptnet-assertions-5.7.0.english-words-to-concept.tsv 10 reserch
>
> ('research', 93)
> ('c̄', 90)
> ('öre', 90)
> ('rc', 90)
> ('ré', 90)
> ('ser', 90)
> ('rese', 90)
> ('re', 90)
> ('ch', 90)
> ('öres', 90)
>
> 26.26053023338318
>
> $ python fw.py data/conceptnet-assertions-5.7.0.english-words-to-concept.tsv 10 research
>
> ('research', 100)
> ('researchy', 94)
> ('ré', 90)
> ('sear', 90)
> ('rê', 90)
> ('öres', 90)
> ('ar', 90)
> ('nonresearcher', 90)
> ('c@', 90)
> ('unresearched', 90)
>
> 26.261364459991455
>
> As you can see, the run time is very large, and it will become larger
> over time as the KB grows with more words.
>
> To help with that task, I created a hash, in the spirit of simhash, that
> preserves similarity in the prefix of the hash, so that it is easy to query in
> an Ordered Key-Value Store (OKVS). Here are the same queries using that
> algorithm:
>
> $ python fuzz.py query 10 resaerch
> * most similar according to bbk fuzzbuzz
> ** research -2
> 0.011413335800170898
>
> $ python fuzz.py query 10 reserch
> * most similar according to bbk fuzzbuzz
> ** research -1
> ** resch -2
> ** resercher -2
> 0.011811494827270508
>
> $ python fuzz.py query 10 research
> * most similar according to bbk fuzzbuzz
> ** research 0
> ** researches -2
> ** researchee -2
> ** researcher -2
> 0.012357711791992188
>
> I tried similar queries over Wikidata labels; it gives good results in under
> 250 ms.
>
> As you can see, it is much, much faster, and the results seem more relevant.
> The algorithm can be found at: https://stackoverflow.com/a/58791875/140837
>
> I would be glad if someone could try that algorithm in their system.
>
> Similarly, I would be glad if you could give me pointers on how to evaluate
> it (precision / recall?) against a gold standard.
>
> This is one step toward the goal of re-implementing link-grammar using only a
> SAT solver and an OKVS.
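The general pattern described in the quoted message -- a key whose prefix preserves similarity, stored in an ordered key space, so fuzzy queries become a prefix range scan instead of a full scan -- can be sketched roughly. This is NOT the linked "bbk fuzzbuzz" algorithm: a much cruder key (the word's letters, sorted) stands in for the similarity-preserving hash, and a sorted Python list with bisect stands in for the OKVS:

```python
import bisect

# Toy sketch of OKVS-style fuzzy lookup. Key = sorted letters of the
# word, followed by the word itself; words whose letters are merely
# transposed (e.g. "resaerch" / "research") share the same key prefix,
# so a range scan over that prefix finds them without scanning the
# whole vocabulary. A sorted list stands in for the ordered store.

def key(word):
    return "".join(sorted(word)) + "\x00" + word

def build_index(words):
    """The 'OKVS': a sorted list of keys."""
    return sorted(key(w) for w in words)

def query(index, misspelled):
    """Range scan over all entries sharing the query's key prefix."""
    prefix = "".join(sorted(misspelled))
    lo = bisect.bisect_left(index, prefix)
    hi = bisect.bisect_right(index, prefix + "\xff")
    return [entry.split("\x00", 1)[1] for entry in index[lo:hi]]

index = build_index(["research", "resch", "öre", "sear", "hammer"])
print(query(index, "resaerch"))  # -> ['research']
```

This crude key only catches transpositions; the real hash tolerates broader edit distances. But the access pattern is the point: the query cost is a logarithmic seek plus a short scan, not a pass over the whole key space, which is why the timings above drop from ~26 s to ~12 ms.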
>
> --
> You received this message because you are subscribed to the Google Groups
> "opencog" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/opencog/64c7b12c-e196-433a-a9e6-b622ff953ccen%40googlegroups.com

--
Verbogeny is one of the pleasurettes of a creatific thinkerizer.
        --Peter da Silva

--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/CAHrUA35qrkdUb6%3D%2BTD1EtrXsk%2B1S8MCSrs2fw_JHYuHfK0vtnA%40mail.gmail.com.
