Hi Amirouche,

There are other, far more sophisticated ways of doing spell checking. The best way is to do context-dependent checking... and link-grammar provides extremely precise context. For example, "I gave him teh hammer" -- the LG spelling guesser offers up "the", "then", "ten" as possible fixes. However, only one of these choices leads to a grammatically correct sentence. Thus, the other spelling guesses can be discarded, because they lead to grammatical nonsense.
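The filtering step described above can be sketched in a few lines of Python. In the real system the link-grammar parser decides which candidates parse; here a hypothetical `is_grammatical` stand-in (a lookup against a known-good sentence) replaces the parser, purely to show the shape of the candidate-filtering logic:

```python
# Toy sketch of context-dependent spell checking: take the spelling
# guesser's candidates and keep only those that yield a grammatical
# sentence. `is_grammatical` is a hypothetical stand-in for a real
# parser call; everything else mirrors the "teh hammer" example.

CANDIDATES = ["the", "then", "ten"]     # guesser's fixes for "teh"
SENTENCE = "I gave him {} hammer"

# Stand-in for "does this sentence parse?" -- a real system would
# invoke the link-grammar parser here instead of a set lookup.
GOOD = {"I gave him the hammer"}

def is_grammatical(sentence):
    return sentence in GOOD

def filter_guesses(template, guesses):
    """Keep only guesses that produce a parseable sentence."""
    return [g for g in guesses if is_grammatical(template.format(g))]

print(filter_guesses(SENTENCE, CANDIDATES))  # -> ['the']
```

Only "the" survives; "then" and "ten" are discarded because the sentences they produce do not parse.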
Because of this, spelling checkers and POS taggers are NOT used in the language pipeline, and, based on practical experience, they actually make things worse. That is, they make suggestions and provide tags that are misleading or wrong, and lower the quality of the results. It turns out that English grammar provides tight constraints on what is possible, and those constraints are much tighter (and have higher accuracy/recall/precision) than single-word taggers.

I assume that large statistical datasets using multi-word phrases are also possible. (I presume that the excruciatingly annoying Grammarly uses such a system.) Such statistical systems might currently do as well as or better than grammar, since, it turns out, writing a complete grammar is impossible; there is always yet one more, extremely rare exception to every rule. Roughly speaking, grammar rules have a Zipfian distribution.

=========

"A sat solver and an okvs" -- we've played with SAT. Basically, the SAT solvers are slower. They used to be sometimes-faster, but Amir's work fixed that. I don't know what an OKVS is.

Please realize that the grammar rules have NOTHING AT ALL to do with English, or any other natural language -- it's a generic structural property that applies to graphs. This includes biology, biochemistry, and 3D conformance structures. For example, zebrafish provide a good way of studying immunoglobulin, and one can compare mutual information to Ising models, to mechanical shape conformance. Although it looks just like link-grammar, there are NO "words" in here! You could maybe use a SAT solver by trying different levels of a "potential" for interaction.

I've started work on a "generic" grammar, one that can handle both human language and also biology, here: https://github.com/opencog/generate It's at version 0.1.0 ... it's in C++, but Anton is toying with the idea of porting it to Java.
(he wants to run it on cell phones)

I had wanted to/hoped I could recycle some of the LG code base for this more-generic parsing/generation system, but I found that too hard, and a clean-room implementation was easier. That said, the algorithms are difficult, and it's version 0.1.0 because there are obvious avenues for improvements/enhancements.

-- Linas

p.s. LG is written in C, and so it can only use those spelling-guessers that have a C API. The current choices are aspell and hunspell; once upon a time, "hunspell" was "better", but it seems to no longer be maintained. Again -- aspell is used to provide suggestions, and parsing determines which of these suggestions is correct (if any).

The dictionaries themselves also have a handful of corrections built in. For example:

   rather_then.#rather_than: [rather_than]bad-spelling;
   there.#their: [[their.p]0.65]bad-spelling;

where the 0.65 is (minus) the log-likelihood. Accents are handled similarly:

   % Bad German accent
   vas.#was-v-d: [[was.v-d]0.05]bad-spelling;
   vas.#what: [[what]0.05]colloquial;
   das.#this-p: [[this.p]0.05]colloquial;
   das.#this-d: [[this.d]0.05]colloquial;

Also:

   % Initial unstressed syllable.
   'Cause.#because 'Fore.#before 'Fraid.#afraid 'Gainst.#against

   % Poetic contractions; Shakespearean contractions
   'r.#our ’r.#our: [[our]0.5]colloquial;
   e'en.#even: [even.e]colloquial;
   e'er.#ever: [ever]colloquial;
   ha'.#have ha’.#have: [have.v]colloquial;
   heav’n.#heaven: [heaven.s]colloquial;
   o'.#of: [of]colloquial;
   o'.#on: [on]colloquial;
   o'er.#over: [over]colloquial;

There are also some interesting games to be played with phonetics... but these remain mostly unexplored. There are also regex rules for the automatic recognition of unknown words, for Latin terms, and for micro-biology/biochemistry terminology. Regex works, because these tend to have a very regular structure.
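As a quick illustration of how those costs behave -- assuming, as stated above, that cost = minus the natural-log likelihood, so lower cost means more likely -- the entries can be inverted back to likelihoods and candidates ranked by cost. The numbers come from the examples above; the ranking code itself is illustrative, not LG's own:

```python
import math

def cost_to_likelihood(cost):
    """Invert cost = -ln(p): a cost of 0.65 is likelihood ~0.52."""
    return math.exp(-cost)

# candidate correction -> dictionary cost (from the entries above)
corrections = {
    "their.p": 0.65,   # there.#their
    "our": 0.5,        # 'r.#our
    "was.v-d": 0.05,   # vas.#was-v-d
}

# Rank candidates: smallest cost (highest likelihood) first.
ranked = sorted(corrections, key=corrections.get)

for word in ranked:
    p = cost_to_likelihood(corrections[word])
    print(f"{word}: cost={corrections[word]}, likelihood~{p:.2f}")
```

Costs compose by addition (since likelihoods multiply), which is why the parser can simply prefer the lowest-total-cost linkage.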
-- Linas

On Tue, Jun 30, 2020 at 6:08 AM Amirouche Boubekki <[email protected]> wrote:
>
> This is not strictly related to opencog but might come in useful if you want
> to use it as part of an NLP / NLU pipeline where you need to spell check
> and link a given text to a knowledge base.
>
> So the idea is that you have a text where there might be spelling mistakes.
> The easiest option would be to use an existing spell checker like hunspell
> / aspell / ispell. The problem with that approach is that anytime you add
> items to the knowledge base, you need to update the spell checker
> dictionary. My idea is to rely on a single source-of-truth database that I
> can drive from Python or Scheme.
>
> It seems the most-used spell checker in Python is fuzzywuzzy. I
> tried to use it, and here are a few results with timings. As far as I
> understand, fuzzywuzzy will not compile or preprocess or index the
> "choices" before guessing a match, which leads to a very big run time:
>
> $ python fw.py data/conceptnet-assertions-5.7.0.english-words-to-concept.tsv 10 resaerch
>
> ('öres', 90)
> ('erc', 90)
> ('e', 90)
> ('rch', 90)
> ('c', 90)
> ('c̄', 90)
> ('sae', 90)
> ('sé', 90)
> ('öre', 90)
> ('re', 90)
>
> 26.097001791000366
>
> In the above query, the e and a are swapped, and fuzzywuzzy fails to find
> anything even remotely similar. Mind that the last line is the run time in
> seconds.
>
> $ python fw.py data/conceptnet-assertions-5.7.0.english-words-to-concept.tsv 10 reserch
>
> ('research', 93)
> ('c̄', 90)
> ('öre', 90)
> ('rc', 90)
> ('ré', 90)
> ('ser', 90)
> ('rese', 90)
> ('re', 90)
> ('ch', 90)
> ('öres', 90)
>
> 26.26053023338318
>
> $ python fw.py data/conceptnet-assertions-5.7.0.english-words-to-concept.tsv 10 research
>
> ('research', 100)
> ('researchy', 94)
> ('ré', 90)
> ('sear', 90)
> ('rê', 90)
> ('öres', 90)
> ('ar', 90)
> ('nonresearcher', 90)
> ('c@', 90)
> ('unresearched', 90)
>
> 26.261364459991455
>
> As you can see, the run time is very large, and it will become larger
> over time as the KB grows with more words.
>
> To help with that task, I created a hash, in the spirit of simhash, that
> preserves similarity in the prefix of the hash, so that it is easy to query in
> an Ordered Key-Value Store (OKVS). Here are the same queries using that
> algorithm:
>
> $ python fuzz.py query 10 resaerch
> * most similar according to bbk fuzzbuzz
> ** research -2
> 0.011413335800170898
>
> $ python fuzz.py query 10 reserch
> * most similar according to bbk fuzzbuzz
> ** research -1
> ** resch -2
> ** resercher -2
> 0.011811494827270508
>
> $ python fuzz.py query 10 research
> * most similar according to bbk fuzzbuzz
> ** research 0
> ** researches -2
> ** researchee -2
> ** researcher -2
> 0.012357711791992188
>
> I tried similar queries over Wikidata labels; it gives good results in under
> 250 ms.
>
> As you can see, it is much, much faster, and the results seem more relevant.
> The algorithm can be found at: https://stackoverflow.com/a/58791875/140837
>
> I would be glad if someone could try that algorithm in their system.
>
> Similarly, I would be glad if you could give me pointers on how to evaluate
> it (precision / recall?) against a gold standard.
>
> This is one step toward the goal of re-implementing link-grammar using only a
> SAT solver and an OKVS.
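The general pattern described in the quoted message -- a key whose prefix preserves similarity, stored in an ordered key space, so fuzzy queries become a prefix range scan instead of a full scan -- can be sketched roughly. This is NOT the linked "bbk fuzzbuzz" algorithm: a much cruder key (the word's letters, sorted) stands in for the similarity-preserving hash, and a sorted Python list with bisect stands in for the OKVS:

```python
import bisect

# Toy sketch of OKVS-style fuzzy lookup. Key = sorted letters of the
# word, followed by the word itself; words whose letters are merely
# transposed (e.g. "resaerch" / "research") share the same key prefix,
# so a range scan over that prefix finds them without scanning the
# whole vocabulary. A sorted list stands in for the ordered store.

def key(word):
    return "".join(sorted(word)) + "\x00" + word

def build_index(words):
    """The 'OKVS': a sorted list of keys."""
    return sorted(key(w) for w in words)

def query(index, misspelled):
    """Range scan over all entries sharing the query's key prefix."""
    prefix = "".join(sorted(misspelled))
    lo = bisect.bisect_left(index, prefix)
    hi = bisect.bisect_right(index, prefix + "\xff")
    return [entry.split("\x00", 1)[1] for entry in index[lo:hi]]

index = build_index(["research", "resch", "öre", "sear", "hammer"])
print(query(index, "resaerch"))  # -> ['research']
```

This crude key only catches transpositions; the real hash tolerates broader edit distances. But the access pattern is the point: the query cost is a logarithmic seek plus a short scan, not a pass over the whole key space, which is why the timings above drop from ~26 s to ~12 ms.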
>
> --
> You received this message because you are subscribed to the Google Groups
> "opencog" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/opencog/64c7b12c-e196-433a-a9e6-b622ff953ccen%40googlegroups.com

--
Verbogeny is one of the pleasurettes of a creatific thinkerizer.
        --Peter da Silva

--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/CAHrUA35qrkdUb6%3D%2BTD1EtrXsk%2B1S8MCSrs2fw_JHYuHfK0vtnA%40mail.gmail.com.
