This is not strictly related to opencog, but it might come in useful if you
want to use it as part of an NLP / NLU pipeline where you need to spell-check
a given text and link it to a knowledge base.
So the idea is that you have a text where there might be spelling mistakes.
The easiest option would be to use an existing spell checker like hunspell
/ aspell / ispell. The problem with that approach is that every time you add
items to the knowledge base you need to update the spell checker's
dictionary. My idea is to rely on a single source-of-truth database that I
can drive from Python or Scheme.
It seems the most widely used fuzzy string matcher in Python is fuzzywuzzy. I
tried it, and here are a few results with timings. As far as I
understand, fuzzywuzzy does not compile, preprocess, or index the
"choices" before guessing a match, which leads to a very long run time:
$ python fw.py data/conceptnet-assertions-5.7.0.english-words-to-concept.tsv
10 resaerch
('öres', 90)
('erc', 90)
('e', 90)
('rch', 90)
('c', 90)
('c̄', 90)
('sae', 90)
('sé', 90)
('öre', 90)
('re', 90)
26.097001791000366
In the above query the e and a are swapped, and fuzzywuzzy fails to find
anything even remotely similar. Note that the last line of each run is the
run time in seconds.
$ python fw.py data/conceptnet-assertions-5.7.0.english-words-to-concept.tsv
10 reserch
('research', 93)
('c̄', 90)
('öre', 90)
('rc', 90)
('ré', 90)
('ser', 90)
('rese', 90)
('re', 90)
('ch', 90)
('öres', 90)
26.26053023338318
$ python fw.py data/conceptnet-assertions-5.7.0.english-words-to-concept.tsv
10 research
('research', 100)
('researchy', 94)
('ré', 90)
('sear', 90)
('rê', 90)
('öres', 90)
('ar', 90)
('nonresearcher', 90)
('c@', 90)
('unresearched', 90)
26.261364459991455
As you can see the run time is very long, and it will only grow as the KB
gains more words.
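To make the scaling problem concrete, here is a sketch of that brute-force
approach using only the stdlib (difflib's SequenceMatcher stands in for
fuzzywuzzy's scoring; the exact ratios differ, but the shape is the same:
every query rescans the whole vocabulary, so cost is linear in KB size):

```python
# Brute-force fuzzy matching, stdlib only. There is no index or
# preprocessing step: each query scores every choice from scratch.
from difflib import SequenceMatcher


def top_matches(query, choices, limit=10):
    """Score every choice against the query and keep the best `limit`."""
    scored = ((choice, SequenceMatcher(None, query, choice).ratio())
              for choice in choices)
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


# Tiny toy vocabulary; imagine millions of KB labels instead.
words = ["research", "researcher", "reserve", "search", "öre"]
print(top_matches("reserch", words, limit=3))
```

With millions of labels this loop is exactly the 26-second scan shown above.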
To help with that task I created a hash, in the spirit of simhash, that
preserves similarity in the prefix of the hash so that it is easy to query in
an Ordered Key-Value Store (OKVS). Here are the same queries using that
algorithm:
$ python fuzz.py query 10 resaerch
* most similar according to bbk fuzzbuzz
** research -2
0.011413335800170898
$ python fuzz.py query 10 reserch
* most similar according to bbk fuzzbuzz
** research -1
** resch -2
** resercher -2
0.011811494827270508
$ python fuzz.py query 10 research
* most similar according to bbk fuzzbuzz
** research 0
** researches -2
** researchee -2
** researcher -2
0.012357711791992188
I tried similar queries over Wikidata labels; it gives good results in under
250 ms.
As you can see it is much, much faster, and the results seem more relevant.
The algorithm can be found at: https://stackoverflow.com/a/58791875/140837
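For readers unfamiliar with the idea, here is a generic simhash-style sketch
(this is NOT the bbk fuzzbuzz algorithm from the link above, just the family
of techniques it belongs to): hash each character trigram, vote per bit
column, and encode the result as a fixed-width key so that similar words end
up near each other in an ordered keyspace:

```python
# Generic simhash-style locality-preserving hash. Similar inputs tend
# to share many hash bits, so sorting by the hash clusters
# near-duplicates together -- which is what makes prefix / range
# queries in an Ordered Key-Value Store possible.
import hashlib

HASH_BITS = 64


def trigrams(word):
    """Character trigrams of the padded word, e.g. '$re', 'res', ..."""
    padded = "$" + word + "$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def simhash(word):
    """64-bit simhash: each trigram votes +1/-1 on every bit column."""
    counts = [0] * HASH_BITS
    for gram in trigrams(word):
        digest = int.from_bytes(
            hashlib.blake2b(gram.encode("utf-8"), digest_size=8).digest(),
            "big")
        for bit in range(HASH_BITS):
            counts[bit] += 1 if digest & (1 << bit) else -1
    value = 0
    for bit, count in enumerate(counts):
        if count > 0:
            value |= 1 << bit
    return value


def okvs_key(word):
    # Fixed-width big-endian bytes: lexicographic order of the key then
    # equals numeric order of the hash, as an OKVS range scan requires.
    return simhash(word).to_bytes(8, "big")
```

Again, this is only an illustration of the general idea; the linked Stack
Overflow answer has the actual algorithm and key layout I use.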
I would be glad if someone could try that algorithm in their system.
Similarly, I would be glad for pointers on how to evaluate it (precision /
recall?) against a gold standard.
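In case it helps frame the evaluation question: a common setup is a gold
standard of (misspelling, intended word) pairs, and then accuracy@1 and
recall@k over the checker's suggestion lists. A sketch (the `suggest`
callable and the tiny gold list are made up here purely for illustration):

```python
# Evaluation harness sketch: `suggest` is any spell checker under test
# (a callable mapping a misspelling to a ranked list of suggestions);
# `gold` is a list of (misspelling, intended correction) pairs.

def evaluate(suggest, gold, k=10):
    """Return (accuracy@1, recall@k) over the gold pairs."""
    top1_hits = 0
    topk_hits = 0
    for misspelling, correction in gold:
        suggestions = suggest(misspelling)[:k]
        if suggestions and suggestions[0] == correction:
            top1_hits += 1
        if correction in suggestions:
            topk_hits += 1
    total = len(gold)
    return top1_hits / total, topk_hits / total


# Toy example: a fake checker that only ever suggests one word.
gold = [("reserch", "research"), ("resaerch", "research"),
        ("serch", "search")]
fake = lambda word: ["research"]
print(evaluate(fake, gold))  # accuracy@1 = 2/3, recall@10 = 2/3
```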
This is one step toward the goal of re-implementing link-grammar using only a
SAT solver and an OKVS.