Kevin Lawson created LUCENE-4947:
------------------------------------
Summary: Java implementation (and improvement) of Levenshtein &
associated lexicon automata
Key: LUCENE-4947
URL: https://issues.apache.org/jira/browse/LUCENE-4947
Project: Lucene - Core
Issue Type: Improvement
Affects Versions: 4.2.1, 4.2, 4.1, 4.0, 4.0-BETA, 4.0-ALPHA
Reporter: Kevin Lawson
I was encouraged by Mike McCandless to open an issue concerning this after I
contacted him privately about it. Thanks Mike!
I'd like to submit my Java implementation of the Levenshtein Automaton as a
homogeneous replacement for the current heterogeneous, multi-component
implementation in Lucene.
Benefits of upgrading include:
- Reduced code complexity
- Better performance from components that were previously implemented in Python
- Support for on-the-fly dictionary-automaton manipulation (if you wish to use
my dictionary-automaton implementation)
The code for all the components is well structured, easy to follow, and
extensively commented. It has also been fully tested for correct functionality
and performance.
The Levenshtein automaton implementation (along with the required MDAG
reference) can be found in my LevenshteinAutomaton Java library here:
https://github.com/klawson88/LevenshteinAutomaton
The minimalistic directed acyclic graph (MDAG) which the automaton code uses to
store and step through word sets can be found here:
https://github.com/klawson88/MDAG
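To illustrate what "stepping through a word set" under an edit-distance bound means, here is a minimal sketch. It is not the API of either linked library: it assumes a plain trie in place of the MDAG and a dynamic-programming row in place of the precomputed automaton state, and every class and method name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only: a plain trie walked in lockstep with an edit-distance row. */
public class FuzzyTrieSketch {

    static final class Node {
        final TreeMap<Character, Node> edges = new TreeMap<>();
        boolean accept; // true if a stored word ends at this node
    }

    private final Node root = new Node();

    public void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.edges.computeIfAbsent(c, k -> new Node());
        }
        node.accept = true;
    }

    /** Returns every stored word within maxEdits insertions, deletions or substitutions of query. */
    public List<String> search(String query, int maxEdits) {
        int[] row = new int[query.length() + 1];
        for (int j = 0; j < row.length; j++) {
            row[j] = j; // distance from the empty prefix to query[0..j)
        }
        List<String> hits = new ArrayList<>();
        walk(root, new StringBuilder(), row, query, maxEdits, hits);
        return hits;
    }

    private static void walk(Node node, StringBuilder prefix, int[] row,
                             String query, int maxEdits, List<String> hits) {
        if (node.accept && row[query.length()] <= maxEdits) {
            hits.add(prefix.toString());
        }
        for (Map.Entry<Character, Node> e : node.edges.entrySet()) {
            char c = e.getKey();
            int[] next = new int[row.length];
            next[0] = row[0] + 1;
            int min = next[0];
            for (int j = 1; j < row.length; j++) {
                int cost = query.charAt(j - 1) == c ? 0 : 1;
                next[j] = Math.min(Math.min(next[j - 1] + 1, row[j] + 1), row[j - 1] + cost);
                min = Math.min(min, next[j]);
            }
            // Prune: if every entry exceeds the bound, no extension of this prefix can match.
            if (min <= maxEdits) {
                prefix.append(c);
                walk(e.getValue(), prefix, next, query, maxEdits, hits);
                prefix.setLength(prefix.length() - 1);
            }
        }
    }

    public static void main(String[] args) {
        FuzzyTrieSketch dict = new FuzzyTrieSketch();
        for (String w : new String[] {"stop", "tap", "taps", "top", "tops"}) {
            dict.add(w);
        }
        System.out.println(dict.search("top", 1));
    }
}
```

A real Levenshtein automaton replaces the per-character row computation with precomputed state transitions, but the pruning idea (abandon a dictionary branch as soon as no state within the bound remains) is the same.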
*Transpositions aren't currently implemented. I hope that the comment-filled,
editing-friendly code, combined with the fact that the section in the Mihov
paper detailing transpositions is only 2 pages long, makes adding the
functionality trivial.
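Whoever adds transposition support may find a brute-force reference distance useful as a test oracle for the automaton. The sketch below is not taken from the linked library; the class and method names are hypothetical. It assumes one common definition of "distance with transpositions" (the restricted, or optimal string alignment, variant, in which swapping two adjacent characters costs one edit). Whether that matches the paper's definition exactly should be checked before relying on it.

```java
/** Hypothetical test oracle: edit distance with optional adjacent-transposition support. */
public class TranspositionDistanceSketch {

    /**
     * Plain Levenshtein distance when allowTranspositions is false; the
     * restricted (optimal string alignment) variant when it is true.
     */
    public static int distance(String a, String b, boolean allowTranspositions) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i; // delete all of a[0..i)
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j; // insert all of b[0..j)
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                    d[i - 1][j - 1] + cost);
                // Extra case: the last two characters of each prefix are swapped.
                if (allowTranspositions && i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    best = Math.min(best, d[i - 2][j - 2] + 1);
                }
                d[i][j] = best;
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("form", "from", false));
        System.out.println(distance("form", "from", true));
    }
}
```

A transposition-aware automaton built for a query word and a bound n should accept exactly the words for which this function returns a value of at most n.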
*As a result of support for on-the-fly manipulation, the MDAG
(dictionary-automaton) creation process incurs a slight speed penalty. To get
the best of both worlds, I'd recommend adding a constructor that takes only
sorted input. The complete, easy-to-follow pseudo-code for this simple
procedure can be found in the first article linked under the references
section of the MDAG repository.
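For readers unfamiliar with sorted-input construction, the general idea can be sketched as follows. This is a minimal illustration of the well-known incremental technique for building a minimal acyclic automaton from lexicographically sorted words, not code from the MDAG repository and not necessarily identical to the pseudo-code referenced above. All names are hypothetical, and the static id counter is not thread-safe.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: minimal acyclic automaton built from sorted input. */
public class SortedDawgSketch {

    public static final class Node {
        private static int nextId = 0;
        final int id = nextId++; // used only to build signatures
        final TreeMap<Character, Node> edges = new TreeMap<>();
        boolean accept;

        /** Identifies equivalent nodes, assuming all children are already canonical. */
        String signature() {
            StringBuilder sb = new StringBuilder(accept ? "1" : "0");
            for (Map.Entry<Character, Node> e : edges.entrySet()) {
                sb.append(e.getKey()).append(':').append(e.getValue().id).append(';');
            }
            return sb.toString();
        }
    }

    /** An edge on the path of the previous word that has not been minimized yet. */
    private static final class Pending {
        final Node parent; final char label; final Node child;
        Pending(Node parent, char label, Node child) {
            this.parent = parent; this.label = label; this.child = child;
        }
    }

    public static Node build(List<String> sortedWords) {
        Node root = new Node();
        Map<String, Node> register = new HashMap<>();
        List<Pending> pending = new ArrayList<>();
        String previous = "";
        for (String word : sortedWords) {
            if (word.compareTo(previous) < 0) {
                throw new IllegalArgumentException("input must be sorted: " + word);
            }
            int common = 0;
            int limit = Math.min(word.length(), previous.length());
            while (common < limit && word.charAt(common) == previous.charAt(common)) {
                common++;
            }
            // The previous word's suffix beyond the shared prefix can never change again.
            minimize(pending, register, common);
            Node node = pending.isEmpty() ? root : pending.get(pending.size() - 1).child;
            for (int i = common; i < word.length(); i++) {
                Node next = new Node();
                node.edges.put(word.charAt(i), next);
                pending.add(new Pending(node, word.charAt(i), next));
                node = next;
            }
            node.accept = true;
            previous = word;
        }
        minimize(pending, register, 0);
        return root;
    }

    /** Replaces pending children, deepest first, with equivalent registered nodes. */
    private static void minimize(List<Pending> pending, Map<String, Node> register, int downTo) {
        for (int i = pending.size() - 1; i >= downTo; i--) {
            Pending p = pending.remove(i);
            Node existing = register.putIfAbsent(p.child.signature(), p.child);
            if (existing != null) {
                p.parent.edges.put(p.label, existing);
            }
        }
    }

    public static boolean contains(Node root, String word) {
        Node node = root;
        for (int i = 0; i < word.length() && node != null; i++) {
            node = node.edges.get(word.charAt(i));
        }
        return node != null && node.accept;
    }

    /** Counts distinct reachable nodes, including the root. */
    public static int countNodes(Node root) {
        Map<Node, Boolean> seen = new IdentityHashMap<>();
        collect(root, seen);
        return seen.size();
    }

    private static void collect(Node node, Map<Node, Boolean> seen) {
        if (seen.put(node, Boolean.TRUE) != null) {
            return;
        }
        for (Node child : node.edges.values()) {
            collect(child, seen);
        }
    }
}
```

Because the input is sorted, every node outside the path of the most recent word is final and can be merged immediately, so the builder never has to revisit or clone existing states. That is where the speed advantage over on-the-fly insertion of unsorted words comes from.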