Improve the Edge/NGramTokenizer/Filters
---------------------------------------
Key: LUCENE-3907
URL: https://issues.apache.org/jira/browse/LUCENE-3907
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Fix For: 4.0
Our ngram tokenizers/filters could use some love. EG, they output ngrams in
multiple passes, instead of "stacked", which messes up offsets/positions and
requires too much buffering (can hit OOME for long tokens). The tokenizers clip
tokens at 1024 chars, but the token filters don't. They also split surrogate
pairs incorrectly.
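As a rough illustration of the surrogate-pair point (not the actual Lucene implementation, just a minimal standalone sketch): stepping through the input by whole code points rather than by chars guarantees a supplementary character such as an emoji is never cut in half. The class and method names below are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointNGrams {
    // Emit all n-grams of length minGram..maxGram, advancing by whole
    // code points so surrogate pairs are never split mid-character.
    // All grams for a given start offset are emitted before moving on
    // ("stacked" order), rather than one full pass per gram size.
    static List<String> ngrams(String text, int minGram, int maxGram) {
        // Record the char index of every code point boundary.
        List<Integer> bounds = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            bounds.add(i);
            i += Character.charCount(text.codePointAt(i));
        }
        bounds.add(text.length());

        List<String> out = new ArrayList<>();
        int cpLen = bounds.size() - 1; // length in code points, not chars
        for (int start = 0; start < cpLen; start++) {
            for (int n = minGram; n <= maxGram && start + n <= cpLen; n++) {
                out.add(text.substring(bounds.get(start), bounds.get(start + n)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "a" + U+1F600 (a surrogate pair in UTF-16) + "b": 4 chars, 3 code points.
        String s = "a" + new String(Character.toChars(0x1F600)) + "b";
        System.out.println(ngrams(s, 1, 2)); // each gram holds whole code points
    }
}
```

A char-indexed loop over the same input would emit grams containing a lone high or low surrogate; indexing by code point boundaries is what avoids that.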