Jack Krupansky created LUCENE-5785:
--------------------------------------
Summary: White space tokenizer has undocumented limit of 256 characters per token
Key: LUCENE-5785
URL: https://issues.apache.org/jira/browse/LUCENE-5785
Project: Lucene - Core
Issue Type: Improvement
Components: modules/analysis
Affects Versions: 4.8.1
Reporter: Jack Krupansky
Priority: Minor
The white space tokenizer breaks tokens at 256 characters, a hard-wired limit
inherited from the character tokenizer abstract class (CharTokenizer).
The limit of 256 is obviously fine for normal, natural-language text, but it is
excessively restrictive for semi-structured data.
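To illustrate the current behavior, here is a minimal sketch, assuming the
Lucene 4.8 analysis API (WhitespaceTokenizer, CharTermAttribute, Version). The
comment states only the rough expectation, since the exact split point comes
from the hard-wired buffer size described above:

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class WhitespaceTokenizerLimitDemo {
  public static void main(String[] args) throws Exception {
    // Build a single 300-character "word" containing no whitespace.
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 300; i++) {
      sb.append('a');
    }

    WhitespaceTokenizer tokenizer =
        new WhitespaceTokenizer(Version.LUCENE_48, new StringReader(sb.toString()));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);

    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      // The single 300-character word comes back as two tokens: the character
      // tokenizer silently breaks it at the limit described above.
      System.out.println(term.length());
    }
    tokenizer.end();
    tokenizer.close();
  }
}
{code}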
1. Document the current limit in the Javadoc for the character tokenizer. Add a
note to any derived tokenizers (such as the white space tokenizer) that token
size is limited as per the character tokenizer.
2. Add a setMaxTokenLength method to the character tokenizer, a la the
standard tokenizer, so that an application can control the limit. This should
probably be added to the character tokenizer abstract class so that derived
tokenizer classes inherit it (see the API sketch after this list).
3. Disallow a token size limit of 0.
4. A limit of -1 would mean no limit.
5. Add a "token limit mode" method - "skip" (what the standard tokenizer does),
"break" (current behavior of the white space tokenizer and its derived
tokenizers), and "trim" (what I think a lot of people might expect.)
6. I'm not sure whether the current behavior of the character tokenizer
("break" mode) should be changed to match the standard tokenizer ("skip" mode)
or to "trim" mode, which is my preference and likely what most people would
expect.
7. Add matching attributes to the tokenizer factories for Solr, including the
Solr XML examples in their javadoc (see the schema sketch after this list).
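To make items 2 through 6 concrete, here is a purely hypothetical sketch of the
proposed API. The class name, the enum, the method names, and the defaults
below are illustrative assumptions for discussion, not existing Lucene API; in
a real patch they would live on the character tokenizer abstract class so that
the white space tokenizer and the other derived tokenizers inherit them:

{code:java}
// Hypothetical, self-contained illustration of the proposed additions
// (items 2-6). Nothing here exists in Lucene 4.8; the names and defaults
// are assumptions for discussion only.
public class ProposedTokenLimitApi {

  /** Proposed: how a tokenizer handles a token that reaches the maximum length. */
  public enum TokenLimitMode {
    SKIP,   // discard the over-long token entirely (what the standard tokenizer does)
    BREAK,  // split it into multiple tokens at the limit (current character tokenizer behavior)
    TRIM    // keep the first maxTokenLength characters and drop the remainder
  }

  private int maxTokenLength = 256;                        // today's hard-wired limit
  private TokenLimitMode limitMode = TokenLimitMode.BREAK; // today's default behavior

  /**
   * Proposed: set the maximum token length, a la the standard tokenizer.
   * A value of -1 means "no limit" (item 4); 0 is disallowed (item 3).
   */
  public void setMaxTokenLength(int length) {
    if (length == 0) {
      throw new IllegalArgumentException("maxTokenLength must not be 0");
    }
    this.maxTokenLength = length;
  }

  /** Proposed: choose what happens when a token reaches the limit (item 5). */
  public void setTokenLimitMode(TokenLimitMode mode) {
    this.limitMode = mode;
  }
}
{code}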
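For item 7, a hypothetical Solr schema fragment showing how matching attributes
might be exposed. The maxTokenLength and tokenLimitMode attributes are proposed
names, not parameters that WhitespaceTokenizerFactory accepts today:

{code:xml}
<!-- Hypothetical: maxTokenLength and tokenLimitMode are proposed attribute
     names, not existing WhitespaceTokenizerFactory parameters. -->
<fieldType name="text_ws_long" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"
               maxTokenLength="4096"
               tokenLimitMode="trim"/>
  </analyzer>
</fieldType>
{code}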
At a minimum, this issue should address the documentation problem.