[
https://issues.apache.org/jira/browse/LUCENE-5785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043432#comment-14043432
]
Shawn Heisey edited comment on LUCENE-5785 at 6/25/14 1:24 PM:
---------------------------------------------------------------
A hard limit on the number of characters in a token is probably unavoidable.
Every tokenizer I've actually looked at has a limit; for some it's as high as
4096 characters. 256 characters seems REALLY small, even for an abstract base
class. I'd hope for 1024 as a minimum.
Here's a radical idea: Make the limit configurable for all tokenizers, and
expose that config option in the Solr schema.
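For illustration, here is roughly what that could look like in a Solr
schema.xml fieldType. The maxTokenLen attribute is hypothetical; no such
option exists today, this is just a sketch of the proposal:

    <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- maxTokenLen is a hypothetical attribute sketching the proposal;
             it does not exist in any released tokenizer factory -->
        <tokenizer class="solr.WhitespaceTokenizerFactory" maxTokenLen="1024"/>
      </analyzer>
    </fieldType>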
was (Author: elyograg):
A hard limit on the number of characters in a token is probably unavoidable.
Every tokenizer I've actually looked at has a limit. For some it's up to 4096
bytes. 256 bytes seems REALLY small, even for an abstract base class. I'd
hope for 1024 as a minimum.
Here's a radical idea: Make the limit configurable for all tokenizers, and
expose that config option in the Solr schema.
> White space tokenizer has undocumented limit of 256 characters per token
> ------------------------------------------------------------------------
>
> Key: LUCENE-5785
> URL: https://issues.apache.org/jira/browse/LUCENE-5785
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Affects Versions: 4.8.1
> Reporter: Jack Krupansky
> Priority: Minor
>
> The white space tokenizer breaks tokens at 256 characters, which is a
> hard-wired limit of the character tokenizer abstract class (CharTokenizer).
> The limit of 256 is obviously fine for normal, natural language text, but
> excessively restrictive for semi-structured data.
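> To reproduce this without indexing anything, run the tokenizer directly. A
> minimal sketch against the Lucene 4.8 API; the lengths in the comments
> assume the 256-character break described above:
>
>     import java.io.StringReader;
>
>     import org.apache.lucene.analysis.core.WhitespaceTokenizer;
>     import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>     import org.apache.lucene.util.Version;
>
>     public class TokenLimitDemo {
>       public static void main(String[] args) throws Exception {
>         // A single 300-character "word" containing no whitespace.
>         StringBuilder sb = new StringBuilder();
>         for (int i = 0; i < 300; i++) {
>           sb.append('a');
>         }
>         try (WhitespaceTokenizer tok = new WhitespaceTokenizer(
>             Version.LUCENE_48, new StringReader(sb.toString()))) {
>           CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
>           tok.reset();
>           while (tok.incrementToken()) {
>             // Prints 256 and then 44: the single 300-character token is
>             // silently broken at the hard-wired limit.
>             System.out.println(term.length());
>           }
>           tok.end();
>         }
>       }
>     }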
> 1. Document the current limit in the Javadoc for the character tokenizer. Add
> a note to any derived tokenizers (such as the white space tokenizer) that
> token size is limited as per the character tokenizer.
> 2. Add a setMaxTokenLength method, a la the standard tokenizer, so that an
> application can control the limit. It should probably live on the character
> tokenizer abstract class so that derived tokenizer classes inherit it (see
> the sketch at the end of this description).
> 3. Disallow a token size limit of 0.
> 4. A limit of -1 would mean no limit.
> 5. Add a "token limit mode" method with three settings: "skip" (what the
> standard tokenizer does), "break" (the current behavior of the character
> tokenizer and the tokenizers derived from it, such as the white space
> tokenizer), and "trim" (what I think a lot of people might expect).
> 6. Not sure whether to change the default behavior of the character
> tokenizer from "break" mode to "skip" mode (matching the standard
> tokenizer), or to "trim" mode, which is my preference and likely what people
> expect.
> 7. Add matching attributes to the tokenizer factories for Solr, and document
> them for use in the Solr schema XML.
> At a minimum, this issue should address the documentation problem.
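> To make items 2 through 6 concrete, here is a rough sketch of the proposed
> API on the character tokenizer. All names, the defaults, and the enum are
> hypothetical; none of this exists in the current code:
>
>     /** Hypothetical sketch for LUCENE-5785; nothing here exists yet. */
>     public abstract class ConfigurableCharTokenizer {
>
>       /** Item 5: what to do when a token exceeds the limit. */
>       public enum TokenLimitMode { SKIP, BREAK, TRIM }
>
>       private int maxTokenLength = 256;  // today's hard-wired value
>       private TokenLimitMode limitMode = TokenLimitMode.BREAK;
>
>       /** Items 2-4: -1 means no limit, 0 is disallowed. */
>       public void setMaxTokenLength(int length) {
>         if (length == 0 || length < -1) {
>           throw new IllegalArgumentException(
>               "max token length must be -1 (no limit) or positive: " + length);
>         }
>         this.maxTokenLength = length;
>       }
>
>       public void setTokenLimitMode(TokenLimitMode mode) {
>         this.limitMode = mode;
>       }
>
>       /** Would be consulted by incrementToken() as characters accumulate. */
>       protected boolean withinLimit(int length) {
>         return maxTokenLength == -1 || length < maxTokenLength;
>       }
>     }
>
> The sketch keeps "break" as the default only because that is the current
> behavior; whether the default should instead be "skip" or "trim" is item 6.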