[
https://issues.apache.org/jira/browse/LUCENE-5785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043432#comment-14043432
]
Shawn Heisey edited comment on LUCENE-5785 at 6/25/14 1:24 PM:
---------------------------------------------------------------
A hard limit on the number of characters in a token is probably unavoidable.
Every tokenizer I've actually looked at has a limit; for some it's as high as
4096 characters. 256 characters seems REALLY small, even for an abstract base
class. I'd hope for 1024 as a minimum.
Here's a radical idea: Make the limit configurable for all tokenizers, and
expose that config option in the Solr schema.
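For illustration, here is roughly what that could look like in a Solr
schema.xml fieldType. The maxTokenLen attribute is hypothetical; no such
option exists today, this is just a sketch of the proposal:

    <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- maxTokenLen is a hypothetical attribute sketching the proposal;
             it does not exist in any released tokenizer factory -->
        <tokenizer class="solr.WhitespaceTokenizerFactory" maxTokenLen="1024"/>
      </analyzer>
    </fieldType>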
was (Author: elyograg):
A hard limit on the number of characters in a token is probably unavoidable.
Every tokenizer I've actually looked at has a limit. For some it's up to 4096
bytes. 256 bytes seems REALLY small, even for an abstract base class. I'd
hope for 1024 as a minimum.
Here's a radical idea: Make the limit configurable for all tokenizers, and
expose that config option in the Solr schema.
> White space tokenizer has undocumented limit of 256 characters per token
> ------------------------------------------------------------------------
>
> Key: LUCENE-5785
> URL: https://issues.apache.org/jira/browse/LUCENE-5785
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Affects Versions: 4.8.1
> Reporter: Jack Krupansky
> Priority: Minor
>
> The white space tokenizer breaks tokens at 256 characters, which is a
> hard-wired limit of the character tokenizer abstract class (CharTokenizer).
> The limit of 256 is obviously fine for normal, natural language text, but
> excessively restrictive for semi-structured data.
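> To reproduce this without indexing anything, run the tokenizer directly. A
> minimal sketch against the Lucene 4.8 API; the lengths in the comments
> assume the 256-character break described above:
>
>     import java.io.StringReader;
>
>     import org.apache.lucene.analysis.core.WhitespaceTokenizer;
>     import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>     import org.apache.lucene.util.Version;
>
>     public class TokenLimitDemo {
>       public static void main(String[] args) throws Exception {
>         // A single 300-character "word" containing no whitespace.
>         StringBuilder sb = new StringBuilder();
>         for (int i = 0; i < 300; i++) {
>           sb.append('a');
>         }
>         try (WhitespaceTokenizer tok = new WhitespaceTokenizer(
>             Version.LUCENE_48, new StringReader(sb.toString()))) {
>           CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
>           tok.reset();
>           while (tok.incrementToken()) {
>             // Prints 256 and then 44: the single 300-character token is
>             // silently broken at the hard-wired limit.
>             System.out.println(term.length());
>           }
>           tok.end();
>         }
>       }
>     }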
> 1. Document the current limit in the Javadoc for the character tokenizer. Add
> a note to any derived tokenizers (such as the white space tokenizer) that
> token size is limited as per the character tokenizer.
> 2. Add a setMaxTokenLength method, a la the standard tokenizer, so that an
> application can control the limit. It should probably live on the character
> tokenizer abstract class so that derived tokenizer classes inherit it (see
> the sketch at the end of this description).
> 3. Disallow a token size limit of 0.
> 4. A limit of -1 would mean no limit.
> 5. Add a "token limit mode" method with three settings: "skip" (what the
> standard tokenizer does), "break" (the current behavior of the character
> tokenizer and the tokenizers derived from it, such as the white space
> tokenizer), and "trim" (what I think a lot of people might expect).
> 6. Not sure whether to change the default behavior of the character
> tokenizer from "break" mode to "skip" mode (matching the standard
> tokenizer), or to "trim" mode, which is my preference and likely what people
> expect.
> 7. Add matching attributes to the tokenizer factories for Solr, and document
> them for use in the Solr schema XML.
> At a minimum, this issue should address the documentation problem.
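> To make items 2 through 6 concrete, here is a rough sketch of the proposed
> API on the character tokenizer. All names, the defaults, and the enum are
> hypothetical; none of this exists in the current code:
>
>     /** Hypothetical sketch for LUCENE-5785; nothing here exists yet. */
>     public abstract class ConfigurableCharTokenizer {
>
>       /** Item 5: what to do when a token exceeds the limit. */
>       public enum TokenLimitMode { SKIP, BREAK, TRIM }
>
>       private int maxTokenLength = 256;  // today's hard-wired value
>       private TokenLimitMode limitMode = TokenLimitMode.BREAK;
>
>       /** Items 2-4: -1 means no limit, 0 is disallowed. */
>       public void setMaxTokenLength(int length) {
>         if (length == 0 || length < -1) {
>           throw new IllegalArgumentException(
>               "max token length must be -1 (no limit) or positive: " + length);
>         }
>         this.maxTokenLength = length;
>       }
>
>       public void setTokenLimitMode(TokenLimitMode mode) {
>         this.limitMode = mode;
>       }
>
>       /** Would be consulted by incrementToken() as characters accumulate. */
>       protected boolean withinLimit(int length) {
>         return maxTokenLength == -1 || length < maxTokenLength;
>       }
>     }
>
> The sketch keeps "break" as the default only because that is the current
> behavior; whether the default should instead be "skip" or "trim" is item 6.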