[ https://issues.apache.org/jira/browse/LUCENE-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859373#action_12859373 ]

Uwe Schindler commented on LUCENE-2407:
---------------------------------------

This is also a problem for some Asian languages. If ThaiAnalyzer were to use 
CharTokenizer, very long passages could get lost, because ThaiWordFilter would 
never see the complete string (Thai is not tokenized by the tokenizer, but 
later in the filter).

This also applies to StandardTokenizer; maybe we should set a good default when 
analyzing Thai text (ThaiAnalyzer should initialize StandardTokenizer with a 
large/infinite max token length).
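
For illustration, a minimal, untested sketch of what that could look like against the 3.0-era contrib API (the class name, the Integer.MAX_VALUE choice and the simplified chain are just placeholders; stop filtering is omitted for brevity):

{code:java}
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.th.ThaiWordFilter;
import org.apache.lucene.util.Version;

public class LongTokenThaiAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    StandardTokenizer tokenizer = new StandardTokenizer(Version.LUCENE_30, reader);
    // The default max token length is 255, so an over-long untokenized Thai
    // run can get lost before ThaiWordFilter ever sees it; raise the limit.
    tokenizer.setMaxTokenLength(Integer.MAX_VALUE);
    TokenStream result = new StandardFilter(tokenizer);
    // ThaiWordFilter performs the actual Thai word segmentation on the long token.
    result = new ThaiWordFilter(result);
    return result;
  }
}
{code}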

> make CharTokenizer.MAX_WORD_LEN parametrizable
> ----------------------------------------------
>
>                 Key: LUCENE-2407
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2407
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 3.0.1
>            Reporter: javi
>            Priority: Minor
>             Fix For: 3.1
>
>
> as discussed here 
> http://n3.nabble.com/are-long-words-split-into-up-to-256-long-tokens-tp739914p739914.html
>  it would be nice to be able to parametrize that value. 
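
To make the quoted request concrete, here is a rough, untested illustration of a letter-run tokenizer that takes the maximum word length as a constructor parameter (written against the 3.0 attribute API; the class name is made up and this is only meant to show the idea, not a patch for CharTokenizer itself):

{code:java}
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

/** Emits runs of letters, with the maximum token length chosen per instance. */
public final class ConfigurableLetterTokenizer extends Tokenizer {
  private final int maxWordLen;
  private final char[] buffer;
  private final TermAttribute termAtt = addAttribute(TermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private int offset = 0; // absolute position of the next character to read

  public ConfigurableLetterTokenizer(Reader input, int maxWordLen) {
    super(input);
    this.maxWordLen = maxWordLen;
    this.buffer = new char[maxWordLen];
  }

  @Override
  public boolean incrementToken() throws IOException {
    clearAttributes();
    int length = 0;
    int start = -1;
    int c;
    while ((c = input.read()) != -1) {
      offset++;
      if (Character.isLetter((char) c)) {
        if (length == 0) {
          start = offset - 1;         // token begins at this character
        }
        buffer[length++] = (char) c;
        if (length == maxWordLen) {   // buffer full: emit what we have so far
          break;
        }
      } else if (length > 0) {
        break;                        // non-letter character ends the token
      }
    }
    if (length == 0) {
      return false;                   // end of input, no further tokens
    }
    termAtt.setTermBuffer(buffer, 0, length);
    offsetAtt.setOffset(correctOffset(start), correctOffset(start + length));
    return true;
  }

  @Override
  public void reset(Reader input) throws IOException {
    super.reset(input);
    offset = 0;
  }
}
{code}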

