[ https://issues.apache.org/jira/browse/LUCENE-4253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423655#comment-13423655 ]

Robert Muir commented on LUCENE-4253:
-------------------------------------

Right, but having less than 100% segmentation accuracy isn't unique to Thai (it happens in 
many other languages too).

It's always a tradeoff: if those measurements are correct and 30% of typical 
Thai text is stopwords,
then it's a pretty significant performance (and often relevance) degradation to 
keep all stopwords.

In general these lists are useful; someone can also choose to use them with the 
CommonGrams filter for maybe 
an even better tradeoff. That's why I think it's good to keep them (of course as 
short and minimal as possible).
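
As a sketch of the CommonGrams alternative, an analyzer could chain the filter after Thai word segmentation. This assumes the Lucene 4.x `CommonGramsFilter(Version, TokenStream, CharArraySet)` constructor and the `StandardTokenizer` + `ThaiWordFilter` chain that `ThaiAnalyzer` uses; the class name and wiring here are illustrative, not the shipped implementation:

```java
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.commongrams.CommonGramsFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.th.ThaiWordFilter;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

// Instead of dropping stopwords, CommonGramsFilter emits bigrams joining each
// common word to its neighbor, so no terms are lost but phrase queries over
// stopword-heavy text stay fast.
public class ThaiCommonGramsAnalyzer extends Analyzer {
    private final CharArraySet commonWords; // e.g. the Thai stopword list

    public ThaiCommonGramsAnalyzer(CharArraySet commonWords) {
        this.commonWords = commonWords;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_40, reader);
        TokenStream result = new ThaiWordFilter(Version.LUCENE_40, source);
        result = new CommonGramsFilter(Version.LUCENE_40, result, commonWords);
        return new TokenStreamComponents(source, result);
    }
}
```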

If someone doesn't mind the downsides, you can always pass 
CharArraySet.EMPTY_SET as the stopwords parameter, as I mentioned before.
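
For reference, a minimal sketch of that option, assuming the Lucene 4.x `ThaiAnalyzer(Version, CharArraySet)` constructor:

```java
import org.apache.lucene.analysis.th.ThaiAnalyzer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class NoStopwordsExample {
    public static void main(String[] args) {
        // Passing the empty set disables stopword removal entirely,
        // at the cost of the performance/relevance tradeoff above.
        ThaiAnalyzer analyzer = new ThaiAnalyzer(Version.LUCENE_40, CharArraySet.EMPTY_SET);
        // ... use the analyzer for indexing/searching as usual ...
    }
}
```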
 
                
> ThaiAnalyzer fail to tokenize word.
> -----------------------------------
>
>                 Key: LUCENE-4253
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4253
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: Realtime Branch
>         Environment: Windows 7 SP1.
> Java 1.7.0-b147
>            Reporter: Nattapong Sirilappanich
>
> The method
> protected TokenStreamComponents createComponents(String, Reader)
> returns a component that is unable to tokenize Thai words.
> The current return statement is:
> return new TokenStreamComponents(source, new StopFilter(matchVersion, result, stopwords));
> In my experiment, I changed the return statement to:
> return new TokenStreamComponents(source, result);
> This gives me the correct result.
