[
https://issues.apache.org/jira/browse/LUCENE-4253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423655#comment-13423655
]
Robert Muir commented on LUCENE-4253:
-------------------------------------
Right, but having less than 100% accurate segmentation isn't unique to Thai (it
happens in many other languages too).
It's always a tradeoff: if those measurements are correct and 30% of typical
Thai text is stopwords, then keeping all stopwords is a pretty significant
performance (and often relevance) degradation.
In general these lists are useful; someone can also choose to use them with
CommonGramsFilter for maybe an even better tradeoff. That's why I think it's
good to keep them (of course, as short and minimal as possible).
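To illustrate the common-grams idea mentioned above, here is a minimal sketch in
plain Java. It is a simplified model, not Lucene's CommonGramsFilter: the class
name, the method name, the "_" separator and the English sample tokens are all
assumptions made for illustration. Every token is kept, and a bigram is added
for each adjacent pair in which at least one side is a common word, so phrase
queries can match the rarer bigram instead of the very frequent single word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of the common-grams technique (not Lucene code).
final class CommonGramsSketch {

    // Emits every token, plus a bigram "a_b" for each adjacent pair in which
    // at least one of the two tokens is in commonWords.
    static List<String> commonGrams(List<String> tokens, Set<String> commonWords) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String current = tokens.get(i);
            out.add(current);
            if (i + 1 < tokens.size()) {
                String next = tokens.get(i + 1);
                if (commonWords.contains(current) || commonWords.contains(next)) {
                    out.add(current + "_" + next);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample data; a real setup would use the analyzer's stopword list.
        List<String> tokens = List.of("the", "quick", "fox", "of", "rome");
        Set<String> common = Set.of("the", "of");
        // prints [the, the_quick, quick, fox, fox_of, of, of_rome, rome]
        System.out.println(commonGrams(tokens, common));
    }
}
```

The cost is a larger index than with plain stopword removal, in exchange for
keeping the common words searchable.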
If you don't mind the downsides, you can always pass CharArraySet.EMPTY_SET as
the stopword set, as I mentioned before.
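As a rough illustration of what an empty stopword set means for the token
stream, the sketch below models a stop filter in plain Java. It does not use
Lucene; the class, the method and the sample tokens are assumptions, and the
data is made up so that 3 of 10 tokens are stopwords, mirroring the 30% figure
discussed above.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical model of stopword filtering (not Lucene's StopFilter).
final class StopSetSketch {

    // Keeps only the tokens that are not in the stopword set.
    static List<String> filter(List<String> tokens, Set<String> stopwords) {
        return tokens.stream()
                .filter(t -> !stopwords.contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample: "the" (twice) and "on" are stopwords, 3 of 10 tokens.
        List<String> tokens = List.of(
                "the", "cat", "sat", "on", "the", "mat", "by", "a", "warm", "fire");

        List<String> withStopSet = filter(tokens, Set.of("the", "on"));
        List<String> withEmptySet = filter(tokens, Set.of());

        System.out.println(withStopSet.size());  // prints 7
        System.out.println(withEmptySet.size()); // prints 10
    }
}
```

With the empty set every token passes through, so nothing is lost to stopword
removal, but every stopword also ends up in the index.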
> ThaiAnalyzer fail to tokenize word.
> -----------------------------------
>
> Key: LUCENE-4253
> URL: https://issues.apache.org/jira/browse/LUCENE-4253
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Affects Versions: Realtime Branch
> Environment: Windows 7 SP1.
> Java 1.7.0-b147
> Reporter: Nattapong Sirilappanich
>
> The method
> protected TokenStreamComponents createComponents(String,Reader)
> returns a component that is unable to tokenize Thai words.
> The current return statement is:
> return new TokenStreamComponents(source, new StopFilter(matchVersion,
> result, stopwords));
> In my experiment I changed the return statement to:
> return new TokenStreamComponents(source, result);
> It gives me the correct result.