[ 
https://issues.apache.org/jira/browse/LUCENE-4253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13422970#comment-13422970
 ] 

Nattapong Sirilappanich commented on LUCENE-4253:
-------------------------------------------------

Hi Robert,

Based on your suggestion, I found the actual problem.
The problem is that "stopwords.txt" in package "org.apache.lucene.analysis.th" 
contains a lot of words that are stop words only for one specific type of 
usage. That type of usage is already stated inside the file itself.
According to the javadoc, these words have been used by default since 
Lucene 3.6.

In my opinion, this set of words should not be used by default.
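
To illustrate the effect described above, here is a minimal, self-contained 
sketch. It does not use Lucene at all: the class StopFilterSketch, the method 
filterStopwords, and the placeholder tokens are all hypothetical and only 
model the idea that a stop filter drops every token found in its stop set. 
The actual contents of stopwords.txt are not reproduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopFilterSketch {

    // Hypothetical stand-in for a stop filter (not Lucene's StopFilter):
    // keeps only the tokens that are absent from the stop set.
    static List<String> filterStopwords(List<String> tokens, Set<String> stopwords) {
        List<String> kept = new ArrayList<String>();
        for (String token : tokens) {
            if (!stopwords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Placeholder tokens standing in for the tokenizer output.
        List<String> tokens = Arrays.asList("alpha", "beta", "gamma");

        // An overly broad default stop set (assumption: it happens to
        // cover every token in the input).
        Set<String> broadStopwords = new HashSet<String>(tokens);

        // With the broad stop set, nothing survives the filter.
        System.out.println(filterStopwords(tokens, broadStopwords));

        // With an empty stop set, every token is kept.
        System.out.println(filterStopwords(tokens, Collections.<String>emptySet()));
    }
}
```

With a stop set that is too broad for general use, the filtered stream can 
end up empty, which from the outside looks like the analyzer failing to 
tokenize the input.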
                
> ThaiAnalyzer fail to tokenize word.
> -----------------------------------
>
>                 Key: LUCENE-4253
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4253
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: Realtime Branch
>         Environment: Windows 7 SP1.
> Java 1.7.0-b147
>            Reporter: Nattapong Sirilappanich
>
> The method 
> protected TokenStreamComponents createComponents(String,Reader)
> returns a component that is unable to tokenize Thai words.
> The current return statement is:
> return new TokenStreamComponents(source, new StopFilter(matchVersion,
> result, stopwords));
> As an experiment, I changed the return statement to:
> return new TokenStreamComponents(source, result);
> This gives me a correct result.
