[
https://issues.apache.org/jira/browse/LUCENE-4253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423651#comment-13423651
]
Nattapong Sirilappanich commented on LUCENE-4253:
-------------------------------------------------
Hi Robert,
Stop words are only useful when tokenization is correct.
The problem, as stated in the thesis, is that no technology available today
can tokenize Thai text with 100% accuracy.
I could try the approach from the thesis, but it would be risky if it does not
deliver what the thesis promises.
My preference for now is to use no stop words at all, to avoid potential problems.
One example is the word "คงอยู่" (a two-syllable Thai word meaning "to persist"
or "to survive").
It is segmented into "คง" (meaning "may", "might" or "probably" in English) and
"อยู่" (meaning "stay", "live" or "reside" in English). With the existing stop
words, there is no way to find this word. With the new stop words from the
thesis, the term "คง" is the only way to find it, which does not make sense.
Why should a search for a word meaning "might" return a result containing a
word meaning "survive"?
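
To make the "คงอยู่" case concrete, here is a minimal sketch in plain Java with
no Lucene dependency. It only illustrates the argument and is not the actual
ThaiAnalyzer code. The class name StopWordDemo, the method removeStopWords, the
hard-coded segmenter output and both stop sets are assumptions made for
illustration. The stop sets are reduced to the two tokens discussed here and
are not the real lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordDemo {

    // Drops every token that appears in the stop set, as a stop filter would.
    public static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed segmenter output for the single word "คงอยู่".
        List<String> tokens = Arrays.asList("คง", "อยู่");

        // Hypothetical stand-in for the existing list: both segments are stop words.
        Set<String> existingStopWords = new HashSet<>(Arrays.asList("คง", "อยู่"));
        // Hypothetical stand-in for the thesis list: only "อยู่" is a stop word.
        Set<String> thesisStopWords = new HashSet<>(Arrays.asList("อยู่"));

        // Nothing is left to index, so the word cannot be found at all.
        System.out.println(removeStopWords(tokens, existingStopWords));
        // Only "คง" is left, so the word is found only by searching for "might".
        System.out.println(removeStopWords(tokens, thesisStopWords));
        // Without stop words both segments are indexed.
        System.out.println(removeStopWords(tokens, Collections.<String>emptySet()));
    }
}
```

With no stop words, both segments stay in the index, so a query that is
segmented the same way can still match the original word.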
> ThaiAnalyzer fail to tokenize word.
> -----------------------------------
>
> Key: LUCENE-4253
> URL: https://issues.apache.org/jira/browse/LUCENE-4253
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Affects Versions: Realtime Branch
> Environment: Windows 7 SP1.
> Java 1.7.0-b147
> Reporter: Nattapong Sirilappanich
>
> The method
> protected TokenStreamComponents createComponents(String,Reader)
> returns a component that is unable to tokenize Thai words.
> The current return statement is:
> return new TokenStreamComponents(source, new StopFilter(matchVersion,
> result, stopwords));
> In my experiment, I changed the return statement to:
> return new TokenStreamComponents(source, result);
> This gives me the correct result.