Are you altering (stemming) the token before it gets to the StopFilter?
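
To illustrate why the order matters, here is a minimal plain-Java sketch. It does not use Lucene; stem() is a hypothetical stand-in for whatever filter might alter the token, and the English tokens are made up for the demonstration. If a token is changed before the stop-word lookup, it may no longer match any entry in the stop list and so passes through.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FilterOrderDemo {
    // Hypothetical stand-in for a stemmer: strips one trailing "s".
    public static String stem(String token) {
        if (token.length() > 1 && token.endsWith("s")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    // Stop words are removed first, then the surviving tokens are stemmed.
    public static List<String> stopThenStem(List<String> tokens, Set<String> stop) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!stop.contains(t)) {
                out.add(stem(t));
            }
        }
        return out;
    }

    // Tokens are stemmed first, so the lookup sees the altered form.
    public static List<String> stemThenStop(List<String> tokens, Set<String> stop) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            String s = stem(t);
            if (!stop.contains(s)) {
                out.add(s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("this", "is", "lucene");
        Set<String> stop = new HashSet<>(Arrays.asList("this", "is"));
        System.out.println(stopThenStem(tokens, stop)); // [lucene]
        System.out.println(stemThenStop(tokens, stop)); // [thi, i, lucene]
    }
}
```

In the second ordering the stop words survive because "thi" and "i" are not in the stop set.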
On Dec 26, 2007, at 5:08 PM, Liaqat Ali wrote:
Doron Cohen wrote:
On Dec 26, 2007 10:33 PM, Liaqat Ali <[EMAIL PROTECTED]> wrote:
Using javac -encoding UTF-8 still raises the following error.
urduIndexer.java : illegal character: \65279
?
^
1 error
What am I doing wrong?
If you have the stop-words in a file, say one word per line,
they can be read like this:
BufferedReader r = new BufferedReader(new InputStreamReader(
    new FileInputStream("Urdu.txt"), "UTF8"));
String word = r.readLine(); // loop this line, you get the picture
(Make sure to specify encoding "UTF8" when saving the file from
notepad).
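
A fuller sketch of that loop, under a few assumptions: the class name StopWordLoader and the method load() are made up for illustration, the file holds one word per line, and Java 7 or later is available. Notepad typically writes a byte order mark (U+FEFF, which is the \65279 in the javac error) at the start of a UTF-8 file, and Java's UTF-8 decoder does not remove it, so the sketch drops it from the first line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class StopWordLoader {
    /** Reads one stop word per line from a UTF-8 file, skipping a leading BOM and blank lines. */
    public static String[] load(Path file) throws IOException {
        List<String> words = new ArrayList<>();
        try (BufferedReader r = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            boolean first = true;
            while ((line = r.readLine()) != null) {
                // A BOM, if present, shows up as char 0xFEFF at the start of the first line.
                if (first && !line.isEmpty() && line.charAt(0) == 0xFEFF) {
                    line = line.substring(1);
                }
                first = false;
                line = line.trim();
                if (!line.isEmpty()) {
                    words.add(line);
                }
            }
        }
        return words.toArray(new String[0]);
    }

    public static void main(String[] args) throws IOException {
        // "Urdu.txt" is the file name used in the snippet above.
        String[] stopWords = load(Paths.get(args.length > 0 ? args[0] : "Urdu.txt"));
        System.out.println(stopWords.length + " stop words loaded");
    }
}
```

The resulting array can then be passed wherever a String[] of stop words is expected.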
Regards,
Doron
Hi, Doron
The compilation problem is solved, but there is no change in the
index.
public static final String[] URDU_STOP_WORDS = {
    "کی", "کا", "کو", "ہے", "کے", "نے", "پر", "اور", "سے", "میں", "بھی",
    "ان", "ایک", "تھا", "تھی", "کیا", "ہیں", "کر", "وہ", "جس", "نہں", "تک"
};
Analyzer analyzer = new StandardAnalyzer(URDU_STOP_WORDS);
These words still appear in the index with high ranks.
Regards,
Liaqat
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ