Grant Ingersoll wrote:
Are you altering (stemming) the token before it gets to the StopFilter?

On Dec 26, 2007, at 5:08 PM, Liaqat Ali wrote:

Doron Cohen wrote:
On Dec 26, 2007 10:33 PM, Liaqat Ali <[EMAIL PROTECTED]> wrote:


Using javac -encoding UTF-8 still raises the following error.

urduIndexer.java : illegal character: \65279
?
^
1 error

What I am doing wrong?



If you have the stop-words in a file, say one word in a line,
they can be read like this:

   BufferedReader r = new BufferedReader(new InputStreamReader(new
FileInputStream("Urdu.txt"),"UTF8"));
String word = r.readLine(); // loop this line, you get the picture

(Make sure to specify encoding "UTF8" when saving the file from notepad).

Regards,
Doron


Hi, Doron

The compilation problem is solved, but there is no change in the index.


public static final String[] URDU_STOP_WORDS = { "کی" ,"کا" ,"کو" ,"ہے" ,"کے" ,"نے" ,"پر" ,"اور" ,"سے","میں" ,"بھی" ,"ان" ,"ایک" ,"تھا" ,"تھی" ,"کیا" ,"ہیں" ,"کر" ,"وہ" ,"جس" ,"نہں" ,"تک" };
Analyzer analyzer = new StandardAnalyzer(URDU_STOP_WORDS);


Again these words are appeared in the index with high ranks.

Regards,
Liaqat

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ




---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


No, at this level I am not using any stemming technique. I am just trying to eliminate stop words.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to