Indexing Non-English text

Liaqat Ali Tue, 04 Dec 2007 02:54:45 -0800

Hi,

I am facing a problem while indexing a small .txt file with Lucene. The file I want to index is in the Urdu language (a variant of Arabic and Persian). But the index I get is in Unicode form, not in the real form (the original Urdu text). The program works well for a file in English. This is the code I use for indexing:


       // Read the first line of the Urdu text file.
       FileReader file = new FileReader("urdoc.txt");
       BufferedReader buff = new BufferedReader(file);
       String line = buff.readLine();
       boolean eof = false;
       buff.close();

       // Create a new index in D:\index using the standard analyzer.
       String indexDir = "D:\\index";
       Analyzer analyzer = new StandardAnalyzer();
       boolean createFlag = true;
       IndexWriter writer =
               new IndexWriter(indexDir, analyzer, createFlag);

       // Store and tokenize the line as a single field, then write it out.
       Document document = new Document();
       document.add(new Field("fieldname", line, Field.Store.YES,
               Field.Index.TOKENIZED));
       writer.addDocument(document);
       writer.close();

Kindly guide me on what I should do. Would I have to change this code, or what else do you suggest?


Liaqat
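
One thing that may be worth checking with the code above: FileReader decodes the file with the platform default charset, which on many systems is not the encoding an Urdu text file was saved in. Below is a minimal sketch, not a confirmed fix, that reads the first line with an explicit charset. It assumes urdoc.txt is saved as UTF-8 (the post does not say which encoding the file uses), and the class name ReadUrduLine and the helper readFirstLine are hypothetical names introduced only for illustration.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReadUrduLine {
    // Hypothetical helper (not from the original post): reads the first line
    // of a file, decoding its bytes as UTF-8 instead of the platform default.
    // ASSUMPTION: the file is actually saved as UTF-8.
    public static String readFirstLine(String path) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(path),
                        StandardCharsets.UTF_8))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the file name used in the post.
        String path = args.length > 0 ? args[0] : "urdoc.txt";
        String line = readFirstLine(path);
        System.out.println(line);
    }
}
```

The string returned by readFirstLine could then be passed to the Field constructor in place of the line read through FileReader. If the file was saved in a different encoding, the charset argument would need to match it.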
