Lucene with Khmer ? (Language in cambodia)

Fournaux Nicolas Wed, 24 Jan 2007 02:29:36 -0800

Good morning all (or good afternoon)


I used Lucene many times before, to search text in French Or English. All
worked fine :-)

 

But now I have a new challenge, I need to use Lucene with Khmer (Khmer is
the Cambodia’s language, it looks like Thai or Indian)

 

But it doesn’t work, my code is well executed but it found no results, I
give you my code below

 

I thought UTF-8 is 100% handled by Java and that we have “nothing to do”

 

My code is working fine when I use English words.

 

Thanks in advance for your help :-)

 

Here is my source code :

 

/***************************************************************************
*************************************/

 

public class TestKhmer

{

 

            public static void main(String[] args) throws Exception

            {           

Analyzer analyzer = new StandardAnalyzer();

Directory directory = FSDirectory.getDirectory("C:\\Folder\\indexLucene",
true);

IndexWriter iwriter = new IndexWriter(directory, analyzer, true);

 

iwriter.setMaxFieldLength(25000);

            

                        Document doc = new Document();

                        String text = getContents("C:\\Folder\\file.txt");
// this file was saved as UTF-8 format by UltraEdit , when I open it I see
my Khmer charactere

 

 

                        Field field = new Field("text", text,
Field.Store.YES, Field.Index.TOKENIZED);

                        Field field2 = new Field("filename", "file.txt" ,
Field.Store.YES, Field.Index.TOKENIZED);

                        doc.add(field);

                        doc.add(field2);

                        iwriter.addDocument(doc);

 

                        iwriter.close();

 

                        // Now search the index:

                        IndexSearcher isearcher = new
IndexSearcher(directory);

                        String stringToSearch =
getContents("C:\\Folder\\dataToSearch.txt"); // my search string is located
in a text file, this file was saved as UTF-8 format by UltraEdit, when I
open it I see my Khmer charactere

                        String stringQuery = "text:" + stringToSearch ;

                        QueryParser queryParser = new QueryParser("text" ,
analyzer);

                        Query query = queryParser.parse(stringQuery);

 

                        Hits hits = isearcher.search(query);

 

                        // Iterate through the results:

                        for (int i = 0; i < hits.length(); i++)

                         {

                                    Document hitDoc = hits.doc(i);

                                    System.out.println("Result : " +
hitDoc.get("filename"));

                        }

 

                        isearcher.close();

                        directory.close();

            }

 

private static String getContents(String path) throws Exception

            {

String line = null;

                        StringBuffer sb = new StringBuffer();

                        BufferedReader br;

 

                        try

                        {

                                    br = new BufferedReader( new
InputStreamReader( new FileInputStream(path), "UTF-8")); // as I told you,
my file is in UTF-8 format

 

while((line = br.readLine()) != null)

{

    sb.append(line + "\n");

}

 

                        } catch (Exception e)

{

 

e.printStackTrace();

}

 

                                    return sb.toString();

}

 

}

 

/***************************************************************************
*************************************/

 

Best regards

 

Nicolas


-- 
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.410 / Virus Database: 268.17.8/649 - Release Date: 1/23/2007

Lucene with Khmer ? (Language in cambodia)

Reply via email to