Good day, everyone. I hope you can help me. I am trying to index French and Russian documents with Lucene and am having no luck. I am new to Java, so I would really appreciate your help.
I was able to get the text from the PDFs. When I save it to a file, everything is fine and I can clearly see the Russian characters in the TXT file, but when I add it to the index it is all ??? or other garbage. Here is what I do. I first use PDFBox to extract the text.

[CODE]
String textFile = "c:/java/faq.txt";
String pdfFile = "c:/java/faq.pdf";

// First, I extract the text from the PDF
PDDocument document = PDDocument.load(pdfFile);
Writer output = new OutputStreamWriter(new FileOutputStream(textFile), "UTF-8");
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(1);
stripper.setEndPage(20);

// This saves the text into the TXT file; the TXT file is completely fine
stripper.writeText(document, output);

// But when I get the text like this to add it to the index...
String textData = stripper.getText(document);

Analyzer analyzer = new StandardAnalyzer();
Directory directory = FSDirectory.getDirectory("c:/java/collection");
IndexWriter iwriter = new IndexWriter(directory, analyzer, new IndexWriter.MaxFieldLength(250));
Document doc = new Document();
doc.add(new Field("fieldname", textData, Field.Store.YES, Field.Index.NOT_ANALYZED));
iwriter.addDocument(doc);
iwriter.optimize();
iwriter.close();
[/CODE]

The code above properly saves the extracted text to the TXT file, which I don't really need. What I want is to get the text and add it to the index right away. When I open the index files in Notepad, I see garbage instead of Russian characters.

Please help. Thank you.

--
View this message in context: http://www.nabble.com/Lucene-Indexer-Encoding-problem-tp19959504p19959504.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
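
The two symptoms described in the post ("???" versus other garbage) can be reproduced with the JDK alone, without PDFBox or Lucene. The following is only a minimal sketch under that assumption: the class `EncodingCheck` and its methods `roundTrip` and `escape` are hypothetical names made up for illustration, and the sample word is an arbitrary Russian string. It shows what happens to Cyrillic text when the writing side and the reading side disagree about the character encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only; not part of PDFBox or Lucene.
class EncodingCheck {

    // Encode the text with one charset and decode the resulting bytes with another.
    // This mimics a writer and a reader that disagree about the encoding.
    static String roundTrip(String text, Charset writeAs, Charset readAs) {
        byte[] bytes = text.getBytes(writeAs);
        return new String(bytes, readAs);
    }

    // Render every char as a hex code point, so the output does not
    // depend on the encoding of the console it is printed to.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The Russian word "Privet", written with Unicode escapes so the
        // source file itself stays plain ASCII.
        String sample = "\u041F\u0440\u0438\u0432\u0435\u0442";

        // Same charset on both sides: the text survives unchanged.
        String intact = roundTrip(sample, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println("UTF-8 -> UTF-8:   " + escape(intact));

        // Charset without Cyrillic: every letter is replaced by '?'.
        String questionMarks = roundTrip(sample, StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1);
        System.out.println("Latin-1 -> Latin-1: " + questionMarks);

        // UTF-8 bytes read as Latin-1: two garbage characters per letter.
        String garbage = roundTrip(sample, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println("UTF-8 -> Latin-1: " + escape(garbage));
    }
}
```

In this sketch the first case returns the original six letters, the second returns six question marks, and the third returns twelve unrelated characters. Printing the code points of the string returned by `stripper.getText(document)` in the same way would show whether the text is still intact inside Java before it reaches the index. Whether the garbage seen in Notepad comes from a similar mismatch when a text editor opens a binary index file is an assumption here, not something the post confirms.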