Good day guys,
I hope you can help me. I am trying to index French and Russian documents with
Lucene and am having no luck. I am new to Java, so I really need your
help.
I was able to extract the text from the PDFs. When I save it to a file, it is
all fine and I can clearly see the Russian characters in the txt file, but when
I add it to the index it is all ??? or other garbage.
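For illustration, here is a minimal standalone sketch of one way text can turn into ???. Whether this is related to my problem is only an assumption. The class name EncodingCheck, the method roundTrip and the sample words are made up for this example and are not part of my program. It encodes a word to bytes and decodes it again, once with UTF-8 and once with ISO-8859-1, which has no Cyrillic letters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
public class EncodingCheck {

    // Encode the text to bytes with the given charset, then decode it again.
    // String.getBytes(Charset) replaces characters the charset cannot
    // represent with the charset's replacement byte ('?' for ISO-8859-1).
    public static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // A Russian word written with Unicode escapes, so the example does
        // not depend on the encoding of this source file.
        String russian = "\u041f\u0440\u0438\u0432\u0435\u0442";

        // UTF-8 can represent Cyrillic, so the word survives the round trip.
        System.out.println(roundTrip(russian, StandardCharsets.UTF_8).equals(russian)); // prints true

        // ISO-8859-1 cannot, so every letter becomes a question mark.
        System.out.println(roundTrip(russian, StandardCharsets.ISO_8859_1)); // prints ??????
    }
}
```

With UTF-8 the word comes back unchanged. With ISO-8859-1 every Cyrillic letter is replaced by ?, while French accented letters such as é survive because ISO-8859-1 contains them.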
Here is what I do:
I first use PDFBox to extract the text.
[CODE]
textFile = "c:/java/faq.txt";
pdfFile = "c:/java/faq.pdf";
// First, I get the text from the PDF
document = PDDocument.load( pdfFile );
output = new OutputStreamWriter( new FileOutputStream( textFile ), "UTF-8" );
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage( 1 );
stripper.setEndPage( 20 );
// This saves the text into the txt file; the txt file is completely fine
stripper.writeText(document, output);
// But when I get the text like this to add it to the index
textData = stripper.getText(document);
Analyzer analyzer = new StandardAnalyzer();
Directory directory = FSDirectory.getDirectory("c:/java/collection");
IndexWriter iwriter = new IndexWriter(directory, analyzer, new IndexWriter.MaxFieldLength(250));
Document doc = new Document();
doc.add(new Field("fieldname", textData, Field.Store.YES, Field.Index.NOT_ANALYZED));
iwriter.addDocument(doc);
iwriter.optimize();
iwriter.close();
[/CODE]
The code above correctly saves the extracted text to the txt file, which I don't
really need. What I want is to get the text and add it to the index right away.
When I open the index files in Notepad, I see garbage instead of Russian
characters.
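A small standalone sketch of why a text viewer can show garbage even when the data is fine. This is only an illustration under assumptions: I have not verified how Lucene stores text inside its index files, and how Notepad picks an encoding is also an assumption here. The class ViewerCheck and the method viewAs are hypothetical names, and ISO-8859-1 merely stands in for whatever single-byte charset a viewer might guess.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
public class ViewerCheck {

    // Store the text as UTF-8 bytes, then decode those bytes with the
    // charset a viewer assumes. A wrong assumption changes what is shown,
    // not the bytes themselves.
    public static String viewAs(String text, Charset assumed) {
        byte[] stored = text.getBytes(StandardCharsets.UTF_8);
        return new String(stored, assumed);
    }

    public static void main(String[] args) {
        // A six-letter Russian word, written with Unicode escapes.
        String russian = "\u041f\u0440\u0438\u0432\u0435\u0442";

        // Decoded with the right charset, the text is intact.
        System.out.println(viewAs(russian, StandardCharsets.UTF_8).equals(russian)); // prints true

        // Decoded as ISO-8859-1, each of the 12 UTF-8 bytes becomes its own
        // character, so the viewer shows 12 unrelated symbols.
        String garbled = viewAs(russian, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.equals(russian)); // prints false
        System.out.println(garbled.length());        // prints 12
    }
}
```

If something like this is what happens, the garbage in Notepad would not by itself prove that the text in the index is damaged. Reading the stored field back through Lucene would be a more reliable check.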
Please help. Thank you.
--
View this message in context:
http://www.nabble.com/Lucene-Indexer-Encoding-problem-tp19959504p19959504.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.