Ben,
I downloaded and installed PDFBox, and I can run:
java org.pdfbox.Main <PDF-file> <output-text-file>
to convert a .pdf file to a text file.
Then I tried to integrate it with Lucene. I added the following branch to
IndexHTML.java:
    else if (file.getPath().endsWith(".pdf")) {
        Document doc = LucenePDFDocument.getDocument(file);
        System.out.println("adding " + "pdf files");
        writer.addDocument(doc);
    }
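To rule out the extension test itself, the check in that branch can be tried on its own, without Lucene or PDFBox on the classpath. This is only a minimal sketch: the class PdfFilter and the method isPdf are hypothetical names, not part of the Lucene demo or PDFBox, and matching ".PDF" as well as ".pdf" is an assumption that goes beyond the branch above.

```java
import java.io.File;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the Lucene demo or PDFBox.
class PdfFilter {
    // Returns true when the file name ends with ".pdf", ignoring case
    // (case-insensitivity is an assumption; the original branch is case-sensitive).
    static boolean isPdf(File file) {
        return file.getName().toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    public static void main(String[] args) {
        // Print how each path given on the command line would be classified.
        for (String path : args) {
            String kind = isPdf(new File(path)) ? "pdf" : "other";
            System.out.println(path + " -> " + kind);
        }
    }
}
```

Running it against a few of the file names in the directory being indexed shows whether they would reach the PDF branch at all.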
It compiled with ant (ant wardemo). However, when I tested with:
java org.apache.lucene.demo.IndexHTML -create -index {index-dir} ..
it seems the new IndexHTML.java was still not picked up, and the .pdf files
were still not indexed.
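One way to check which copy of IndexHTML the JVM actually loads might be a small diagnostic like the one below. It is a sketch under assumptions: the class WhereLoaded and the method locationOf are hypothetical names, and it only reports the jar or directory a class comes from, using the standard Class.getProtectionDomain() and CodeSource.getLocation() calls.

```java
import java.security.CodeSource;

// Hypothetical diagnostic for illustration; not part of Lucene or PDFBox.
class WhereLoaded {
    // Returns the jar or directory a class was loaded from, or a placeholder
    // when the JVM does not expose a location (for example, core JDK classes).
    static String locationOf(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(bootstrap or unknown)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Default to the demo class from this message if no name is given.
        String name = args.length > 0 ? args[0] : "org.apache.lucene.demo.IndexHTML";
        System.out.println(name + " loaded from " + locationOf(Class.forName(name)));
    }
}
```

Run with the same classpath as the indexing command, this prints the location of the class that is really in use. If that location is an older demo jar and not the freshly built output, the modified class is probably being shadowed by the old one.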
Did I miss something here?
Regards,
George
>===== Original Message From Lucene Users List
<[EMAIL PROTECTED]> =====
>I would like to announce the next release of PDFBox. PDFBox allows for
>PDF documents to be indexed using lucene through a simple interface.
>Please take a look at org.pdfbox.searchengine.lucene.LucenePDFDocument,
>which will extract all text and PDF document summary properties as lucene
>fields.
>
>You can obtain the latest release from http://www.pdfbox.org
>
>Please send all bug reports to me and attach the PDF document when
>possible.
>
>RELEASE 0.6.0
>-Massive improvements to memory footprint.
>-Must call close() on the COSDocument (LucenePDFDocument does this for you)
>-Really fixed the bug where small documents were not being indexed.
>-Fixed bug where no whitespace existed between obj and start of object.
> Exception in thread "main" java.io.IOException: expected='obj'
> actual='obj<</Pro
>-Fixed issue with spacing where textLineMatrix was not being copied
> properly
>-Fixed 'bug' where parsing would fail with some pdfs with double endobj
> definitions
>-Added PDF document summary fields to the lucene document
>
>
>Thank you,
>Ben Litchfield
>http://www.pdfbox.org
>
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: [EMAIL PROTECTED]
>For additional commands, e-mail: [EMAIL PROTECTED]