Ben,

I downloaded PDFBox and installed it, and I can use:
 java org.pdfbox.Main <PDF-file> <output-text-file>
to convert a .pdf file to a text file.

Then I tried to integrate it with Lucene. I changed the following code in
IndexHTML.java:

else if (file.getPath().endsWith(".pdf")) {
    Document doc = LucenePDFDocument.getDocument(file);
    System.out.println("adding " + "pdf files");
    writer.addDocument(doc);
}

It compiled with ant (ant wardemo). However, when I tested with:
java org.apache.lucene.demo.IndexHTML -create -index {index-dir} ..

it seems that it still did not pick up the new IndexHTML.java and still did
not index .pdf files.
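
One possible cause (an assumption, not something confirmed here) is that an
older compiled copy of IndexHTML, for example inside a demo jar, comes
earlier on the classpath than the rebuilt class. A minimal sketch for
checking where the JVM actually loads a class from is shown below. The class
name ClassOrigin and the method originOf are hypothetical, introduced only
for illustration; the default class name is the one from the command above.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper: prints the jar or directory a class is loaded from.
public class ClassOrigin {

    // Returns the location (jar or directory URL) the named class is loaded from,
    // or a marker string for classes loaded by the bootstrap loader.
    static String originOf(String className) throws ClassNotFoundException {
        // 'false' avoids running the class's static initializers.
        Class<?> cls = Class.forName(className, false, ClassOrigin.class.getClassLoader());
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        return src == null ? "(bootstrap class path)" : src.getLocation().toString();
    }

    public static void main(String[] args) throws Exception {
        // Defaults to the demo class from the command above; pass another name to check it.
        String name = args.length > 0 ? args[0] : "org.apache.lucene.demo.IndexHTML";
        System.out.println(name + " loaded from " + originOf(name));
    }
}
```

Running it with the same classpath as the indexing command should show
whether the class comes from the freshly built output or from an older jar.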


Did I miss something here?

Regards,

George

>===== Original Message From Lucene Users List <[EMAIL PROTECTED]> =====
>I would like to announce the next release of PDFBox.  PDFBox allows for
>PDF documents to be indexed using lucene through a simple interface.
>Please take a look at org.pdfbox.searchengine.lucene.LucenePDFDocument,
>which will extract all text and PDF document summary properties as lucene
>fields.
>
>You can obtain the latest release from http://www.pdfbox.org
>
>Please send all bug reports to me and attach the PDF document when
>possible.
>
>RELEASE 0.6.0
>-Massive improvements to memory footprint.
>-Must call close() on the COSDocument (LucenePDFDocument does this for you)
>-Really fixed the bug where small documents were not being indexed.
>-Fixed bug where no whitespace existed between obj and start of object.
>    Exception in thread "main" java.io.IOException: expected='obj'
>    actual='obj<</Pro
>-Fixed issue with spacing where textLineMatrix was not being copied
> properly
>-Fixed 'bug' where parsing would fail with some pdfs with double endobj
> definitions
>-Added PDF document summary fields to the lucene document
>
>
>Thank you,
>Ben Litchfield
>http://www.pdfbox.org
>
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: [EMAIL PROTECTED]
>For additional commands, e-mail: [EMAIL PROTECTED]


