I would like to announce the next release of PDFBox. PDFBox allows PDF documents to be indexed with Lucene through a simple interface. Please take a look at org.pdfbox.searchengine.lucene.LucenePDFDocument, which extracts all text and the PDF document summary properties as Lucene fields.
You can obtain the latest release from http://www.pdfbox.org

Please send all bug reports to me and attach the PDF document when possible.

RELEASE 0.6.0
-Massive improvements to memory footprint.
-Must call close() on the COSDocument (LucenePDFDocument does this for you)
-Really fixed the bug where small documents were not being indexed.
-Fixed bug where no whitespace existed between obj and start of object.
     Exception in thread "main" java.io.IOException: expected='obj' actual='obj<</Pro
-Fixed issue with spacing where textLineMatrix was not being copied properly
-Fixed 'bug' where parsing would fail with some PDFs with double endobj definitions
-Added PDF document summary fields to the Lucene document

Thank you,
Ben Litchfield
http://www.pdfbox.org
