Re: Index entire filesystem

Andrzej Bialecki Wed, 05 Nov 2003 05:48:47 -0800

Stefan Groschupf wrote:

There is some ongoing work for nutch.org.
May be we can bundle all work together?! <open source>
Nutch has alraeady a java *.doc, *.pdf parser as well .

Stefan

I can give you a short account of my experiences with various PDF parsers out there, for the purpose of PDF -> text conversion:

* JPedal: previous versions used LGPL, as opposed to GPL it uses now. In my view, this practically excludes JPedal from commercial applications (which seems to be the point of this change). The library was able to correctly parse ca. 70% of my PDF collection. It is also clear how to extend some of the core algorithms for text grouping so that it can handle more complex layouts. It has a nice image extractor, and rasterizer, which means you can easily provide a preview or create thumbnails. However, it's excruciatingly slow and memory hungry.

* Etymon PJ: GPL :-( I have little experience with this library, because it lacks higher-level API for e.g. text extraction, and this task is non-trivial... There is also a commercial version, but I don't know the details.

* PDFBox: by far the best Java parser around, IMHO. It doesn't handle graphics at all (which might be a disadvantage for some), but does a very good job on text objects. It was able to correctly handle ca. 90% of my PDFs - the ones that failed either used font subsetting or were not quite up to the PDF spec. It can handle encrypted documents as well. However, it is relatively slow and memory hungry.

* Xpdf: not a Java library at all but native executable / library, and also GPL - however, as I understand GPL you are allowed to bundle it with your Java application so long as it is a separate executable (so using JNI is probably out of question...). It is my favorite solution right now for PDF-text conversion - it's very fast, very accurate and able to handle even multi-column layouts. Its Pdf-HTML converter is breath-taking, you have to see it to believe. For large PDFs (around 10MB) PDFBox would take ca. 120 sec for processing, and Xpdf took (including overheads for forking and reading from stdout) ca. 10 secs.

So, as usually there is no ideal solution ... but rather than start yet another PDF parser effort I would encourage you to join efforts with e.g. PDFBox project.

Just my 0.02 PLN on the subject :-)

Regarding the format detection and conversion, I'm still looking for Java implementation of file(1) ... For now, I'm using the following steps to determine the file format:

* by extension - but you can burn your fingers with this. I had a case of a broken web ripper that saved everything (html, images, zips etc) as *.asp files, and obviously my application fell flat when it tried to swallow such files...

* by a subset of tests borrowed from file(1).

* by using fall-back converters when the primary converter fails. Obviously quite expensive. Fall-back converters register themselves for various formats they can handle to varying degrees (including Java implementation of strings(1) command as the last resort).

--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Index entire filesystem

Reply via email to