I've been doing some testing with the newest version of PDFBox to see
if the "PDF parsing hanging" bug still exists.

Unfortunately, even with the newer version, PDF parsing still hangs
and grinds the fetching process to a halt.  I've been able to
replicate the bug by running PDFBox's ExtractText command-line tool on
these PDF documents:

http://www.ptc.com/support/notice/ProI_Quality41_CH.pdf
http://www.qualstar.com/WORM_Tape.pdf
http://www.ptc.com/support/notice/ProI_Quality41_CS.pdf
http://www.oig.hhs.gov/oas/reports/region3/30100023.pdf
http://www.epa.gov/smartgrowth/pdf/built.pdf

I've filed a bug report with PDFBox, so hopefully this issue will be
resolved soon.

Nevertheless, perhaps we should implement a timeout for fetcher/parser
threads to protect against renegade plugins?  For example, if a fetcher
thread hasn't checked in after X seconds, it could be interrupted.
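
A minimal sketch of that timeout idea, using only java.util.concurrent.
The class name TimedTask and the method runWithTimeout are hypothetical
and are not part of Nutch or PDFBox; the Callable stands in for whatever
call actually invokes the parser plugin.  The work runs on a separate
thread, and the caller waits at most the given number of milliseconds
before cancelling it:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not an existing Nutch or PDFBox class.
public class TimedTask {

    // Daemon threads, so a task that ignores interruption cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "timed-task");
        t.setDaemon(true);
        return t;
    });

    // Runs the task on a pool thread and waits at most timeoutMillis for its result.
    public static <T> T runWithTimeout(Callable<T> task, long timeoutMillis)
            throws TimeoutException, ExecutionException, InterruptedException {
        Future<T> future = POOL.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Request interruption of the worker thread, then report the timeout.
            future.cancel(true);
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a parse that finishes quickly.
        String text = runWithTimeout(() -> "parsed", 1000);
        System.out.println(text);

        // Stand-in for a parse that hangs.
        try {
            runWithTimeout(() -> { Thread.sleep(60_000); return "never"; }, 200);
        } catch (TimeoutException e) {
            System.out.println("timed out");
        }
    }
}
```

One caveat: cancel(true) only interrupts the worker thread.  A parser
stuck in a loop that never checks the interrupt flag and never blocks
will keep running and using CPU, although the fetcher itself would no
longer wait on it.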

Andy


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
