I've been doing some testing with the newest version of PDFBox to see
if the "PDF parsing hanging" bug still exists.

Unfortunately, even with the newer version, PDF parsing still hangs
and grinds the fetching process to a halt.  I've been able to
reproduce the bug with PDFBox's ExtractText command-line tool on
these PDF documents:

http://www.ptc.com/support/notice/ProI_Quality41_CH.pdf
http://www.qualstar.com/WORM_Tape.pdf
http://www.ptc.com/support/notice/ProI_Quality41_CS.pdf
http://www.oig.hhs.gov/oas/reports/region3/30100023.pdf
http://www.epa.gov/smartgrowth/pdf/built.pdf

I've filed a bug report with PDFBox, so hopefully this issue will be
resolved soon.

Nevertheless, perhaps we should implement a timeout for fetcher/parser
threads to protect against renegade plugins?  Maybe if a fetcher
thread hasn't checked in after X seconds, the thread is interrupted.
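
To make the idea concrete, here is a rough sketch of what such a timeout
could look like, using only java.util.concurrent.  The class name
TimeoutSketch, the helper runWithTimeout, and the fallback value are all
made up for illustration; none of this is existing fetcher or PDFBox code.
The parse call runs on a worker thread, and if it has not finished within
the limit the worker is interrupted and a fallback result is returned.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimeoutSketch {

    /**
     * Hypothetical helper: runs the task on a worker thread and waits at most
     * timeoutMillis for it. On timeout or failure, returns the fallback.
     */
    static <T> T runWithTimeout(Callable<T> task, long timeoutMillis, T fallback)
            throws InterruptedException {
        // Daemon worker, so a parser that never returns cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parser-worker");
            t.setDaemon(true);
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Ask the worker to stop by interrupting it.
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            // The task itself threw; treat it like a failed parse.
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a parse that hangs; a real caller would invoke the parser here.
        String result = runWithTimeout(() -> {
            Thread.sleep(60_000);
            return "parsed";
        }, 500, "timed out");
        System.out.println(result);
    }
}
```

One caveat: interruption is cooperative in Java.  If the hung parser is
spinning in a loop that never checks the interrupt flag or blocks, the
worker thread keeps running after the timeout; the caller just stops
waiting for it.  That would still let the fetcher move on, but the stuck
thread would keep burning CPU.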

Andy
