I've been doing some testing with the newest version of PDFBox to see if the "PDF parsing hanging" bug still exists.
Unfortunately, even with the newer version, PDF parsing still hangs and grinds the fetching process to a halt. I've been able to replicate the bug by using PDFBox's ExtractText command-line tool with these PDF documents:

http://www.ptc.com/support/notice/ProI_Quality41_CH.pdf
http://www.qualstar.com/WORM_Tape.pdf
http://www.ptc.com/support/notice/ProI_Quality41_CS.pdf
http://www.oig.hhs.gov/oas/reports/region3/30100023.pdf
http://www.epa.gov/smartgrowth/pdf/built.pdf

I've filed a bug report with PDFBox, so hopefully this issue will be resolved soon.

Nevertheless, perhaps we should implement a timeout for fetcher/parser threads to protect against renegade plugins? Maybe if a fetcher thread hasn't checked in after X seconds, the thread is interrupted.

Andy
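
P.S. To make the timeout idea a bit more concrete, here is a rough sketch of one way it could work: run the parse call on a worker thread and give up on it after a fixed wait. This is only an illustration built on java.util.concurrent. The class name ParseTimeout, the method runWithTimeout, and the fallback value are all hypothetical and not taken from the existing fetcher code, and the "hung parser" in main is simulated with Thread.sleep rather than a real PDFBox call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: not part of any existing fetcher or PDFBox API.
class ParseTimeout {

    /**
     * Runs the task on a separate worker thread and waits at most
     * timeoutMillis for a result. Returns the fallback value if the task
     * times out, throws, or the calling thread is interrupted.
     */
    static <T> T runWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parser-worker");
            t.setDaemon(true); // a stuck parser must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // requests an interrupt of the worker thread
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the task itself threw an exception
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A well-behaved "parse" that finishes immediately.
        String fast = runWithTimeout(() -> "parsed", 1000, "timed out");
        System.out.println(fast);

        // A simulated hang: sleeps far longer than the allowed 200 ms.
        String slow = runWithTimeout(() -> {
            Thread.sleep(60_000);
            return "parsed";
        }, 200, "timed out");
        System.out.println(slow);
    }
}
```

One caveat with this approach: an interrupt is only a request. It wakes a thread that is sleeping or blocked in an interruptible call, but a parser spinning in a tight loop that never checks its interrupt flag will keep running in the background. In that case the timeout still lets the caller move on, which is the main point, but the stuck thread keeps burning CPU until the process exits.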
