I've been doing some testing with the newest version of PDFBox to see if the "PDF parsing hanging" bug still exists.
Unfortunately, even with the newer version, PDF parsing still hangs and grinds the fetching process to a halt. I've been able to replicate the bug by using PDFBox's ExtractText command line tool with these PDF documents:

http://www.ptc.com/support/notice/ProI_Quality41_CH.pdf
http://www.qualstar.com/WORM_Tape.pdf
http://www.ptc.com/support/notice/ProI_Quality41_CS.pdf
http://www.oig.hhs.gov/oas/reports/region3/30100023.pdf
http://www.epa.gov/smartgrowth/pdf/built.pdf

I've filed a bug report with PDFBox, so hopefully this issue will be resolved soon. Nevertheless, perhaps we should implement a timeout for fetcher/parser threads to protect against renegade plugins? Maybe if a fetcher thread hasn't checked in after X seconds, the thread is interrupted.

Andy

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
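
To make the timeout suggestion concrete, here is a rough sketch of one way it could work. It is not taken from the Nutch code base: the class name TimedParse, the method runWithTimeout, and the fallback value are all hypothetical, and it assumes a JVM that ships java.util.concurrent. The parse call is wrapped in a task, handed to a worker thread, and the caller waits at most a fixed time before interrupting the worker and moving on.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Nutch or PDFBox.
final class TimedParse {

    /**
     * Runs the task on a worker thread and waits at most timeoutMillis.
     * Returns the fallback value if the task times out or throws.
     */
    static <T> T runWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true); // a stuck parser must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker thread
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the task itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A task that finishes in time, and one that stands in for a hung parser.
        String fast = runWithTimeout(() -> "parsed", 1000, "timed out");
        String slow = runWithTimeout(() -> {
            Thread.sleep(5000);
            return "parsed";
        }, 100, "timed out");
        System.out.println(fast);
        System.out.println(slow);
    }
}
```

One caveat: an interrupt only stops code that checks for it or blocks in an interruptible call. A parser spinning in a tight loop would ignore it, so the worker is marked as a daemon thread here; the caller gets control back, but the stuck thread may keep burning CPU until the process exits.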
