My general experience is that PDF parsing is Nutch's Achilles' heel.
Nutch works fine on older computers, but with the combination of parse-(text|html|pdf) enabled and http.content.limit = -1 (needed to get PDF parsing to work), Nutch sometimes freezes completely.
Are any improvements to PDF parsing planned for the next version of Nutch (0.8)?
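
For reference, the two settings mentioned above would look roughly like this in conf/nutch-site.xml. This is only a sketch: the parse-(text|html|pdf) part and the -1 limit are the settings described here, while the other plugin names in the plugin.includes value are an assumption and should be copied from your own nutch-default.xml.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Enable the PDF parser alongside the text and HTML parsers.
       Only the parse-(text|html|pdf) part is the setting in question;
       the remaining plugin names are assumed, not authoritative. -->
  <property>
    <name>plugin.includes</name>
    <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
  </property>

  <!-- -1 disables truncation of fetched content, so whole PDF files
       are downloaded and handed to the parser. -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
```

With no limit on content size, every fetched document is passed to the parser in full, whatever its size.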
