I bet this is the same old problem with a new name:
http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg02673.html
I think it's caused by http.content.limit = -1; I ran some tests with
the default value and the fetch completed fine.
Does anyone know whether PDF parsing in 0.8-dev works with any value other than -1?
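If that's the cause, a possible workaround (just an untested sketch; 65536 is
the stock default, so any bounded value should behave the same way) would be
to cap the content size in nutch-site.xml instead of disabling the limit:

<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>Truncate downloaded content at 64 kB instead of
  disabling the limit entirely.</description>
</property>

The downside is that large PDFs get truncated and may fail to parse, but at
least the fetcher should not hang on huge downloads.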
Michael Nebel wrote:
Hi,
I can reproduce the problem with the latest version out of svn. :-( I
played around a little bit (most of the day, in fact :-) and after
increasing these parameters
<property>
  <name>mapred.task.timeout</name>
  <value>6000000</value>
  <description>The number of milliseconds before a task will be
  terminated if it neither reads an input, writes an output, nor
  updates its status string.</description>
</property>

<property>
  <name>mapred.child.heap.size</name>
  <value>2000m</value>
  <description>The heap size (-Xmx) that will be used for task
  tracker child processes.</description>
</property>
the error seems to disappear. But I don't understand why; it's just
guessing in the dark.
Michael
Håvard W. Kongsgård wrote:
Hi, I have a problem with last Friday's nightly build. When I try to
fetch my segment, the fetch process freezes with "Aborting with 10 hung
threads".
After it fails, Nutch retries the same URLs on another tasktracker, but
it fails again.
I have tried turning fetcher.parse off and switching between
protocol-httpclient and protocol-http.
nutch-site.xml
<property>
  <name>fs.default.name</name>
  <value>linux3:50000</value>
  <description>The name of the default file system. Either the
  literal string "local" or a host:port for NDFS.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>linux3:50020</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.</description>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|msword)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need to include at least the nutch-extensionpoints
  plugin. By default Nutch includes crawling just HTML and plain
  text via HTTP, and basic indexing and search plugins.</description>
</property>

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will
  be truncated; otherwise, no truncation at all.</description>
</property>

<property>
  <name>fetcher.parse</name>
  <value>false</value>
  <description>If true, fetcher will parse content.</description>
</property>
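One more note: with fetcher.parse set to false, the fetched segment still
has to be parsed in a separate step before indexing; if I remember right,
it's something like this (the segment path is just a placeholder):

bin/nutch parse <segment>

Running the parse separately also makes it easier to see whether the hang
happens during fetching or during parsing.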