I bet this is the same old problem with a new name:
http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg02673.html
I think it's caused by http.content.limit being set to -1. I ran some tests
with the standard value and the fetch was fine.
Does anyone know whether the PDF parser in 0.8-dev works with any value
other than -1?
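For what it's worth, a minimal test override in nutch-site.xml looks
something like this (a sketch; 65536 bytes is, as far as I know, the stock
default from nutch-default.xml, and any nonnegative value enables
truncation):
<!-- Sketch: a finite http.content.limit instead of -1.
     65536 is believed to be the stock default; pick any nonnegative
     value to re-enable truncation for the test. -->
<property>
<name>http.content.limit</name>
<value>65536</value>
</property>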
Michael Nebel wrote:
Hi,
I can reproduce the problem with the latest version out of svn. :-( I
played around a little bit (most of the day, in fact :-) and after
increasing the parameters
<property>
<name>mapred.task.timeout</name>
<value>6000000</value>
<description>The number of milliseconds before a task will be
terminated if it neither reads an input, writes an output, nor
updates its status string.
</description>
</property>
<property>
<name>mapred.child.heap.size</name>
<value>2000m</value>
<description>The heap size (-Xmx) that will be used for task
tracker child processes.</description>
</property>
the error seems to disappear. But I don't understand why; it's just
guessing in the dark.
Michael
Håvard W. Kongsgård wrote:
Hi, I have a problem with last Friday's nightly build. When I try to
fetch my segment, the fetch process freezes with "Aborting with 10 hung
threads".
After failing, Nutch tries to run the same URLs on another tasktracker,
but fails again.
I have tried turning fetcher.parse off and switching between
protocol-httpclient and protocol-http, as sketched below.
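(Swapping the protocol plugin is just an edit to plugin.includes; here is a
sketch with protocol-http substituted for protocol-httpclient, the rest of
the list matching my config below:)
<!-- Sketch: plugin.includes with protocol-http in place of
     protocol-httpclient; everything else unchanged. -->
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msword)|index-basic|query-(basic|site|url)</value>
</property>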
nutch-site.xml
<property>
<name>fs.default.name</name>
<value>linux3:50000</value>
<description>The name of the default file system. Either the
literal string "local" or a host:port for NDFS.</description>
</property>
<property>
<name>mapred.job.tracker</name>
<value>linux3:50020</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|msword)|index-basic|query-(basic|site|url)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need to at least include the nutch-extensionpoints
plugin. By default Nutch includes crawling just HTML and plain text
via HTTP, and basic indexing and search plugins.
</description>
</property>
<property>
<name>http.content.limit</name>
<value>-1</value>
<description>The length limit for downloaded content, in bytes.
If this value is nonnegative (>=0), content longer than it will be
truncated; otherwise, no truncation at all.
</description>
</property>
<property>
<name>fetcher.parse</name>
<value>false</value>
<description>If true, fetcher will parse content.</description>
</property>