I bet this is the same old problem with a new name:
http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg02673.html
I think it's caused by http.content.limit being set to -1. I ran some tests
with the standard value and the fetch was fine.
Does anyone know whether the PDF parser in 0.8-dev works with any value
other than -1?
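For what it's worth, a minimal test override in nutch-site.xml looks
something like this (a sketch; 65536 bytes is, as far as I know, the stock
default from nutch-default.xml, and any nonnegative value enables
truncation):
<!-- Sketch: a finite http.content.limit instead of -1.
     65536 is believed to be the stock default; pick any nonnegative
     value to re-enable truncation for the test. -->
<property>
<name>http.content.limit</name>
<value>65536</value>
</property>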
Michael Nebel wrote:
Hi,
I can reproduce the problem with the latest version out of svn. :-( I
played around a little bit (most of the day, in fact :-) and after
increasing the parameters
<property>
<name>mapred.task.timeout</name>
<value>6000000</value>
<description>The number of milliseconds before a task will be
terminated if it neither reads an input, writes an output, nor
updates its status string.
</description>
</property>
<property>
<name>mapred.child.heap.size</name>
<value>2000m</value>
<description>The heap size (-Xmx) that will be used for task
tracker child processes.</description>
</property>
the error seems to disappear. But I don't understand why; it's just
guessing in the dark.
Michael
Håvard W. Kongsgård wrote:
Hi, I have a problem with last Friday's nightly build. When I try to
fetch my segment, the fetch process freezes with "Aborting with 10 hung
threads".
After failing, Nutch tries to run the same URLs on another tasktracker,
but fails again.
I have tried turning fetcher.parse off and switching between
protocol-httpclient and protocol-http, as sketched below.
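(Swapping the protocol plugin is just an edit to plugin.includes; here is a
sketch with protocol-http substituted for protocol-httpclient, the rest of
the list matching my config below:)
<!-- Sketch: plugin.includes with protocol-http in place of
     protocol-httpclient; everything else unchanged. -->
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msword)|index-basic|query-(basic|site|url)</value>
</property>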
nutch-site.xml
<property>
<name>fs.default.name</name>
<value>linux3:50000</value>
<description>The name of the default file system. Either the
literal string "local" or a host:port for NDFS.</description>
</property>
<property>
<name>mapred.job.tracker</name>
<value>linux3:50020</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|msword)|index-basic|query-(basic|site|url)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need to at least include the nutch-extensionpoints
plugin. By default Nutch includes crawling just HTML and plain text
via HTTP, and basic indexing and search plugins.
</description>
</property>
<property>
<name>http.content.limit</name>
<value>-1</value>
<description>The length limit for downloaded content, in bytes.
If this value is nonnegative (>=0), content longer than it will be
truncated; otherwise, no truncation at all.
</description>
</property>
<property>
<name>fetcher.parse</name>
<value>false</value>
<description>If true, fetcher will parse content.</description>
</property>