I bet this is the same old problem under a new name: http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg02673.html. I think it's caused by http.content.limit being set to -1; I ran some tests with the standard value and the fetch was fine.
Does anyone know whether PDF parsing in 0.8-dev works with a value other than -1?
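For comparison, the bounded default shipped in nutch-default.xml looks like the following (65536 is the value I recall from 0.8-dev; treat the exact number as an assumption and check your own checkout):

```xml
<!-- Assumed default from nutch-default.xml; verify against your checkout. -->
<property>
 <name>http.content.limit</name>
 <value>65536</value>
 <description>The length limit for downloaded content, in bytes.</description>
</property>
```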



Michael Nebel wrote:

Hi,

I can reproduce the problem with the latest version out of svn. :-( I played around a little (most of the day, in fact :-), and after increasing the parameters

    <property>
      <name>mapred.task.timeout</name>
      <value>6000000</value>
      <description>The number of milliseconds before a task will be
      terminated if it neither reads an input, writes an output, nor
      updates its status string.
      </description>
    </property>


    <property>
      <name>mapred.child.heap.size</name>
      <value>2000m</value>
      <description>The heap size (-Xmx) that will be used for task tracker child processes.</description>
    </property>

the error seems to disappear, but I don't understand why. It's just some "guessing in the dark".

    Michael



Håvard W. Kongsgård wrote:

Hi, I have a problem with last Friday's nightly build. When I try to fetch my segment, the fetch process freezes with "Aborting with 10 hung threads". After failing, Nutch tries to run the same URLs on another tasktracker, but fails again.

I have tried turning fetcher.parse off, and switching between protocol-httpclient and protocol-http.

nutch-site.xml

<property>
 <name>fs.default.name</name>
 <value>linux3:50000</value>
 <description>The name of the default file system.  Either the
 literal string "local" or a host:port for NDFS.</description>
</property>

<property>
 <name>mapred.job.tracker</name>
 <value>linux3:50020</value>
 <description>The host and port that the MapReduce job tracker runs
 at.  If "local", then jobs are run in-process as a single map
 and reduce task.
 </description>
</property>

<property>
 <name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|msword)|index-basic|query-(basic|site|url)</value>
 <description>Regular expression naming plugin directory names to
 include.  Any plugin not matching this expression is excluded.
 In any case you need to at least include the nutch-extensionpoints plugin. By
 default Nutch includes crawling just HTML and plain text via HTTP,
 and basic indexing and search plugins.
 </description>
</property>

<property>
 <name>http.content.limit</name>
 <value>-1</value>
 <description>The length limit for downloaded content, in bytes.
 If this value is nonnegative (>=0), content longer than it will be truncated;
 otherwise, no truncation is performed.
 </description>
</property>

<property>
 <name>fetcher.parse</name>
 <value>false</value>
 <description>If true, fetcher will parse content.</description>
</property>
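For what it's worth, the truncation rule described in the http.content.limit property above can be sketched like this (a toy illustration of the semantics, not Nutch's actual code):

```python
def truncate_content(content: bytes, limit: int) -> bytes:
    """Mimic http.content.limit: a non-negative limit truncates
    content to at most `limit` bytes; a negative limit such as -1
    disables truncation entirely."""
    if limit >= 0 and len(content) > limit:
        return content[:limit]
    return content

print(len(truncate_content(b"x" * 100, 10)))   # bounded limit -> 10
print(len(truncate_content(b"x" * 100, -1)))   # limit of -1   -> 100
```

With -1 the fetcher reads the entire body, so a very large PDF could plausibly keep a thread busy past the task timeout, which may be why the two settings interact here.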






_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
