I bet this is the same old problem under a new name: http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg02673.html I think it's caused by http.content.limit being set to -1. I ran some tests with the default value and the fetch was fine.
Does anyone know whether the PDF parser in 0.8-dev works with any value other than -1?
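
As a quick test, one could cap the download size instead of disabling truncation entirely. The 1 MB value below is only an example I picked for illustration, not a recommended setting:

```xml
<!-- Hypothetical test setting for nutch-site.xml: cap downloads at 1 MB
     instead of disabling truncation with -1. Very large PDFs will be
     truncated and may then fail to parse, but this should show whether
     the hung-thread problem goes away once a limit is in place. -->
<property>
 <name>http.content.limit</name>
 <value>1048576</value>
 <description>The length limit for downloaded content, in bytes.</description>
</property>
```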



Michael Nebel wrote:

Hi,

I can reproduce the problem with the latest version out of svn. :-( I played around with it a bit (most of the day, in fact :-) and after increasing the following parameters

    <property>
      <name>mapred.task.timeout</name>
      <value>6000000</value>
      <description>The number of milliseconds before a task will be
      terminated if it neither reads an input, writes an output, nor
      updates its status string.
      </description>
    </property>


    <property>
      <name>mapred.child.heap.size</name>
      <value>2000m</value>
      <description>The heap size (-Xmx) that will be used for task
      tracker child processes.</description>
    </property>

the error seems to disappear. But I don't understand why. It's just some "guessing in the dark".

    Michael



Håvard W. Kongsgård wrote:

Hi, I have a problem with last Friday's nightly build. When I try to fetch my segment, the fetch process freezes with "Aborting with 10 hung threads". After failing, Nutch tries to run the same URLs on another tasktracker, but it fails again.

I have tried turning fetcher.parse off and switching between protocol-httpclient and protocol-http.

nutch-site.xml

<property>
 <name>fs.default.name</name>
 <value>linux3:50000</value>
 <description>The name of the default file system.  Either the
 literal string "local" or a host:port for NDFS.</description>
</property>

<property>
 <name>mapred.job.tracker</name>
 <value>linux3:50020</value>
 <description>The host and port that the MapReduce job tracker runs
 at.  If "local", then jobs are run in-process as a single map
 and reduce task.
 </description>
</property>

<property>
 <name>plugin.includes</name>
 <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|msword)|index-basic|query-(basic|site|url)</value>
 <description>Regular expression naming plugin directory names to
 include.  Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
 default Nutch includes crawling just HTML and plain text via HTTP,
 and basic indexing and search plugins.
 </description>
</property>

<property>
 <name>http.content.limit</name>
 <value>-1</value>
 <description>The length limit for downloaded content, in bytes.
If this value is nonnegative (>=0), content longer than it will be truncated;
 otherwise, no truncation at all.
 </description>
</property>

<property>
 <name>fetcher.parse</name>
 <value>false</value>
 <description>If true, fetcher will parse content.</description>
</property>



