On Mar 3, 2009, at 10:32 PM, Jasper Kamperman wrote:

There is a way to tell Nutch to look at only the beginning of a file. It's this section in your config.xml:

<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  </description>
</property>

This is from the nutch-default.xml in 0.9; I don't know whether it has changed in 1.0.

This might also depend on what type of files you're trying to index. We ended up using -1 for unlimited after running into some 15MB PDF files. The PDF parser would barf if it didn't get the whole file. This was with 0.9, don't know if 1.0 includes
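
For reference, a rough sketch of the -1 override, assuming it goes in conf/nutch-site.xml (the usual place for site-specific overrides of nutch-default.xml) and that the files are fetched via the file: protocol. For content fetched over HTTP, the analogous property would be http.content.limit.

```xml
<!-- Assumed location: conf/nutch-site.xml, inside the <configuration> element -->
<property>
  <name>file.content.limit</name>
  <!-- Negative value: no truncation, so the parser receives the whole file -->
  <value>-1</value>
</property>
```

The trade-off is that nothing is truncated any more, so every file is read in full no matter how large it is.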

Eric

--
Eric J. Christeson <[email protected]>
Enterprise Computing and Infrastructure    (701) 231-8693 (Voice)
North Dakota State University
