Re: what is needed to index for about 10000 domains

Eric J. Christeson Wed, 04 Mar 2009 08:32:21 -0800


On Mar 3, 2009, at 10:32 PM, Jasper Kamperman wrote:

There is a way to tell Nutch to look at only the beginning of a file; it's this section in your config.xml:
<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  </description>
</property>
This is from the nutch-default.xml in 0.9; I don't know whether it has changed in 1.0.

This might also depend on what type of files you're trying to index. We ended up using -1 (unlimited) after running into some 15 MB PDF files: the PDF parser would barf if it didn't get the whole file. This was with 0.9; don't know if 1.0 includes
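
For illustration, a minimal sketch of the -1 setting described above. It assumes the override goes in conf/nutch-site.xml (the usual place for site-specific values that take precedence over nutch-default.xml) and reuses the property name from the quoted snippet. If the files are fetched over HTTP rather than from the local filesystem, the analogous http.content.limit property is probably the one to change instead; check which protocol plugin your crawl actually uses.

```xml
<!-- Sketch of an override in conf/nutch-site.xml (assumed location). -->
<configuration>
  <property>
    <name>file.content.limit</name>
    <!-- Any negative value disables truncation, so parsers see the whole file. -->
    <value>-1</value>
    <description>No length limit, so large PDFs reach the parser intact.
    </description>
  </property>
</configuration>
```

The trade-off is that every large file is downloaded and held in full, which may matter for fetch time and memory when crawling on the order of 10000 domains.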


Eric

--

Eric J. Christeson <[email protected]>

Enterprise Computing and Infrastructure    (701) 231-8693 (Voice)
North Dakota State University
