Hi Tim,

I would think that this parameter is related to the problem you describe, but 
the default value should allow indexing pages of the size you mention. Did you 
change this parameter?



  The maximum number of tokens that will be indexed for a single field
  in a document. This limits the amount of memory required for
  indexing, so that collections with very large files will not crash
  the indexing process by running out of memory.

  Note that this effectively truncates large documents, excluding
  from the index tokens that occur further in the document. If you
  know your source documents are large, be sure to set this value
  high enough to accommodate the expected size. If you set it to
  -1, then the only limit is your memory, but you should anticipate
  an OutOfMemoryError.
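
As a rough illustration of the truncation described above, here is a minimal
sketch. It is not the actual Nutch or Lucene code: the class and method names
are made up for this example, and it splits on whitespace rather than using a
real analyzer. It only shows why a term that occurs after the token limit
cannot be found, and why moving the text up the page makes it visible again.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Nutch or Lucene.
final class TokenLimitSketch {

    // Returns the tokens that would be kept for one field when at most
    // maxTokens tokens are indexed. A negative maxTokens means "no limit".
    static List<String> indexedTokens(String text, int maxTokens) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only when the input text is blank
            }
            if (maxTokens >= 0 && kept.size() >= maxTokens) {
                break; // everything after the limit is dropped
            }
            kept.add(token);
        }
        return kept;
    }

    // True if the term survives truncation and so could be matched.
    static boolean isSearchable(String text, String term, int maxTokens) {
        return indexedTokens(text, maxTokens).contains(term);
    }
}
```

With a limit of 2, the page "alpha beta gamma delta" keeps only "alpha" and
"beta", so "delta" is not searchable; with -1 every token is kept.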

> -----Original Message-----
> From: Tim Redding [mailto:tim.redd...@tribalddb.co.uk]
> Sent: Thursday, 22 April 2010 2:18 AM
> To: nutch-user@lucene.apache.org
> Subject: Is there some arbitrary limit on content stored for use by
> summaries?
> Hey,
> We have a long page that appears in the search results but the summary
> never contains the search terms.  Why is this?
> If we move the text containing the search terms up the page they get
> displayed in the summary so it's obviously related to some limit
> imposed
> somewhere.  I've looked through all the configuration options and none
> appear to change anything that sounds related to this.
> We use Nutch 1.0 and the page in question is 8.7KB in size.
> Any help please?
> Tim..
> Tim Redding
> Senior Java Developer
> Tribal DDB
> 12 Bishop's Bridge Road
> London W2 6AA
> T: +44 (0)20 7258 4517  I  F: +44 (0)20 7258 4253
> Tribal DDB, a division of DDB UK Limited, Company No. 00933578, with
> its registered office situated at 12 Bishops Bridge Road, London W2
> 6AA.
