Try refetching with a different value for file.content.limit:

<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>
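
As a rough sketch of how such an override might look (not a confirmed fix for the summary problem), the fragment below assumes the usual Nutch convention of overriding defaults in conf/nutch-site.xml. The value -1 for file.content.limit follows from the description above (a negative value means no truncation); the indexer.max.tokens value is simply Integer.MAX_VALUE, as Tim reports trying. Content limits apply at fetch time, so the page has to be refetched for the change to show.

```xml
<?xml version="1.0"?>
<!-- Sketch of a conf/nutch-site.xml override; the values are illustrative assumptions. -->
<configuration>
  <property>
    <name>file.content.limit</name>
    <!-- Negative value: no truncation of downloaded content. -->
    <value>-1</value>
  </property>
  <property>
    <name>indexer.max.tokens</name>
    <!-- Integer.MAX_VALUE, the value already tried in this thread. -->
    <value>2147483647</value>
  </property>
</configuration>
```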

Julien

On 22 April 2010 18:44, Tim Redding <tim.redd...@tribalddb.co.uk> wrote:

> Hey Arkadi,
>
> I've tried upping the value to Integer.MAX_VALUE but it still doesn't
> show a relevant summary. :-(
>
> Any other ideas?
>
> Tim..
>
> -----Original Message-----
> From: arkadi.kosmy...@csiro.au [mailto:arkadi.kosmy...@csiro.au]
> Sent: 21 April 2010 23:29
> To: nutch-user@lucene.apache.org
> Subject: RE: Is there some arbitrary limit on content stored for use by
> summaries?
>
> Hi Tim,
>
> I would think that this parameter is related to the problem you
> describe, but the default value should allow indexing pages of the size
> you mention. Did you change this parameter?
>
> Regards,
>
> Arkadi
>
> <property>
>  <name>indexer.max.tokens</name>
>  <value>10000</value>
>  <description>
>  The maximum number of tokens that will be indexed for a single field
>  in a document. This limits the amount of memory required for
>  indexing, so that collections with very large files will not crash
>  the indexing process by running out of memory.
>
>  Note that this effectively truncates large documents, excluding
>  from the index tokens that occur further in the document. If you
>  know your source documents are large, be sure to set this value
>  high enough to accommodate the expected size. If you set it to
>  -1, then the only limit is your memory, but you should anticipate
>  an OutOfMemoryError.
>  </description>
> </property>
>
> > -----Original Message-----
> > From: Tim Redding [mailto:tim.redd...@tribalddb.co.uk]
> > Sent: Thursday, 22 April 2010 2:18 AM
> > To: nutch-user@lucene.apache.org
> > Subject: Is there some arbitrary limit on content stored for use by
> > summaries?
> >
> > Hey,
> >
> > We have a long page that appears in the search results but the summary
> > never contains the search terms.  Why is this?
> >
> > If we move the text containing the search terms up the page they get
> > displayed in the summary, so it's obviously related to some limit
> > imposed somewhere.  I've looked through all the configuration options
> > and none appear to change anything that sounds related to this.
> >
> > We use Nutch 1.0 and the page in question is 8.7KB in size.
> >
> >
> > Any help please?
> >
> >
> > Tim..
> >
> >
> >
> >
> >
> >
> > Tim Redding
> > Senior Java Developer
> > Tribal DDB
> > 12 Bishop's Bridge Road
> > London W2 6AA
> > T: +44 (0)20 7258 4517  I  F: +44 (0)20 7258 4253
> >
> >
> > Tribal DDB, a division of DDB UK Limited, Company No. 00933578, with
> > its registered office situated at 12 Bishops Bridge Road, London W2
> > 6AA.
> > ______________________________________________________________________
> > This e-mail is intended only for the named person or entity to which it
> > is addressed and contains valuable business information that is
> > privileged, confidential and/or otherwise protected from disclosure.
> > Dissemination, distribution or copying of this e-mail or the
> > information herein by anyone other than the intended recipient, or an
> > employee, or agent responsible for delivering the message to the
> > intended recipient, is strictly prohibited. All contents are the
> > copyright property of the sender. If you are not the intended
> > recipient, you are nevertheless bound to respect the sender's worldwide
> > legal rights. We require that unintended recipients delete the e-mail
> > and destroy all electronic copies in their system, retaining no copies
> > in any media.
> > ______________________________________________________________________
> > This email has been scanned by the MessageLabs Email Security System.
> > For more information please visit http://www.messagelabs.com/email
>
>



-- 
DigitalPebble Ltd
http://www.digitalpebble.com
