Yeah I've played with that.  My current value for file.content.limit is
1000000000.  That's considerably longer than the page I'm having
problems with.

It's fast approaching the time when I'll have to split the page into
lots of smaller pages. :-(  Thankfully we own the site that we use Nutch
on, as that appears to be the only solution to this summary issue.


Tim.. 

-----Original Message-----
From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com] 
Sent: 22 April 2010 21:56
To: nutch-user@lucene.apache.org
Subject: Re: Is there some arbitrary limit on content stored for use by
summaries?

Try refetching with a different value for:

<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>

Julien

On 22 April 2010 18:44, Tim Redding <tim.redd...@tribalddb.co.uk> wrote:

> Hey Arkadi,
>
> I've tried upping the value to Integer.MAX_VALUE but it still doesn't
> show a relevant summary. :-(
>
> Any other ideas?
>
>
>
> Tim..
>
>
>
>
>
>
> -----Original Message-----
> From: arkadi.kosmy...@csiro.au [mailto:arkadi.kosmy...@csiro.au]
> Sent: 21 April 2010 23:29
> To: nutch-user@lucene.apache.org
> Subject: RE: Is there some arbitrary limit on content stored for use
> by summaries?
>
> Hi Tim,
>
> I would think that this parameter is related to the problem you
> describe, but the default value should allow indexing pages of the
> size you mention. Did you change this parameter?
>
> Regards,
>
> Arkadi
>
> <property>
>  <name>indexer.max.tokens</name>
>  <value>10000</value>
>  <description>
>  The maximum number of tokens that will be indexed for a single field
>  in a document. This limits the amount of memory required for
>  indexing, so that collections with very large files will not crash
>  the indexing process by running out of memory.
>
>  Note that this effectively truncates large documents, excluding
>  from the index tokens that occur further in the document. If you
>  know your source documents are large, be sure to set this value
>  high enough to accommodate the expected size. If you set it to
>  -1, then the only limit is your memory, but you should anticipate
>  an OutOfMemoryError.
>  </description>
> </property>
>
> > -----Original Message-----
> > From: Tim Redding [mailto:tim.redd...@tribalddb.co.uk]
> > Sent: Thursday, 22 April 2010 2:18 AM
> > To: nutch-user@lucene.apache.org
> > Subject: Is there some arbitrary limit on content stored for use by
> > summaries?
> >
> > Hey,
> >
> > We have a long page that appears in the search results but the
> > summary never contains the search terms.  Why is this?
> >
> > If we move the text containing the search terms up the page they get
> > displayed in the summary so it's obviously related to some limit
> > imposed somewhere.  I've looked through all the configuration options
> > and none appear to change anything that sounds related to this.
> >
> > We use Nutch 1.0 and the page in question is 8.7KB in size.
> >
> >
> > Any help please?
> >
> >
> > Tim..
> >
> >
> >
> >
> >
> >
> > Tim Redding
> > Senior Java Developer
> > Tribal DDB
> > 12 Bishop's Bridge Road
> > London W2 6AA
> > T: +44 (0)20 7258 4517  I  F: +44 (0)20 7258 4253
> >
> >
> > Tribal DDB, a division of DDB UK Limited, Company No. 00933578, with
> > its registered office situated at 12 Bishops Bridge Road, London W2
> > 6AA.
> >
> > ______________________________________________________________________
> > This e-mail is intended only for the named person or entity to which
> > it is addressed and contains valuable business information that is
> > privileged, confidential and/or otherwise protected from disclosure.
> > Dissemination, distribution or copying of this e-mail or the
> > information herein by anyone other than the intended recipient, or an
> > employee, or agent responsible for delivering the message to the
> > intended recipient, is strictly prohibited. All contents are the
> > copyright property of the sender. If you are not the intended
> > recipient, you are nevertheless bound to respect the sender's
> > worldwide legal rights. We require that unintended recipients delete
> > the e-mail and destroy all electronic copies in their system,
> > retaining no copies in any media.
> > ______________________________________________________________________
> > This email has been scanned by the MessageLabs Email Security System.
> > For more information please visit http://www.messagelabs.com/email
>



-- 
DigitalPebble Ltd
http://www.digitalpebble.com


