Try refetching with a different value for:

<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>

Julien

On 22 April 2010 18:44, Tim Redding <tim.redd...@tribalddb.co.uk> wrote:

> Hey Arkadi,
>
> I've tried upping the value to Integer.MAX_VALUE but it still doesn't
> show a relevant summary. :-(
>
> Any other ideas?
>
> Tim..
>
> -----Original Message-----
> From: arkadi.kosmy...@csiro.au [mailto:arkadi.kosmy...@csiro.au]
> Sent: 21 April 2010 23:29
> To: nutch-user@lucene.apache.org
> Subject: RE: Is there some arbitrary limit on content stored for use by
> summaries?
>
> Hi Tim,
>
> I would think that this parameter is related to the problem you
> describe, but the default value should allow indexing pages of the size
> you mention. Did you change this parameter?
>
> Regards,
>
> Arkadi
>
> <property>
>   <name>indexer.max.tokens</name>
>   <value>10000</value>
>   <description>
>   The maximum number of tokens that will be indexed for a single field
>   in a document. This limits the amount of memory required for
>   indexing, so that collections with very large files will not crash
>   the indexing process by running out of memory.
>
>   Note that this effectively truncates large documents, excluding
>   from the index tokens that occur further in the document. If you
>   know your source documents are large, be sure to set this value
>   high enough to accommodate the expected size. If you set it to
>   -1, then the only limit is your memory, but you should anticipate
>   an OutOfMemoryError.
>   </description>
> </property>
>
> > -----Original Message-----
> > From: Tim Redding [mailto:tim.redd...@tribalddb.co.uk]
> > Sent: Thursday, 22 April 2010 2:18 AM
> > To: nutch-user@lucene.apache.org
> > Subject: Is there some arbitrary limit on content stored for use by
> > summaries?
> >
> > Hey,
> >
> > We have a long page that appears in the search results but the summary
> > never contains the search terms. Why is this?
> >
> > If we move the text containing the search terms up the page they get
> > displayed in the summary so it's obviously related to some limit
> > imposed somewhere. I've looked through all the configuration options
> > and none appear to change anything that sounds related to this.
> >
> > We use Nutch 1.0 and the page in question is 8.7KB in size.
> >
> > Any help please?
> >
> > Tim..
> >
> > Tim Redding
> > Senior Java Developer
> > Tribal DDB
> > 12 Bishop's Bridge Road
> > London W2 6AA
> > T: +44 (0)20 7258 4517 I F: +44 (0)20 7258 4253
> >
> > Tribal DDB, a division of DDB UK Limited, Company No. 00933578, with
> > its registered office situated at 12 Bishops Bridge Road, London W2
> > 6AA.
> > ______________________________________________________________________
> > This e-mail is intended only for the named person or entity to which it
> > is addressed and contains valuable business information that is
> > privileged, confidential and/or otherwise protected from disclosure.
> > Dissemination, distribution or copying of this e-mail or the
> > information herein by anyone other than the intended recipient, or an
> > employee, or agent responsible for delivering the message to the
> > intended recipient, is strictly prohibited. All contents are the
> > copyright property of the sender. If you are not the intended
> > recipient, you are nevertheless bound to respect the sender's worldwide
> > legal rights. We require that unintended recipients delete the e-mail
> > and destroy all electronic copies in their system, retaining no copies
> > in any media.
> > ______________________________________________________________________
> > This email has been scanned by the MessageLabs Email Security System.
> > For more information please visit http://www.messagelabs.com/email
> > ______________________________________________________________________

-- 
DigitalPebble Ltd
http://www.digitalpebble.com
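
For reference, the properties discussed in this thread would normally be overridden in conf/nutch-site.xml rather than edited in nutch-default.xml. What follows is only a minimal sketch under stated assumptions, not a confirmed fix for the summary problem. The value -1 is illustrative. http.content.limit does not appear in the thread; it is included on the assumption that the page is fetched over HTTP rather than through the file protocol.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: values set here take precedence over nutch-default.xml. -->
<configuration>

  <!-- Limit suggested in the thread; applies to the file protocol.
       A negative value disables truncation (illustrative choice). -->
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>

  <!-- Assumption: the page is served over HTTP, in which case this
       property, not file.content.limit, governs truncation. -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>

  <!-- Indexing-side limit quoted in the thread; per its description,
       -1 removes the cap at the risk of an OutOfMemoryError. -->
  <property>
    <name>indexer.max.tokens</name>
    <value>-1</value>
  </property>

</configuration>
```

A changed content limit can only affect pages fetched after the change, hence the advice to refetch. indexer.max.tokens is an indexing-time setting, so it would presumably require reindexing as well. Since the page in question is 8.7KB, well under the 65536-byte default, these limits may turn out not to be the cause.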