Doug,

I appreciate the feedback.  We have altered the Nutch implementation to add
a field for later retrieval.  The field holds information we use to sort
the data differently from Nutch's out-of-the-box ordering.  Is it possible
to do the following:

1.  We would alter the code to grab only the document ids of the top n
documents (500, 1000, 10,000).  We could do this by using NutchBean instead,
or by modifying OpenSearchServlet to generate summaries conditionally.

2.  We would then look up the field we added for each of those document ids
and sort the n documents using our own sorting mechanism.

3.  We would then generate the summaries of the first 10 documents based on
the newly sorted list of document ids.
 
4.  We would then display those 10 results with the summaries.
 
5.  When a user clicks to go to the next 10 results we would already have
the next 10 ids stored somewhere and could generate the summaries for those
ids without having to look everything up again.
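
To make the steps above concrete, here is a rough sketch of what I have in
mind.  It is plain Java and deliberately does not call any Nutch classes:
SortedPager, DocRef, sortKey and the summarizer function are all made-up
names standing in for our added field and for whatever call ends up
producing a summary for a document id.  It also assumes that a higher sort
value should come first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch only: none of these names are Nutch APIs. */
class SortedPager {

    /** Stand-in for a lightweight hit: a document id plus our added sort field. */
    static final class DocRef {
        final int docId;
        final float sortKey;

        DocRef(int docId, float sortKey) {
            this.docId = docId;
            this.sortKey = sortKey;
        }
    }

    // Re-sorted once, then kept around (for example in the user's session)
    // so that later pages do not need a new search.
    private final List<DocRef> sorted;
    private final int pageSize;

    SortedPager(List<DocRef> topN, int pageSize) {
        this.sorted = new ArrayList<>(topN);
        // Steps 1 and 2: take the n ids and sort them by our own field,
        // highest value first (an assumption about the desired order).
        this.sorted.sort(
            Comparator.comparingDouble((DocRef d) -> d.sortKey).reversed());
        this.pageSize = pageSize;
    }

    /**
     * Steps 3 to 5: build summaries only for the ids on the requested page.
     * pageIndex is zero-based and assumed to be non-negative.
     */
    List<String> page(int pageIndex, Function<Integer, String> summarizer) {
        int from = Math.min(pageIndex * pageSize, sorted.size());
        int to = Math.min(from + pageSize, sorted.size());
        List<String> summaries = new ArrayList<>();
        for (DocRef d : sorted.subList(from, to)) {
            summaries.add(summarizer.apply(d.docId));
        }
        return summaries;
    }
}
```

With a page size of 10, page(0, ...) would generate the first 10 summaries
and page(1, ...) the next 10, so the expensive summary work is only ever
done for the results actually shown.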

What do you think?

Thanks,

Paul

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, August 16, 2005 3:27 PM
To: [email protected]
Subject: Re: Slow Results

What API are you using to get hits, NutchBean or OpenSearchServlet?  If 
you're using OpenSearchServlet, then, with 1000 hits, most of your time 
is probably spent constructing summaries.  Do you need the summaries? 
If not, use NutchBean instead, or modify OpenSearchServlet to not 
generate summaries.  If you only need unique document ids, then perhaps 
you can only fetch the Hit instance for each match.  That would be 
fastest.  If you need titles, urls, etc., then you need HitDetails, 
which are slower to access.  Slowest is summaries.

Doug

Paul Harrison wrote:
> I have crawled some 100 million pages and am running this on five P4 3.0 GHz
> machines with a 40 GB OS drive and two 250 GB data drives.  I am trying to
> get Nutch to grab 1000 results so I can pass them to a separate program I
> have instead of using the Nutch default (100 I think).  As a result it takes
> an enormous amount of time to get results.  So I backed the number of pages
> indexed to 7 million and still having Nutch grab 1000 results instead of the
> default.  While the results were better they are still unusable as it is
> taking between 15 and 20 seconds to complete the task.  Does anyone have any
> idea why Nutch slows down so bad when you have it grab 1000 pages instead of
> the default number?  Does anyone have any suggestions on how to speed this
> process up?  Do I use more machines, upgrade to a newer version of Nutch,
> etc.?
>
> Any help would be MOST appreciated.
>
> Thanks,
>
> Paul
