Doug, I appreciate the feedback. We have altered the Nutch implementation to add a field for later retrieval. The field holds information we use to sort the data differently from what Nutch does out of the box. Is it possible to do the following:
1. We would alter the code to grab only the document ids of n documents (500, 1,000, or 10,000). We could do this by using NutchBean instead, or by modifying OpenSearchServlet to generate summaries conditionally.
2. We would then look up the field we added for each of those document ids and sort the n documents with our own sorting mechanism.
3. We would then generate summaries for only the first 10 documents in the newly sorted list of document ids.
4. We would then display those 10 results with the summaries.
5. When a user clicks to go to the next 10 results, we would already have the next 10 ids stored somewhere and could generate their summaries without having to look everything up again.

What do you think?

Thanks,
Paul

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 16, 2005 3:27 PM
To: [email protected]
Subject: Re: Slow Results

What API are you using to get hits, NutchBean or OpenSearchServlet?

If you're using OpenSearchServlet, then, with 1000 hits, most of your time is probably spent constructing summaries. Do you need the summaries? If not, use NutchBean instead, or modify OpenSearchServlet to not generate summaries.

If you only need unique document ids, then perhaps you can only fetch the Hit instance for each match. That would be fastest. If you need titles, urls, etc., then you need HitDetails, which are slower to access. Slowest is summaries.

Doug

Paul Harrison wrote:
> I have crawled some 100 million pages and am running this on five P4 3.0 GHz
> machines with a 40 GB OS drive and two 250 GB data drives. I am trying to
> get Nutch to grab 1000 results so I can pass them to a separate program I
> have instead of using the Nutch default (100 I think). As a result it takes
> an enormous amount of time to get results. So I backed the number of pages
> indexed to 7 million and still having Nutch grab 1000 results instead of the
> default. While the results were better they are still unusable as it is
> taking between 15 and 20 seconds to complete the task. Does anyone have any
> idea why Nutch slows down so bad when you have it grab 1000 pages instead of
> the default number? Does anyone have any suggestions on how to speed this
> process up? Do I use more machines, upgrade to a newer version of Nutch,
> etc.?
>
> Any help would be MOST appreciated.
>
> Thanks,
>
> Paul
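
The five-step plan at the top of this message could be sketched roughly as follows. This is only an illustration, written in Python for brevity (Nutch itself is Java), and it assumes the document ids, the custom field lookup, and the summarizer are supplied by the caller. `ResortedResults`, `sort_key`, `summarize`, and `fake_summarize` are hypothetical names, not Nutch APIs.

```python
from typing import Callable, Dict, List, Sequence, Tuple


class ResortedResults:
    """Holds a re-sorted list of document ids and summarizes one page at a time."""

    def __init__(
        self,
        doc_ids: Sequence[int],
        sort_key: Callable[[int], float],
        summarize: Callable[[int], str],
        page_size: int = 10,
    ) -> None:
        # Step 2: sort the n ids once by the custom field, highest value first.
        self._sorted_ids: List[int] = sorted(doc_ids, key=sort_key, reverse=True)
        self._summarize = summarize
        self._page_size = page_size
        # Summaries are the slow part, so keep the ones already generated.
        self._summary_cache: Dict[int, str] = {}

    def page(self, page_number: int) -> List[Tuple[int, str]]:
        """Steps 3 to 5: return (doc_id, summary) pairs for a zero-based page."""
        start = page_number * self._page_size
        page_ids = self._sorted_ids[start:start + self._page_size]
        rows = []
        for doc_id in page_ids:
            if doc_id not in self._summary_cache:
                self._summary_cache[doc_id] = self._summarize(doc_id)
            rows.append((doc_id, self._summary_cache[doc_id]))
        return rows


# Stand-in data (hypothetical): a custom field value for each of 1000 ids.
custom_field = {doc_id: (doc_id * 37) % 101 for doc_id in range(1000)}
summary_calls: List[int] = []


def fake_summarize(doc_id: int) -> str:
    # Placeholder for the expensive summary call; records each invocation.
    summary_calls.append(doc_id)
    return f"summary of document {doc_id}"


results = ResortedResults(range(1000), custom_field.__getitem__, fake_summarize)
first_page = results.page(0)   # summarizes only 10 documents
second_page = results.page(1)  # reuses the sorted ids, summarizes 10 more
```

With this shape, only the ids are fetched and sorted up front, and summaries are generated for just the 10 documents on the page being shown. In a servlet the object would have to live somewhere between requests, for example in the user's session; that choice is outside this sketch.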
