Hi Ferenc,

'bin/nutch segread -list' reports the number of entries in the fetcher output, so if the data is not corrupted it should report the total number of entries generated during fetchlist generation.

Luke, on the other hand, reports the number of documents in the Lucene index. That count includes only pages that were processed correctly, so it excludes pages that were not fetched because of errors, pages that were not parsed successfully, etc. This is also the number returned when you search for "http", because only correctly indexed pages are searchable.

Regards,
Piotr

On 5/24/05, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> Dear Chirag and Byron,
>
> Thanks for the suggestion, but I don't have any problems with other
> applications under Tomcat. The problem occurs only with Nutch.
> There is a free version of Resin; is it truly better than Tomcat?
>
> Dear Chirag, you wrote that I should put 1G of memory per 1 million
> pages on the backend.
> How do I calculate the number of pages in the segments?
> If I use the 'bin/nutch segread -list' tool, it says there are 500000
> pages in a segment.
> If I use the 'lukeall.jar' tool, it says there are 420105 records in
> that segment.
> If I use the 'lukeall.jar' undelete function, there are 438000 records
> in the same segment.
> If I use the web search engine and search for 'http', it reports the
> same number as 'lukeall.jar'.
>
> Which number should I use to calculate pages per backend?
>
> I think my solution for the 'paginating' is better than the others
> reported. Any comments?
>
> Thanks, Ferenc
>
