If I were you I would split the 13 million pages into 3 equal or nearly equal parts and distribute them over the backend servers - without going into how many pages are not correctly indexed in these segments. I would assume the non-indexed pages are distributed evenly across all segments. This is all a very rough estimate; if you wanted to go into detail you would have to take into account the average number of tokens per page in each segment, and probably the distribution of tokens across segments.
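
As a rough illustration of that greedy split (not something from the original thread), here is a minimal Java sketch; the per-segment counts below are made up, and in practice would come from 'bin/nutch segread -list' output:

    import java.util.Arrays;
    import java.util.Collections;

    public class SegmentBalancer {
        public static void main(String[] args) {
            // Hypothetical per-segment entry counts; real values would be
            // taken from 'bin/nutch segread -list' output.
            long[] segmentSizes = {500000, 480000, 520000, 450000, 610000, 390000};
            int backends = 3;

            // Sort segments largest-first, then greedily assign each one to
            // the backend with the smallest running total of pages.
            Long[] sorted = new Long[segmentSizes.length];
            for (int i = 0; i < segmentSizes.length; i++) sorted[i] = segmentSizes[i];
            Arrays.sort(sorted, Collections.reverseOrder());

            long[] totals = new long[backends];
            for (Long size : sorted) {
                int smallest = 0;
                for (int b = 1; b < backends; b++)
                    if (totals[b] < totals[smallest]) smallest = b;
                totals[smallest] += size;
            }
            for (int b = 0; b < backends; b++)
                System.out.println("backend " + b + ": " + totals[b] + " pages");
        }
    }

Largest-first greedy assignment keeps the per-backend totals within roughly one segment's size of each other, which is close enough for the rough estimate described above.
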
So to sum up, I would make the rough assumption that all segments have the same distribution of the features that search speed depends on, and try it out by splitting the pages into equal parts. Only if that did not work as expected would I start thinking about how to optimize it.
Regards
Piotr

On 5/24/05, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> Hi Piotr,
>
> Thanks for the answer, but I don't understand how to calculate how many
> segments to put on one backend.
> How do I calculate the page numbers? In my case, segread reports 13
> million pages in the segments, but only 7.5 million when searching for
> 'http'. I have 3 backends, and I would like to balance the segments
> between them.
> On the server I can't use the lukeall tool, because there is no
> graphical interface, and copying all the segments to a local machine to
> view them with lukeall is too much work.
>
> Regards,
> Ferenc
>
> Piotr Kosiorowski wrote:
>
> >Hi Ferenc,
> >
> >'bin/nutch segread -list' reports the number of entries in the fetcher
> >output, so if the data is not corrupted it should report the total
> >number of entries generated during fetchlist generation. Luke, on the
> >other hand, reports the number of documents in the Lucene index, so it
> >counts only pages that were correctly processed; it will not include
> >pages that were not fetched because of errors, pages that were not
> >parsed successfully, etc. That is the number returned when you search
> >for "http", because only correctly indexed pages are searchable.
> >Regards
> >Piotr
> >
> >On 5/24/05, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> >
> >>Dear Chirag and Byron,
> >>
> >>Thanks for the suggestion, but I don't have any problems with other
> >>applications under Tomcat. The problem occurs only with Nutch.
> >>There is a free version of Resin; is it truly better than Tomcat?
> >>
> >>Dear Chirag, you wrote to give a backend 1 GB of memory per 1 million
> >>pages. How do I calculate the number of pages in the segments?
> >>If I use the 'bin/nutch segread -list' tool, it says there are 500000
> >>pages in a segment.
> >>If I use the 'lukeall.jar' tool, it says there are 420105 records in
> >>that segment.
> >>If I use the 'lukeall.jar' undelete function, there are 438000 records
> >>in the same segment.
> >>If I search the web search engine for 'http', the count equals the one
> >>from 'lukeall.jar'.
> >>
> >>Which number should I use to calculate pages per backend?
> >>
> >>I think my solution to the 'paginating' problem is better than the
> >>others reported. Any comments?
> >>
> >>Thanks, Ferenc
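
To make the sizing question in the quoted thread concrete, here is a back-of-the-envelope sketch using the numbers mentioned above (13 million segread entries, 7.5 million searchable, 3 backends) together with the quoted 1 GB per 1 million pages rule of thumb; note that the rule is Chirag's suggestion from the thread, not a documented Nutch requirement:

    public class BackendEstimate {
        public static void main(String[] args) {
            // Numbers from the thread: segread reports ~13 million entries
            // in total, but only ~7.5 million are searchable (indexed).
            long totalEntries = 13000000L;
            long indexedEntries = 7500000L;
            int backends = 3;

            // Piotr's assumption: non-indexed pages are spread evenly, so
            // every segment has the same indexed fraction.
            double indexedFraction = (double) indexedEntries / totalEntries;

            long entriesPerBackend = totalEntries / backends;
            long indexedPerBackend = Math.round(entriesPerBackend * indexedFraction);

            // Quoted rule of thumb: about 1 GB of memory per 1 million
            // pages on a backend (an assumption from the thread, not a fact).
            double memoryGb = indexedPerBackend / 1000000.0;

            System.out.println("entries per backend: " + entriesPerBackend);
            System.out.println("indexed per backend: " + indexedPerBackend);
            System.out.println("suggested memory:    " + memoryGb + " GB");
        }
    }

With these inputs, each backend would hold about 4.33 million segread entries, of which roughly 2.5 million are searchable, suggesting around 2.5 GB of memory per backend under the assumed rule.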
