If I were you I would split the 13 million pages into 3 equal or nearly equal parts and distribute them over the backend servers - without going into how many pages are not correctly indexed in these segments. I would assume the non-indexed pages are distributed evenly across all segments. This is all a very rough estimate; if you wanted to go into detail you would have to take into account the average number of tokens per page in each segment, and probably the distribution of tokens across segments.
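
As a rough illustration of that greedy split (not something from the original thread), here is a minimal Java sketch; the per-segment counts below are made up, and in practice would come from 'bin/nutch segread -list' output:

    import java.util.Arrays;
    import java.util.Collections;

    public class SegmentBalancer {
        public static void main(String[] args) {
            // Hypothetical per-segment entry counts; real values would be
            // taken from 'bin/nutch segread -list' output.
            long[] segmentSizes = {500000, 480000, 520000, 450000, 610000, 390000};
            int backends = 3;

            // Sort segments largest-first, then greedily assign each one to
            // the backend with the smallest running total of pages.
            Long[] sorted = new Long[segmentSizes.length];
            for (int i = 0; i < segmentSizes.length; i++) sorted[i] = segmentSizes[i];
            Arrays.sort(sorted, Collections.reverseOrder());

            long[] totals = new long[backends];
            for (Long size : sorted) {
                int smallest = 0;
                for (int b = 1; b < backends; b++)
                    if (totals[b] < totals[smallest]) smallest = b;
                totals[smallest] += size;
            }
            for (int b = 0; b < backends; b++)
                System.out.println("backend " + b + ": " + totals[b] + " pages");
        }
    }

Largest-first greedy assignment keeps the per-backend totals within roughly one segment's size of each other, which is close enough for the rough estimate described above.
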
So to sum up, I would make the rough assumption that all segments have the same distribution of the features that search speed depends on, and try it out by splitting the pages into equal parts. Only if that did not work as expected would I start thinking about how to optimize it.
Regards
Piotr

On 5/24/05, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> Hi Piotr,
>
> Thanks for the answer, but I don't understand how to calculate how many
> segments to put on one backend.
> How do I calculate the page numbers? In my case, segread reports 13
> million pages in the segments, but only 7.5 million when searching for
> 'http'. I have 3 backends, and I would like to balance the segments
> between them.
> On the server I can't use the lukeall tool, because there is no
> graphical interface, and copying all the segments to a local machine to
> view them with lukeall is too much work.
>
> Regards,
> Ferenc
>
> Piotr Kosiorowski wrote:
>
> >Hi Ferenc,
> >
> >'bin/nutch segread -list' reports the number of entries in the fetcher
> >output, so if the data is not corrupted it should report the total
> >number of entries generated during fetchlist generation. Luke, on the
> >other hand, reports the number of documents in the Lucene index, so it
> >counts only pages that were correctly processed; it will not include
> >pages that were not fetched because of errors, pages that were not
> >parsed successfully, etc. That is the number returned when you search
> >for "http", because only correctly indexed pages are searchable.
> >Regards
> >Piotr
> >
> >On 5/24/05, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> >
> >>Dear Chirag and Byron,
> >>
> >>Thanks for the suggestion, but I don't have any problems with other
> >>applications under Tomcat. The problem occurs only with Nutch.
> >>There is a free version of Resin; is it truly better than Tomcat?
> >>
> >>Dear Chirag, you wrote to give a backend 1 GB of memory per 1 million
> >>pages. How do I calculate the number of pages in the segments?
> >>If I use the 'bin/nutch segread -list' tool, it says there are 500000
> >>pages in a segment.
> >>If I use the 'lukeall.jar' tool, it says there are 420105 records in
> >>that segment.
> >>If I use the 'lukeall.jar' undelete function, there are 438000 records
> >>in the same segment.
> >>If I search the web search engine for 'http', the count equals the one
> >>from 'lukeall.jar'.
> >>
> >>Which number should I use to calculate pages per backend?
> >>
> >>I think my solution to the 'paginating' problem is better than the
> >>others reported. Any comments?
> >>
> >>Thanks, Ferenc
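
To make the sizing question in the quoted thread concrete, here is a back-of-the-envelope sketch using the numbers mentioned above (13 million segread entries, 7.5 million searchable, 3 backends) together with the quoted 1 GB per 1 million pages rule of thumb; note that the rule is Chirag's suggestion from the thread, not a documented Nutch requirement:

    public class BackendEstimate {
        public static void main(String[] args) {
            // Numbers from the thread: segread reports ~13 million entries
            // in total, but only ~7.5 million are searchable (indexed).
            long totalEntries = 13000000L;
            long indexedEntries = 7500000L;
            int backends = 3;

            // Piotr's assumption: non-indexed pages are spread evenly, so
            // every segment has the same indexed fraction.
            double indexedFraction = (double) indexedEntries / totalEntries;

            long entriesPerBackend = totalEntries / backends;
            long indexedPerBackend = Math.round(entriesPerBackend * indexedFraction);

            // Quoted rule of thumb: about 1 GB of memory per 1 million
            // pages on a backend (an assumption from the thread, not a fact).
            double memoryGb = indexedPerBackend / 1000000.0;

            System.out.println("entries per backend: " + entriesPerBackend);
            System.out.println("indexed per backend: " + indexedPerBackend);
            System.out.println("suggested memory:    " + memoryGb + " GB");
        }
    }

With these inputs, each backend would hold about 4.33 million segread entries, of which roughly 2.5 million are searchable, suggesting around 2.5 GB of memory per backend under the assumed rule.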
