This is a good answer. Thanks for it,
Ferenc

Piotr Kosiorowski wrote:

If I were you, I would split the 13 million pages into 3 equal or
nearly equal parts and distribute them over the backend servers,
without worrying about how many pages are not correctly indexed in
these segments. I would assume the non-indexed pages are distributed
evenly across all segments. This is all a very rough estimate; if you
wanted to go into details, you would have to take into account the
average number of tokens per page in each segment, and probably the
distribution of tokens across segments.

So to sum up: I would make the rough assumption that all segments
share the same distribution of the features that search speed depends
on, and try it out by splitting them into equal parts. Only if that
did not work as expected would I start thinking about how to optimize
it.
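
If it helps, here is a rough sketch of such a greedy split in Java:
take the segments largest first and assign each one to the backend
with the smallest running total. The segment names and page counts
below are invented - in practice you would take them from
'bin/nutch segread -list':

    import java.util.*;

    // Balance segments across backends by page count: largest segment
    // first, each onto the currently lightest backend.
    public class SegmentBalancer {
      public static void main(String[] args) {
        Map<String, Long> segments = new LinkedHashMap<String, Long>();
        segments.put("seg-20050524-1", 500000L);   // invented counts
        segments.put("seg-20050524-2", 470000L);
        segments.put("seg-20050524-3", 520000L);
        segments.put("seg-20050524-4", 450000L);

        int backends = 3;
        long[] load = new long[backends];
        List<List<String>> plan = new ArrayList<List<String>>();
        for (int i = 0; i < backends; i++)
          plan.add(new ArrayList<String>());

        // Sort segments by page count, descending.
        List<Map.Entry<String, Long>> order =
            new ArrayList<Map.Entry<String, Long>>(segments.entrySet());
        Collections.sort(order, new Comparator<Map.Entry<String, Long>>() {
          public int compare(Map.Entry<String, Long> a,
                             Map.Entry<String, Long> b) {
            return b.getValue().compareTo(a.getValue());
          }
        });

        // Assign each segment to the least-loaded backend so far.
        for (Map.Entry<String, Long> e : order) {
          int lightest = 0;
          for (int i = 1; i < backends; i++)
            if (load[i] < load[lightest]) lightest = i;
          plan.get(lightest).add(e.getKey());
          load[lightest] += e.getValue();
        }

        for (int i = 0; i < backends; i++)
          System.out.println("backend " + i + ": " + plan.get(i)
              + " = " + load[i] + " pages");
      }
    }

With only a handful of segments per backend the totals will not come
out perfectly equal, but as I said, I would try the rough split first.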
Regards
Piotr



On 5/24/05, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
Hi Piotr,

Thanks for the answer, but I don't understand how to calculate how
many segments to put on each backend.
How do I calculate the page counts? In my case, segread reports 13
million pages in the segments, but searching for 'http' finds only
7.5 million. I have 3 backends, and I would like to balance the
segments between them.
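With the segread number that would be about 13/3 = 4.3 million pages
per backend (so, by Chirag's 1G per 1 million pages rule, about 4.3G
of memory each); with the 'http' number it would be only 7.5/3 = 2.5
million pages and about 2.5G each.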
On the server I can't use the lukeall tool, because there is no
graphical interface, and copying all the segments to a local machine
to view them with lukeall would be too much work.

Regards,
Ferenc

Piotr Kosiorowski wrote:

Hi Ferenc,

'bin/nutch segread -list' reports the number of entries in the
fetcher output, so if the data is not corrupted, it should report the
total number of entries generated during fetchlist generation. Luke,
on the other hand, reports the number of documents in the Lucene
index, so it includes only the pages that were correctly processed:
it will not count pages that were not fetched because of errors,
pages that were not parsed successfully, etc. This is also the number
returned when you search for "http", because only correctly indexed
pages are searchable.
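
In your numbers: of the 500000 fetchlist entries segread reports,
only 438000 ever made it into the index (the undelete count), so
about 62000 pages failed during fetching or parsing; of those 438000,
about 18000 were later deleted (deduplication would be my guess),
leaving the 420105 that Luke reports and that an 'http' search can
find.

If you cannot run Luke on the server, you can read the same counts
from the command line. A minimal, untested sketch against the Lucene
API that Nutch bundles - where each segment's index directory lives
depends on your layout, so treat the path as an assumption:

    import org.apache.lucene.index.IndexReader;

    // Headless stand-in for Luke's document counts. Pass the
    // directory of a segment's Lucene index on the command line.
    public class CountDocs {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(args[0]);
        // numDocs() skips deleted documents (what Luke and an 'http'
        // search see); maxDoc() also counts the deleted ones (what
        // Luke shows after undelete).
        System.out.println("numDocs = " + reader.numDocs());
        System.out.println("maxDoc  = " + reader.maxDoc());
        reader.close();
      }
    }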
Regards
Piotr

On 5/24/05, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:


Dear Chirag and Byron,

Thanks for the suggestion, but I don't have any problems with other
applications under Tomcat; the problem occurs only with Nutch.
There is a free version of Resin - is it really better than Tomcat?

Dear Chirag, you wrote to put 1G of memory per 1 million pages on
each backend.
How do I calculate the number of pages in the segments?
If I use the 'bin/nutch segread -list' tool, it says one segment has
500000 pages in it.
If I use the 'lukeall.jar' tool, it says there are 420105 records in
that segment.
If I use lukeall.jar's undelete function, there are 438000 records in
the same segment.
If I search the web search engine for 'http', the count matches
lukeall.jar's.

Which number should I use to calculate pages per backend?

I think my solution for 'paginating' is better than the others
reported. Any comments?

Thanks, Ferenc







