Hi!

I'm wondering what a "good enough" size is for a Nutch node (using Hadoop).
I'm currently playing with Nutch in "local" mode on my desktop
computer (Linux, 4GB RAM, 2.4GHz quad-core).

I may use Nutch at work (not sure yet). We use a lot of VMs (Xen).
I'm thinking about, per VM:
- 1 core (~2.5GHz)
- 2GB RAM
- 100GB of data (HDFS)

And an index size of roughly 20 to 50 million pages.
I'm not sure about the disk size... but it seems that the bigger the
node is (in disk), the more CPU we need.
So I'm looking for a good disk-to-core ratio.
(And GB aren't that cheap when you use good disks and a good-quality
datacenter, so I'm not willing to lose too many GB of free space per
node.)
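
For reference, the kind of per-node tuning I have in mind, alongside
the hardware, is the Hadoop task slots and reserved disk space. This
is only a sketch assuming Hadoop 0.20-style property names, not a
tested config:

  <!-- mapred-site.xml: one map/reduce slot for a 1-core VM -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
  </property>

  <!-- hdfs-site.xml: keep ~10GB per node free for non-HDFS use -->
  <property>
    <name>dfs.datanode.du.reserved</name>
    <value>10737418240</value>
  </property>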

Another question:
Is it possible to create an index limited to a specific domain (and
its subdomains), to be sure to grab as many pages as possible on that
domain, and additionally, in parallel, to run another crawler for "the
whole internet", and then merge both indexes together? (So I'm sure
that the domain I'm interested in is fully indexed.)
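
For the domain-limited crawl, I imagine a dedicated Nutch instance
whose conf/regex-urlfilter.txt only accepts that domain and its
subdomains. Something like this (example.org is just a placeholder):

  # accept pages on example.org and any of its subdomains
  +^http://([a-z0-9-]+\.)*example\.org/
  # skip everything else
  -.
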
If that's possible:
what exactly do I need to merge?
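
My guess is that the merge would be done with the bin/nutch merge
tools, roughly like this (the paths are made up, and I'm not sure of
the exact syntax):

  # merge the two crawl databases into one
  bin/nutch mergedb merged/crawldb crawl-domain/crawldb crawl-web/crawldb
  # merge the two Lucene indexes into one
  bin/nutch merge merged/index crawl-domain/indexes crawl-web/indexes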

Thank you.


-- 
F4FQM
Kerunix Flan
Laurent Laborde
