On Sun, Sep 21, 2008 at 5:37 PM, Allen Wittenauer <[EMAIL PROTECTED]> wrote:

>
>
>
> On 9/21/08 9:40 AM, "Guilherme Menezes" <[EMAIL PROTECTED]> wrote:
> > We currently have 4 nodes (16GB of RAM, 6 * 750 GB disks, Quad-Core AMD
> > Opteron processor). Our initial plans are to perform a Web crawl for
> > academic purposes (something between 500 million and 1 billion pages),
> > and we need to expand the number of nodes for that. Is it better, in
> > terms of Hadoop performance, to have a larger number of simpler nodes
> > (less memory, less processing) than the ones we currently have?
>
>     Your current boxes seem overpowered for crawling. If it were me, I'd
> probably:


Hi, thanks for the answer.

We'll have to run some tests to check whether they really are overpowered for
our needs, but I suspect we would get a lot more parallelism if we had more
nodes (and therefore more disks). As soon as we reach a conclusion I'll post it here!
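
(Just to make "more parallelism per node" concrete: if I'm reading the Hadoop
docs right, the number of concurrent map and reduce tasks on each node is set
in hadoop-site.xml with the properties below. The values are only my guess for
a 6-disk box, not something we have tested.)

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>6</value>
      <description>Guess: roughly one map slot per physical disk.</description>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
      <description>Guess: fewer reduce slots, since crawling is map-heavy.</description>
    </property>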


>
>
>        a) turn the current four machines into dedicated namenode, job
> tracker, secondary name node, oh-no-a-machine-just-died! backup node (set up
> an NFS server on it and run it as your secondary direct copy of the fsimage
> and edits file if you don't have one).  With 16GB name nodes, you should be
> able to store a lot of data.


Very useful suggestions. To clarify the "oh-no-a-machine-just-died! backup
node", what is the difference between this kind of backup (NFS) and the
secondary name node backup, and why do we need both of them?
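
If I understand the NFS part correctly, it would mean listing the NFS mount as
an extra dfs.name.dir entry, so the namenode writes its fsimage and edits to
both places, something like this (the paths are just made up for illustration):

    <property>
      <name>dfs.name.dir</name>
      <value>/hadoop/dfs/name,/mnt/nfs-backup/dfs/name</value>
      <description>Comma-separated list; the namenode writes its fsimage and
      edits log to every directory listed, so the NFS copy stays current.</description>
    </property>

while the secondary name node only periodically merges the edits log into a new
fsimage, so its copy can lag behind the live one. Is that the right way to see it?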


>
>
>        b) when you buy new nodes, I'd cut down on memory and cpu and just
> turn them into your work horses
>
>    That said, I know little-to-nothing about crawling.  So, IMHO on the
> above.


We are studying how Nutch works right now to better understand how crawling
is done with map-reduce. Maybe the nutch-users list would be a better place
to post these questions, but thanks anyway!
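
For the record, the crawl cycle we are trying to map onto the cluster looks
roughly like this, as far as we have understood it from the Nutch tutorial
(directory and segment names are just placeholders):

    bin/nutch inject crawl/crawldb urls                          # seed the crawldb with start URLs
    bin/nutch generate crawl/crawldb crawl/segments              # select a fetch list into a new segment
    bin/nutch fetch crawl/segments/<segment>                     # fetch the pages (runs as map-reduce jobs)
    bin/nutch updatedb crawl/crawldb crawl/segments/<segment>    # merge newly discovered links back

and then the generate/fetch/updatedb steps repeat for each round of the crawl.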
