On Sun, Sep 21, 2008 at 5:37 PM, Allen Wittenauer <[EMAIL PROTECTED]> wrote:
> On 9/21/08 9:40 AM, "Guilherme Menezes" <[EMAIL PROTECTED]> wrote:
> > We currently have 4 nodes (16GB of RAM, 6 * 750 GB disks, Quad-Core
> > AMD Opteron processor). Our initial plans are to perform a Web crawl
> > for academic purposes (something between 500 million and 1 billion
> > pages), and we need to expand the number of nodes for that. Is it
> > better, in terms of Hadoop performance, to have a larger number of
> > simpler nodes than the ones we currently have (less memory, less
> > processing)?
>
> Your current boxes seem overpowered for crawling. If it were me, I'd
> probably:

Hi, thanks for the answer. We'll have to perform some tests to check
whether they really are overpowered for our needs, but I guess we would
get a lot more parallelism if we had more nodes (more disks). As soon as
we reach some conclusion I'll post it here!

> a) turn the current four machines into a dedicated namenode, job
> tracker, secondary namenode, and oh-no-a-machine-just-died! backup node
> (set up an NFS server on it and run it as your secondary direct copy of
> the fsimage and edits file if you don't have one). With 16GB namenodes,
> you should be able to store a lot of data.

Very useful suggestions. To clarify the "oh-no-a-machine-just-died!
backup node": what is the difference between this kind of backup (NFS)
and the secondary namenode backup, and why do we need both of them?

> b) when you buy new nodes, I'd cut down on memory and CPU and just turn
> them into your work horses.
>
> That said, I know little-to-nothing about crawling. So, IMHO on the
> above.

We are studying how Nutch works right now to understand better how
crawling is done with map-reduce. Maybe the nutch-users list would be a
better place to post these questions, but thanks anyway!
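On the NFS copy versus the secondary namenode: the namenode writes its
fsimage and edits to every directory listed in dfs.name.dir, so pointing
one of those entries at an NFS mount keeps an always-current second copy
of the metadata, whereas the secondary namenode only fetches the files
periodically and merges the edits into a fresh checkpoint, which can lag
behind the live state. A rough sketch of how that might look in
hadoop-site.xml (the directory paths below are made up, and
fs.checkpoint.dir is assumed to be where the secondary namenode keeps
its checkpoints):

  <!-- namenode writes fsimage + edits to every listed directory;
       the second entry here is assumed to be an NFS mount -->
  <property>
    <name>dfs.name.dir</name>
    <value>/data/hadoop/name,/mnt/nfs/hadoop/name</value>
  </property>

  <!-- local directory where the secondary namenode stores the
       periodic merged checkpoint it produces -->
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/data/hadoop/namesecondary</value>
  </property>

With the extra dfs.name.dir entry, losing the namenode machine costs
essentially nothing, since metadata writes go to both directories as
they happen; recovering from only a secondary-namenode checkpoint loses
whatever changed after the last merge.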
