Speaking of the NFS-backup idea: if I have secure NFS storage that is much slower than the network we use between nodes (3MB/s vs. 100MB/s), will it adversely affect performance, or can I rely on NFS caching to do the job? And if the NFS share dies, will it shut down the namenode as well?
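For reference, my understanding is that the extra copy is just an additional entry in dfs.name.dir, so the namenode writes the fsimage and edits to every directory in the list. A rough sketch of what I have in mind for hadoop-site.xml (property name from the 0.18-era docs; paths are made up):

  <property>
    <name>dfs.name.dir</name>
    <!-- comma-separated list: local disk first, NFS mount second -->
    <value>/local/hadoop/dfs/name,/mnt/nfs/hadoop/dfs/name</value>
  </property>

As far as I can tell, only the edits log and periodic checkpoints would go over NFS, never block data between datanodes, so the 100MB/s node-to-node traffic should be unaffected; my worry is only whether slow edits writes hurt namespace operations.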
-----Original Message-----
From: Allen Wittenauer [mailto:[EMAIL PROTECTED]
Sent: Sunday, September 21, 2008 1:38 PM
To: [email protected]
Subject: Re: Hadoop Cluster Size Scalability Numbers?

On 9/21/08 9:40 AM, "Guilherme Menezes" <[EMAIL PROTECTED]> wrote:
> We currently have 4 nodes (16GB of RAM, 6 * 750 GB disks, Quad-Core AMD
> Opteron processor). Our initial plans are to perform a Web crawl for
> academic purposes (something between 500 million and 1 billion pages),
> and we need to expand the number of nodes for that. Is it better to have
> a larger number of simpler nodes than the ones we currently have (less
> memory, less processing?) in terms of Hadoop performance?

    Your current boxes seem overpowered for crawling. If it were me, I'd
probably:

a) turn the current four machines into a dedicated namenode, job tracker,
secondary namenode, and oh-no-a-machine-just-died! backup node (set up an
NFS server on it and keep a secondary direct copy of the fsimage and edits
files on it if you don't have one). With 16GB namenodes, you should be
able to store a lot of data.

b) when you buy new nodes, I'd cut down on memory and CPU and just turn
them into your work horses.

    That said, I know little-to-nothing about crawling. So, IMHO on the
above.
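For the secondary-namenode piece of (a), the knobs I am looking at are the 0.18-era checkpoint settings; hostname and paths below are made up:

  <property>
    <name>fs.checkpoint.dir</name>
    <!-- where the secondary keeps its copy of the checkpointed image -->
    <value>/local/hadoop/dfs/namesecondary</value>
  </property>
  <property>
    <name>fs.checkpoint.period</name>
    <!-- seconds between checkpoints of fsimage+edits -->
    <value>3600</value>
  </property>
  <property>
    <name>dfs.http.address</name>
    <!-- namenode web address the secondary pulls the image and edits from -->
    <value>namenode-host:50070</value>
  </property>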
