How about multi-threaded mappers? Multi-threaded mappers are ideal for map tasks that are bound by non-local I/O against many distinct endpoints, which is the usual profile of a crawler. You can also control the thread count on a per-job basis.
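To make the idea concrete, here is a minimal, plain-Java sketch of what a multi-threaded mapper does inside a single task: one process, a pool of threads, each thread blocked on a different remote endpoint. It is an illustration only and does not use the Hadoop API. The class name ThreadedMapSketch, the fetch() stand-in, and the 50 ms delay are all made up for the example. In Hadoop itself the old mapred API ships org.apache.hadoop.mapred.lib.MultithreadedMapRunner for this purpose; the per-job thread-count property is best looked up in the documentation for your Hadoop version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ThreadedMapSketch {

    // Hypothetical stand-in for a slow remote call (for example a page fetch).
    static String fetch(String endpoint) throws InterruptedException {
        Thread.sleep(50); // simulated network latency, an assumption for the demo
        return "fetched:" + endpoint;
    }

    // Runs fetch() over all endpoints using a fixed pool of `threads` workers.
    // Results come back in input order because the futures are read in order.
    static List<String> mapAll(List<String> endpoints, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String endpoint : endpoints) {
                Callable<String> task = () -> fetch(endpoint);
                pending.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : pending) {
                results.add(f.get()); // blocks until that task has finished
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> endpoints = Arrays.asList("a", "b", "c", "d", "e", "f", "g", "h");
        // The thread count is just a parameter, so it can differ per job.
        for (int threads : new int[] {1, 8}) {
            long start = System.nanoTime();
            List<String> out = mapAll(endpoints, threads);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println(threads + " thread(s): " + out.size() + " results in " + ms + " ms");
        }
    }
}
```

Since the threads share one JVM, the extra concurrency comes without starting additional mapper processes, which is the relevant point when the task nodes only have about 1G of memory to spare.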

On Sat, Jun 27, 2009 at 8:26 AM, Marcus Herou <marcus.he...@tailsweep.com> wrote:

> The argument currently against increasing num-mappers is that the machines
> will get into oom and since a lot of the jobs are crawlers I need more
> ip-numbers so I don't get banned :)
>
> Thing is that we currently have solr on the very same machines and
> data-nodes as well so I can only give the MR nodes about 1G memory since I
> need SOLR to have 4G...
>
> Now I see that I should get some obvious and juste critique about the layout
> of this arch but I'm a little limited in budget and so is then the arch :)
>
> However is it wise to have the MR tasks on the same nodes as the data-nodes
> or should I split the arch ? I mean the data-nodes perhaps need more disk-IO
> and the MR more memory and CPU ?
>
> Trying to find a sweetspot hardware spec of those two roles.
>
> //Marcus
>
> On Sat, Jun 27, 2009 at 4:24 AM, Brian Bockelman <bbock...@cse.unl.edu> wrote:
>
> > Hey Marcus,
> >
> > Are you recording the data rates coming out of HDFS? Since you have such a
> > low CPU utilizations, I'd look at boxes utterly packed with big hard drives
> > (also, why are you using RAID1 for Hadoop??).
> >
> > You can get 1U boxes with 4 drive bays or 2U boxes with 12 drive bays.
> > Based on the data rates you see, make the call.
> >
> > On the other hand, what's the argument against running 3x more mappers per
> > box? It seems that your boxes still have more overhead to use -- there's no
> > I/O wait.
> >
> > Brian
> >
> > On Jun 26, 2009, at 4:43 PM, Marcus Herou wrote:
> >
> >> Hi.
> >>
> >> We have a deployment of 10 hadoop servers and I now need more mapping
> >> capability (no not just add more mappers per instance) since I have so many
> >> jobs running. Now I am wondering what I should aim on...
> >> Memory, cpu or disk... How long is a rope perhaps you would say ?
> >>
> >> A typical server is currently using about 15-20% cpu today on a quad-core
> >> 2.4Ghz 8GB RAM machine with 2 RAID1 SATA 500GB disks.
> >>
> >> Some specs below.
> >>
> >>> mpstat 2 5
> >>
> >> Linux 2.6.24-19-server (mapreduce2)    06/26/2009
> >>
> >> 11:36:13 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
> >> 11:36:15 PM  all   22.82    0.00    3.24    1.37    0.62    2.49    0.00   69.45   8572.50
> >> 11:36:17 PM  all   13.56    0.00    1.74    1.99    0.62    2.61    0.00   79.48   8075.50
> >> 11:36:19 PM  all   14.32    0.00    2.24    1.12    1.12    2.24    0.00   78.95   9219.00
> >> 11:36:21 PM  all   14.71    0.00    0.87    1.62    0.25    1.75    0.00   80.80   8489.50
> >> 11:36:23 PM  all   12.69    0.00    0.87    1.24    0.50    0.75    0.00   83.96   5495.00
> >> Average:     all   15.62    0.00    1.79    1.47    0.62    1.97    0.00   78.53   7970.30
> >>
> >> What I am thinking is... Is it wiser to go for many of these cheap boxes
> >> with 8GB of RAM or should I for instance focus on machines which can give
> >> more I|O throughput ?
> >>
> >> I know that these things are hard but perhaps someone have draw some
> >> conclusions before the pragmatic way.
> >>
> >> Kindly
> >>
> >> //Marcus
> >>
> >> --
> >> Marcus Herou CTO and co-founder Tailsweep AB
> >> +46702561312
> >> marcus.he...@tailsweep.com
> >> http://www.tailsweep.com/
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.he...@tailsweep.com
> http://www.tailsweep.com/

--
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals