Hi.

The crawlers are _very_ threaded, but no, we use our own threading framework, since multi-threaded mappers were not available in hadoop-core at the time.
Crawlers normally just wait a lot on clients, inducing very little CPU load, but they consume some memory due to the parallelism.

//Marcus

On Sat, Jun 27, 2009 at 6:10 PM, jason hadoop <jason.had...@gmail.com> wrote:

> How about multi-threaded mappers?
> Multi-threaded mappers are ideal for map tasks that are non-locally I/O bound
> with many distinct endpoints.
> You can also control the thread count on a per-job basis.
>
> On Sat, Jun 27, 2009 at 8:26 AM, Marcus Herou <marcus.he...@tailsweep.com> wrote:
>
> > The argument currently against increasing num-mappers is that the machines
> > will get into OOM, and since a lot of the jobs are crawlers I need more
> > IP numbers so I don't get banned :)
> >
> > Thing is that we currently have solr on the very same machines, and
> > data-nodes as well, so I can only give the MR nodes about 1G memory since I
> > need SOLR to have 4G...
> >
> > Now I see that I should get some obvious and just critique about the layout
> > of this arch, but I'm a little limited in budget and so is then the arch :)
> >
> > However, is it wise to have the MR tasks on the same nodes as the data-nodes,
> > or should I split the arch? I mean the data-nodes perhaps need more disk I/O
> > and the MR more memory and CPU?
> >
> > Trying to find a sweet-spot hardware spec for those two roles.
> >
> > //Marcus
> >
> > On Sat, Jun 27, 2009 at 4:24 AM, Brian Bockelman <bbock...@cse.unl.edu> wrote:
> >
> > > Hey Marcus,
> > >
> > > Are you recording the data rates coming out of HDFS? Since you have such
> > > low CPU utilization, I'd look at boxes utterly packed with big hard drives
> > > (also, why are you using RAID1 for Hadoop??).
> > >
> > > You can get 1U boxes with 4 drive bays or 2U boxes with 12 drive bays.
> > > Based on the data rates you see, make the call.
> > >
> > > On the other hand, what's the argument against running 3x more mappers per
> > > box? It seems that your boxes still have more overhead to use -- there's no
> > > I/O wait.
> > >
> > > Brian
> > >
> > > On Jun 26, 2009, at 4:43 PM, Marcus Herou wrote:
> > >
> > >> Hi.
> > >>
> > >> We have a deployment of 10 hadoop servers and I now need more mapping
> > >> capability (no, not just adding more mappers per instance) since I have so
> > >> many jobs running. Now I am wondering what I should aim for...
> > >> Memory, CPU or disk... How long is a rope, perhaps you would say?
> > >>
> > >> A typical server is currently using about 15-20% CPU today on a quad-core
> > >> 2.4GHz 8GB RAM machine with 2 RAID1 SATA 500GB disks.
> > >>
> > >> Some specs below.
> > >>
> > >>> mpstat 2 5
> > >>
> > >> Linux 2.6.24-19-server (mapreduce2)    06/26/2009
> > >>
> > >> 11:36:13 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
> > >> 11:36:15 PM  all   22.82    0.00    3.24    1.37    0.62    2.49    0.00   69.45   8572.50
> > >> 11:36:17 PM  all   13.56    0.00    1.74    1.99    0.62    2.61    0.00   79.48   8075.50
> > >> 11:36:19 PM  all   14.32    0.00    2.24    1.12    1.12    2.24    0.00   78.95   9219.00
> > >> 11:36:21 PM  all   14.71    0.00    0.87    1.62    0.25    1.75    0.00   80.80   8489.50
> > >> 11:36:23 PM  all   12.69    0.00    0.87    1.24    0.50    0.75    0.00   83.96   5495.00
> > >> Average:     all   15.62    0.00    1.79    1.47    0.62    1.97    0.00   78.53   7970.30
> > >>
> > >> What I am thinking is... Is it wiser to go for many of these cheap boxes
> > >> with 8GB of RAM, or should I for instance focus on machines which can give
> > >> more I/O throughput?
> > >>
> > >> I know that these things are hard, but perhaps someone has already drawn
> > >> some conclusions the pragmatic way.
> > >>
> > >> Kindly
> > >>
> > >> //Marcus
> > >>
> > >> --
> > >> Marcus Herou CTO and co-founder Tailsweep AB
> > >> +46702561312
> > >> marcus.he...@tailsweep.com
> > >> http://www.tailsweep.com/
> > >
> >
> > --
> > Marcus Herou CTO and co-founder Tailsweep AB
> > +46702561312
> > marcus.he...@tailsweep.com
> > http://www.tailsweep.com/
>
> --
> Pro Hadoop, a book to guide you from beginner to hadoop mastery,
> http://www.amazon.com/dp/1430219424?tag=jewlerymall
> www.prohadoopbook.com a community for Hadoop Professionals

--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
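
A minimal sketch of the idea behind threading inside a single task, as discussed in this thread: when each unit of work mostly waits on a remote endpoint, a pool of threads can overlap those waits while using very little CPU. This is plain Java, not Hadoop's API and not the in-house framework mentioned above. The class name `ThreadedFetchSketch`, the methods `fetch` and `fetchAll`, the host names, and the 50 ms sleep standing in for network latency are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ThreadedFetchSketch {

    // Hypothetical stand-in for a crawler fetch: the thread only sleeps,
    // which mimics waiting on a remote client without burning CPU.
    static String fetch(String endpoint) throws InterruptedException {
        Thread.sleep(50);
        return "fetched:" + endpoint;
    }

    // Fetch every endpoint with a fixed-size pool. The pool size plays the
    // role of a per-job thread count; results keep the input order.
    static List<String> fetchAll(List<String> endpoints, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String endpoint : endpoints) {
                futures.add(pool.submit(() -> fetch(endpoint)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // blocks until that fetch completes
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> endpoints = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            endpoints.add("host-" + i); // made-up endpoint names
        }
        long start = System.nanoTime();
        List<String> results = fetchAll(endpoints, 10);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(results.size() + " fetches in about " + elapsedMs + " ms");
    }
}
```

Each thread holds its own stack and buffers while it waits, which is why this style keeps CPU usage low but still costs memory as the thread count grows.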