Brian,

This particular task takes time in computation, on the order of minutes.

----
T. "Kuro" Kurosaka

From: Brian Bockelman <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wed, 31 Aug 2011 08:04:07 -0400
To: "[email protected]" <[email protected]>
Subject: Re: How big data and/or how many machines do I need to take advantage of Hadoop?
Hi Kuro,

A 100MB file should take about 1 second to read, and MR jobs typically take on the order of seconds just to get scheduled. So it's unlikely you'll see any benefit. You'll probably want to have a look at Amdahl's law:

http://en.wikipedia.org/wiki/Amdahl%27s_law

Brian

On Aug 31, 2011, at 3:48 AM, Teruhiko Kurosaka wrote:

Hadoop newbie here. I wrapped my company's entity extraction product in a Hadoop task and gave it a large file, on the order of 100MB. I have 4 VMs running on a 24-core CPU server; two of them are slave nodes, one is the namenode, and the other is the job tracker. It turned out that processing the same data takes longer with Hadoop than processing it serially. I am curious how I can experience the advantage of Hadoop. Is having many physical machines essential? Would I need to process terabytes of data? What would be the minimum setup where I can see the advantage of Hadoop?

----
T. "Kuro" Kurosaka
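
To put Brian's Amdahl's-law point in numbers, here is a quick back-of-the-envelope sketch. The 30-second fixed job overhead and the workload times are illustrative assumptions (not measurements from this cluster), and the model treats the work as perfectly parallelizable apart from that overhead:

# Back-of-the-envelope: when does parallelizing with Hadoop pay off?
# Assumption (illustrative): ~30 s of fixed per-job overhead for
# scheduling/startup, and a workload that otherwise scales ideally.

def hadoop_time(serial_seconds, workers, overhead_seconds=30.0):
    """Estimated wall-clock time: fixed job overhead plus the serial
    work divided evenly across workers (ideal scaling)."""
    return overhead_seconds + serial_seconds / workers

for serial in (1.0, 60.0, 3600.0):  # ~100MB read, one minute, one hour
    t = hadoop_time(serial, workers=2)
    print(f"serial={serial:6.0f}s  on 2 workers={t:7.1f}s  "
          f"speedup={serial / t:4.2f}x")

With these assumed numbers, a ~1-second job (reading 100MB) comes out roughly 30x slower on the cluster, a one-minute job about breaks even on two workers, and an hour-long job approaches the ideal 2x, which is why a computation taking minutes, as Kuro reports above, sits near the break-even point.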
