Hi,

My suggestion would be not to force a comparison between databases and Hadoop. That said, here are some rough guidelines - probably not even close to what you need, but they might be helpful:

1. Number of nodes - these are the parameters to look at:
- average time taken by a single map and reduce task (available as part of the job history analytics),
- max input file size vs. block size.

Let's take an example: a 64 GB input file with a 64 MB block size would require ~1000 maps. The more of these 1000 maps you want to run in parallel, the more nodes you need. A 10-node cluster running 10 maps per node can execute only 100 at a time, so the job would run in ~10 sequential waves :-( Ultimately it is the time vs. cost trade-off that decides the number of nodes. For this example, if a map takes at least 2 minutes, the minimum job time would be about 2*10 = 20 minutes; less time means more nodes.

The number of jobs you decide to run at the same time also affects the number of nodes. Effectively, every individual task (map/reduce) waits in a queue for the currently executing map/reduce tasks to finish. (Of course there is some prioritization support - this does not, however, make everything finish in parallel.)

2. RAM - a general rule of thumb: 1 GB RAM each for the NameNode, JobTracker and Secondary NameNode on the master side. On the slave side, 1 GB each for the TaskTracker and DataNode, which leaves practically not much for real computing on a commodity 8 GB machine - the remaining 5-6 GB can be used for map/reduce tasks. So with our example of running 10 maps per node, each map gets at most a 400-500 MB heap. Anything beyond that requires either fewer concurrent maps or more RAM.

3. Network speed - Hadoop recommends (I think I read it somewhere - apologies if otherwise) using at least 1 Gb/s links for the heavy data transfer. My experience with 100 Mb/s, even in a dev environment, has been disastrous.

4. Hard disk - again a rule of thumb: only about 1/4 of the raw capacity is effectively available. Given a 4 TB disk, only ~1 TB can be used for real data, with 2 TB consumed by replication (3 is the usual replication factor) and 1 TB for temp usage.

Regards,
Sanjay
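P.S. The back-of-envelope arithmetic above can be sketched in a few lines of Python. All the inputs (slots per node, daemon RAM, replication factor, temp fraction) are the assumptions from this mail, not Hadoop defaults - plug in your own numbers:

```python
# Rough Hadoop cluster sizing, following the thumb rules in this mail.

def num_maps(input_bytes, block_bytes):
    """One map task per input block (ceiling division)."""
    return -(-input_bytes // block_bytes)

def job_time_minutes(maps, nodes, map_slots_per_node, minutes_per_map):
    """Maps run in sequential 'waves' once every slot in the cluster is busy."""
    slots = nodes * map_slots_per_node
    waves = -(-maps // slots)
    return waves * minutes_per_map

def heap_per_map_mb(node_ram_gb, daemon_ram_gb, map_slots_per_node):
    """RAM left after the TaskTracker/DataNode daemons, split across map slots.
    OS overhead would push the real figure lower (hence 400-500 MB above)."""
    return (node_ram_gb - daemon_ram_gb) * 1024 // map_slots_per_node

def usable_data_tb(raw_tb, replication=3, temp_fraction=0.25):
    """Raw disk minus temp space, divided by the replication factor."""
    return raw_tb * (1 - temp_fraction) / replication

GB = 1024 ** 3
MB = 1024 ** 2

maps = num_maps(64 * GB, 64 * MB)
time = job_time_minutes(maps, nodes=10, map_slots_per_node=10, minutes_per_map=2)
heap = heap_per_map_mb(node_ram_gb=8, daemon_ram_gb=2, map_slots_per_node=10)
data = usable_data_tb(4)

print(maps, time, heap, data)  # 1024 maps, 22 min, 614 MB heap, 1.0 TB of data
```

(The ceiling division makes the wave count slightly more pessimistic than the ~20 minutes quoted above: 1024 maps on 100 slots is 11 waves, i.e. 22 minutes.)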
bharath vissapragada-2 wrote:
>
> Hi all,
>
> Is there any general cost model that can be used to guess the run time of
> a program (similar to page I/Os, selectivity factors in an RDBMS) in terms
> of any config aspects such as number of nodes / page I/Os etc.
>

--
View this message in context: http://www.nabble.com/cost-model-for-MR-programs-tp25127531p25199508.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
