Hi,

My suggestion would be not to force a comparison between databases and Hadoop. That said, here are some rough guidelines - probably not even close to what you need, but they might be helpful:

1. Number of nodes - these are the parameters to look at:
- average time taken by a single map and reduce task (available as part of the job history analytics),
- max input file size vs. block size.

Let's take an example: a 64 GB input file with a 64 MB block size would require ~1000 maps. The more of these 1000 maps you want to run in parallel, the more nodes you need. A 10-node cluster running 10 maps per node can execute only 100 at a time, so the job would run in ~10 sequential waves :-( Ultimately it is the time vs. cost trade-off that decides the number of nodes. For this example, if a map takes at least 2 minutes, the minimum job time would be about 2*10 = 20 minutes; less time means more nodes.

The number of jobs you decide to run at the same time also affects the number of nodes. Effectively, every individual task (map/reduce) waits in a queue for the currently executing map/reduce tasks to finish. (Of course there is some prioritization support - this does not, however, make everything finish in parallel.)

2. RAM - a general rule of thumb: 1 GB RAM each for the NameNode, JobTracker and Secondary NameNode on the master side. On the slave side, 1 GB each for the TaskTracker and DataNode, which leaves practically not much for real computing on a commodity 8 GB machine - the remaining 5-6 GB can be used for map/reduce tasks. So with our example of running 10 maps per node, each map gets at most a 400-500 MB heap. Anything beyond that requires either fewer concurrent maps or more RAM.

3. Network speed - Hadoop recommends (I think I read it somewhere - apologies if otherwise) using at least 1 Gb/s links for the heavy data transfer. My experience with 100 Mb/s, even in a dev environment, has been disastrous.

4. Hard disk - again a rule of thumb: only about 1/4 of the raw capacity is effectively available. Given a 4 TB disk, only ~1 TB can be used for real data, with 2 TB consumed by replication (3 is the usual replication factor) and 1 TB for temp usage.

Regards,
Sanjay
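P.S. The back-of-envelope arithmetic above can be sketched in a few lines of Python. All the inputs (slots per node, daemon RAM, replication factor, temp fraction) are the assumptions from this mail, not Hadoop defaults - plug in your own numbers:

```python
# Rough Hadoop cluster sizing, following the thumb rules in this mail.

def num_maps(input_bytes, block_bytes):
    """One map task per input block (ceiling division)."""
    return -(-input_bytes // block_bytes)

def job_time_minutes(maps, nodes, map_slots_per_node, minutes_per_map):
    """Maps run in sequential 'waves' once every slot in the cluster is busy."""
    slots = nodes * map_slots_per_node
    waves = -(-maps // slots)
    return waves * minutes_per_map

def heap_per_map_mb(node_ram_gb, daemon_ram_gb, map_slots_per_node):
    """RAM left after the TaskTracker/DataNode daemons, split across map slots.
    OS overhead would push the real figure lower (hence 400-500 MB above)."""
    return (node_ram_gb - daemon_ram_gb) * 1024 // map_slots_per_node

def usable_data_tb(raw_tb, replication=3, temp_fraction=0.25):
    """Raw disk minus temp space, divided by the replication factor."""
    return raw_tb * (1 - temp_fraction) / replication

GB = 1024 ** 3
MB = 1024 ** 2

maps = num_maps(64 * GB, 64 * MB)
time = job_time_minutes(maps, nodes=10, map_slots_per_node=10, minutes_per_map=2)
heap = heap_per_map_mb(node_ram_gb=8, daemon_ram_gb=2, map_slots_per_node=10)
data = usable_data_tb(4)

print(maps, time, heap, data)  # 1024 maps, 22 min, 614 MB heap, 1.0 TB of data
```

(The ceiling division makes the wave count slightly more pessimistic than the ~20 minutes quoted above: 1024 maps on 100 slots is 11 waves, i.e. 22 minutes.)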
bharath vissapragada-2 wrote:
>
> Hi all,
>
> Is there any general cost model that can be used to guess the run time of
> a program (similar to page I/Os, selectivity factors in an RDBMS) in terms
> of any config aspects such as number of nodes / page I/Os etc.
>

--
View this message in context: http://www.nabble.com/cost-model-for-MR-programs-tp25127531p25199508.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
