Hello Avi,

I just came across your email, so my answer may no longer be relevant.
Even though your files are 150+ MB, I would still reduce your block size to 64 MB. How many mappers do you run concurrently? With 1.7 GB of memory I would start with 4 mappers per machine. How many nodes do you have (I didn't see that in your email)? That would be a factor in choosing the number of reducers (as well as the dfs.block.size). I've put a rough sketch of the relevant config entries at the bottom of this message.

Cheers,
Keren

On Thu, Aug 18, 2011 at 10:59 AM, אבי ווקנין <avivakni...@gmail.com> wrote:
> Hi all!
> How are you?
>
> My name is Ronen and I have been fascinated by Apache Hadoop for the last
> few months. I have spent the last two weeks trying to optimize my
> configuration files and environment. I have gone through many of Hadoop's
> configuration properties, and it seems that none of them makes a big
> difference (+- 3 minutes of total job run time).
>
> By Hadoop standards my cluster is considered extremely small (260 GB of
> text files, while every job goes through only about 8 GB). I have one
> server acting as NameNode and JobTracker, and another 5 servers acting as
> DataNodes and TaskTrackers. Right now Hadoop's configuration is set to the
> defaults, apart from the DFS block size, which is set to 256 MB since
> every file in my cluster is 155-250 MB.
>
> All of the above servers are identical, with the following hardware and
> software:
> 1.7 GB memory
> 1 Intel(R) Xeon(R) CPU E5507 @ 2.27GHz
> Ubuntu Server 10.10, 32-bit platform
> Cloudera CDH3 manual Hadoop installation
> (for those who are familiar with Amazon Web Services, I am talking about
> Small EC2 instances/servers)
>
> Total job run time is about 15 minutes (about 50 files/blocks/map tasks of
> up to 250 MB, and 10 reduce tasks).
>
> Based on the above information, can anyone recommend a best-practice
> configuration? Do you think that with such a small cluster, processing
> such a small amount of data, it is even possible to optimize jobs so they
> run much faster?
>
> By the way, it seems that none of the nodes has a hardware performance
> issue (CPU/memory) while running the job. That's true unless I have a
> bottleneck somewhere else (it seems network bandwidth is not the issue).
> That issue is a little confusing, because the NameNode process and the
> JobTracker process should each allocate 1 GB of memory, which means that
> my hardware starting point is insufficient; in that case, why am I not
> seeing full memory utilization using the 'top' command on the NameNode &
> JobTracker server? How would you recommend measuring/monitoring Hadoop's
> various properties to find out where the bottleneck is?
>
> Thanks for your help!!
>
> Avi

--
Keren Ouaknine
Web: www.kereno.com
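
P.S. Here is a minimal sketch of the config entries I mean, assuming the CDH3 / Hadoop 0.20 property names (dfs.block.size in hdfs-site.xml; the map-slot and reducer settings in mapred-site.xml). The values are only a starting point, not numbers tuned to your data:

    <!-- hdfs-site.xml: default block size for newly written files -->
    <property>
      <name>dfs.block.size</name>
      <value>67108864</value>  <!-- 64 MB, in bytes -->
    </property>

    <!-- mapred-site.xml: per-TaskTracker map slots and per-job default reducer count -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>  <!-- 4 concurrent mappers per machine, as suggested above -->
    </property>
    <property>
      <name>mapred.reduce.tasks</name>
      <value>10</value>  <!-- illustrative; scale this with your node count -->
    </property>

Note that the TaskTracker only picks up the slot change after a restart, and dfs.block.size affects only files written after the change, so existing files keep their 256 MB blocks unless you copy them back into HDFS.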