Hello Avi,

I just came across your email, so my answer may no longer be relevant.
Even though your files are 150+ MB, I would still reduce your block size to 64 MB. How many mappers do you run concurrently? With 1.7 GB of memory I would start with 4 mappers per machine. How many nodes do you have (I didn't see that in your email)? That would be a factor in choosing the number of reducers (as well as the dfs.block.size). I've put a rough sketch of the relevant config entries at the bottom of this message.

Cheers,
Keren

On Thu, Aug 18, 2011 at 10:59 AM, אבי ווקנין <avivakni...@gmail.com> wrote:
> Hi all!
> How are you?
>
> My name is Ronen and I have been fascinated by Apache Hadoop for the last
> few months. I have spent the last two weeks trying to optimize my
> configuration files and environment. I have gone through many of Hadoop's
> configuration properties, and it seems that none of them makes a big
> difference (+- 3 minutes of total job run time).
>
> By Hadoop standards my cluster is considered extremely small (260 GB of
> text files, while every job goes through only about 8 GB). I have one
> server acting as NameNode and JobTracker, and another 5 servers acting as
> DataNodes and TaskTrackers. Right now Hadoop's configuration is set to the
> defaults, apart from the DFS block size, which is set to 256 MB since
> every file in my cluster is 155-250 MB.
>
> All of the above servers are identical, with the following hardware and
> software:
> 1.7 GB memory
> 1 Intel(R) Xeon(R) CPU E5507 @ 2.27GHz
> Ubuntu Server 10.10, 32-bit platform
> Cloudera CDH3 manual Hadoop installation
> (for those who are familiar with Amazon Web Services, I am talking about
> Small EC2 instances/servers)
>
> Total job run time is about 15 minutes (about 50 files/blocks/map tasks of
> up to 250 MB, and 10 reduce tasks).
>
> Based on the above information, can anyone recommend a best-practice
> configuration? Do you think that with such a small cluster, processing
> such a small amount of data, it is even possible to optimize jobs so they
> run much faster?
>
> By the way, it seems that none of the nodes has a hardware performance
> issue (CPU/memory) while running the job. That's true unless I have a
> bottleneck somewhere else (it seems network bandwidth is not the issue).
> That issue is a little confusing, because the NameNode process and the
> JobTracker process should each allocate 1 GB of memory, which means that
> my hardware starting point is insufficient; in that case, why am I not
> seeing full memory utilization using the 'top' command on the NameNode &
> JobTracker server? How would you recommend measuring/monitoring Hadoop's
> various properties to find out where the bottleneck is?
>
> Thanks for your help!!
>
> Avi

--
Keren Ouaknine
Web: www.kereno.com
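
P.S. Here is a minimal sketch of the config entries I mean, assuming the CDH3 / Hadoop 0.20 property names (dfs.block.size in hdfs-site.xml; the map-slot and reducer settings in mapred-site.xml). The values are only a starting point, not numbers tuned to your data:

    <!-- hdfs-site.xml: default block size for newly written files -->
    <property>
      <name>dfs.block.size</name>
      <value>67108864</value>  <!-- 64 MB, in bytes -->
    </property>

    <!-- mapred-site.xml: per-TaskTracker map slots and per-job default reducer count -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>  <!-- 4 concurrent mappers per machine, as suggested above -->
    </property>
    <property>
      <name>mapred.reduce.tasks</name>
      <value>10</value>  <!-- illustrative; scale this with your node count -->
    </property>

Note that the TaskTracker only picks up the slot change after a restart, and dfs.block.size affects only files written after the change, so existing files keep their 256 MB blocks unless you copy them back into HDFS.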