Hi All,
     I have around 2.5 GB of data stored in S3. To run EMR
jobs on this data, I am downloading it from S3 to HDFS using

hadoop distcp s3://<LOCATION> /tmp/

I am using 9 c1.xlarge instances (8 virtual cores with 2.5 EC2
Compute Units each), which means I have 72 cores available in total.
Hadoop takes nearly 7 minutes to execute the above command, and the
actual MapReduce job for distcp only starts after 5 minutes.

I tried to increase the number of map tasks using the "-m" option,
but it is still taking 7 minutes. Can someone suggest the best way to
download data from S3 to HDFS while making use of all the available
machines?
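
For reference, the variant with "-m" that I tried looked roughly like
this (the map count of 72 is only an illustrative value, chosen to
match the number of available cores):

hadoop distcp -m 72 s3://<LOCATION> /tmp/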


-----
Thanks,
Thulasi Ram P
