Hi All,

I have around 2.5 GB of data in S3. To run EMR jobs on this data, I am downloading it from S3 to HDFS using:

    hadoop distcp s3://<LOCATION> /tmp/

I am running 9 c1.xlarge instances (8 virtual cores with 2.5 EC2 Compute Units each), which gives me 72 cores in total. Hadoop takes nearly 7 minutes to execute the above command, and the actual MapReduce job for distcp only starts after 5 minutes. I tried increasing the number of map tasks with the "-m" option, along the lines of the sketch below.
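This is roughly the form of the command I tried; <NUM_MAPS> is a placeholder for the values I experimented with (for example 72, to match the number of available cores):

    hadoop distcp -m <NUM_MAPS> s3://<LOCATION> /tmp/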
But it is still taking 7 minutes. Can someone suggest the best way to download data from S3 to HDFS while making use of all the available machines?

-----
Thanks,
Thulasi Ram P