Hi all,

I have a few questions about distcp. Note that we are trying to distcp from a "normal" unpatched Hadoop 0.20.1 cluster to a CDH3 Hadoop cluster, so we start distcp on the CDH3 cluster and use hftp for the source URL.
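To make the setup concrete, here is a minimal sketch of the general shape of such an invocation, assembled in Python. The hostnames, ports, path, and map count are placeholders for illustration, not our real values; `-i` (ignore failures) and `-m` (maximum number of simultaneous copies) are the standard distcp options.

```python
def build_distcp_command(src_namenode, dst_namenode, path,
                         max_maps=None, ignore_failures=False):
    """Build an argv list for `hadoop distcp` with an hftp source.

    src_namenode: host:port of the source NameNode's HTTP interface (placeholder).
    dst_namenode: host:port of the destination NameNode's RPC interface (placeholder).
    """
    cmd = ["hadoop", "distcp"]
    if ignore_failures:
        cmd.append("-i")                  # keep going when a file fails to copy
    if max_maps is not None:
        cmd += ["-m", str(max_maps)]      # upper bound on simultaneous copies
    cmd.append(f"hftp://{src_namenode}{path}")   # read-only source over HTTP
    cmd.append(f"hdfs://{dst_namenode}{path}")   # destination on the new cluster
    return cmd


# Hypothetical hosts and ports, chosen only for the example.
cmd = build_distcp_command(
    "src-nn.example.com:50070",
    "dst-nn.example.com:8020",
    "/master/201005",
    max_maps=8,
    ignore_failures=True,
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` on a node of the destination cluster, since the hftp source is read-only and the job has to write through the destination's own HDFS client.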
1) Our new cluster has 25 machines but 100 mappers. When distcp is triggered, it seems to allocate 4 mappers per machine. Is this normal? The issue is that if distcp only needs, say, 8 mappers, I would expect it to spread them across different machines so that IO is not saturated on any one machine. What I've been seeing is that of those 8 map tasks, 4 are assigned to one machine and 4 to another, as opposed to each of the 8 being assigned to a different machine.

2) Distcp cannot get the _logs directory. I keep getting this error:

2010-08-15 02:26:19,179 INFO org.apache.hadoop.tools.DistCp: FAIL _logs/history/mi-prod-app01.ec2.biz360.com_1273881751016_job_201005141702_51820_hadoop_com.biz360.jobs.DateFilterMerge+%2Fmaster%2F201005%2Fyou : java.io.IOException: Server returned HTTP response code: 500 for URL: http://mi-prod-app05:50075/streamFile?filename=/master/201005/youtube/_logs/history/mi-prod-app01.ec2.biz360.com_1273881751016_job_201005141702_51820_hadoop_com.biz360.jobs.DateFilterMerge+%252Fmaster%252F201005%252Fyou&ugi=hadoop,hadoop
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1313)
    at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:157)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:398)
    at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:410)
    at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:537)
    at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:306)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)

Other than using the -i flag to "ignore" this, is there another workaround? I tried downloading that file to local disk and it works fine, so it's not that the data does not exist.
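One detail visible in the error above: the job history file name contains a literal "%2F", and in the streamFile URL it shows up as "%252F". The short Python sketch below only reproduces that encoding step. Keeping "+" and "/" unescaped is an assumption made to mirror the logged URL, and whether this double encoding is what actually triggers the HTTP 500 is not established here.

```python
from urllib.parse import quote, unquote

# Tail of the history file name as it appears in the DistCp FAIL line.
name_on_hdfs = "DateFilterMerge+%2Fmaster%2F201005%2Fyou"

# Percent-encoding the name for use in a URL turns each literal "%" into "%25".
# safe="/+" is an assumption chosen so the result matches the logged URL.
encoded = quote(name_on_hdfs, safe="/+")
print(encoded)

# Decoding once recovers the real file name; decoding twice would
# turn "%2F" into "/" and point at a path that does not exist.
decoded_once = unquote(encoded)
decoded_twice = unquote(decoded_once)
print(decoded_once)
print(decoded_twice)
```

If the server side decodes the filename parameter more than once, it would end up looking for the third printed form rather than the real file name, which is one possible thing to check in the datanode logs.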
Is this in any way related to https://issues.apache.org/jira/browse/MAPREDUCE-968?

Thanks!

Kris Jirapinyo
Software Engineer
Attensity
1400 Bridge Parkway Ste 202
Redwood City, CA 94065
www.attensity.com
