distcp questions

Kris Jirapinyo Sun, 15 Aug 2010 10:35:16 -0700

Hi all,
   A few questions regarding distcp.  Note that we are trying to distcp from 
"normal" unpatched Hadoop 0.20.1 to CDH3 Hadoop, so we are starting distcp from 
the CDH3 cluster and using hftp for the source URL.


1) Our new cluster has 25 machines and 100 map slots.  When distcp is triggered, 
it seems to allocate 4 mappers per machine.  Is this normal?  The issue is that 
if, say, distcp only needs 8 mappers, I would expect it to spread them across 8 
different machines so that IO is not saturated on any one machine.  What I've 
been seeing instead is that of those 8 map tasks, 4 are assigned to one machine 
and 4 to another, rather than each of the 8 going to a different machine.
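
The placement described above can be reproduced with a toy model. This is only a sketch and not Hadoop's scheduler code: the function `assign`, its parameters, and the round-robin heartbeat order are all hypothetical. It assumes tasks are handed out when a node heartbeats, and compares a policy that fills every free slot on the heartbeating node with one that hands out a single task per heartbeat.

```python
from collections import Counter

def assign(num_tasks, nodes, slots_per_node, per_heartbeat):
    """Toy model (hypothetical, not Hadoop code): nodes heartbeat in
    round-robin order and each heartbeat receives at most `per_heartbeat`
    tasks, limited by the node's free slots."""
    placed = Counter()
    remaining = num_tasks
    while remaining > 0:
        progressed = False
        for node in range(nodes):
            free = slots_per_node - placed[node]
            give = min(per_heartbeat, free, remaining)
            if give > 0:
                placed[node] += give
                remaining -= give
                progressed = True
            if remaining == 0:
                break
        if not progressed:
            break  # every slot in the cluster is taken
    return dict(placed)

# 8 tasks, 25 nodes, 4 slots per node (the numbers from the question).
fill = assign(8, 25, 4, per_heartbeat=4)    # fill all free slots per heartbeat
spread = assign(8, 25, 4, per_heartbeat=1)  # one task per heartbeat

print(fill)         # {0: 4, 1: 4}
print(len(spread))  # 8
```

Under the first policy the 8 tasks land on two nodes, matching the 4 + 4 split reported above; under the second they land on 8 nodes. Whether the CDH3 cluster behaves like the first case depends on which scheduler is configured and how, which the message does not state.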

2) Distcp cannot get the _logs directory.  I keep getting this error:

2010-08-15 02:26:19,179 INFO org.apache.hadoop.tools.DistCp: FAIL _logs/history/mi-prod-app01.ec2.biz360.com_1273881751016_job_201005141702_51820_hadoop_com.biz360.jobs.DateFilterMerge+%2Fmaster%2F201005%2Fyou : java.io.IOException: Server returned HTTP response code: 500 for URL: http://mi-prod-app05:50075/streamFile?filename=/master/201005/youtube/_logs/history/mi-prod-app01.ec2.biz360.com_1273881751016_job_201005141702_51820_hadoop_com.biz360.jobs.DateFilterMerge+%252Fmaster%252F201005%252Fyou&ugi=hadoop,hadoop
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1313)
        at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:157)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:398)
        at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:410)
        at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:537)
        at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:306)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

Other than using the -i flag to "ignore" this, is there another workaround?  I 
tried downloading that file to the local filesystem, and that works fine, so it 
is not that the data does not exist.  Is this in any way related to 
https://issues.apache.org/jira/browse/MAPREDUCE-968?
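
One detail visible in the log is that the file name contains a literal "%2F" and a "+", and that the request URL carries "%252F" where the name has "%2F". The sketch below uses Python's standard urllib.parse only to illustrate how those characters behave under URL encoding and decoding. It does not reproduce what HftpFileSystem or the streamFile servlet actually do, and it does not establish the cause of the HTTP 500. The variable `name_tail` is just the (already truncated) end of the file name from the FAIL line.

```python
from urllib.parse import quote, unquote, unquote_plus

# Tail of the history file name as printed in the FAIL line above.
name_tail = "DateFilterMerge+%2Fmaster%2F201005%2Fyou"

# Percent-encoding the literal '%' gives '%25', so '%2F' becomes '%252F'.
# '+' is left alone here to mirror the URL shown in the log.
encoded = quote(name_tail, safe="+")
print(encoded)  # DateFilterMerge+%252Fmaster%252F201005%252Fyou

# Plain percent-decoding recovers the original name.
print(unquote(encoded))  # DateFilterMerge+%2Fmaster%2F201005%2Fyou

# Form-style decoding also turns '+' into a space, giving a different name.
print(unquote_plus(encoded))  # DateFilterMerge %2Fmaster%2F201005%2Fyou
```

If the receiving side decodes the filename parameter form-style, it would look up a name containing a space instead of "+". That is one thing worth checking, offered as an assumption and not as a diagnosis.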

Thanks!
Kris Jirapinyo
Software Engineer
Attensity
1400 Bridge Parkway Ste 202
Redwood City, CA 94065
www.attensity.com
