Massive performance problem with DistCp and -delete
---------------------------------------------------

Key: MAPREDUCE-1305
URL: https://issues.apache.org/jira/browse/MAPREDUCE-1305
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: distcp
Affects Versions: 0.20.1
Reporter: Peter Romianowski
Assignee: Peter Romianowski

*First problem*

In org.apache.hadoop.tools.DistCp#deleteNonexisting we serialize FileStatus objects when the path is all we need. The performance problem comes from org.apache.hadoop.fs.RawLocalFileSystem.RawLocalFileStatus#write, which tries to retrieve the file permissions by issuing an "ls -ld <path>", which is painfully slow.

Changed that to serialize just the Path and not the FileStatus.

*Second problem*

To delete the files we invoke the "hadoop" command line tool with the option "-rmr <path>". Again, this happens once for each file.

Changed that to dstfs.delete(path, true).
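
To illustrate the two changes, here is a minimal sketch using only the Java standard library. It is not the DistCp patch: java.nio.file stands in for the Hadoop FileSystem API, and the class and method names (PathOnlyDelete, writePaths, readPaths, deleteAll) are hypothetical. The idea is the same, though: write only the path strings, so no per-file metadata lookup is triggered, and delete in-process instead of launching an external command per file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the DistCp logic; not Hadoop code.
class PathOnlyDelete {

    // First problem: serialize only the path strings. Nothing here asks
    // the file system for permissions or other per-file metadata.
    static byte[] writePaths(List<Path> paths) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(paths.size());
            for (Path p : paths) {
                out.writeUTF(p.toString());
            }
        }
        return buf.toByteArray();
    }

    // Read back the list written by writePaths.
    static List<Path> readPaths(byte[] data) throws IOException {
        List<Path> paths = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                paths.add(Paths.get(in.readUTF()));
            }
        }
        return paths;
    }

    // Second problem: delete recursively inside this JVM, the local
    // analogue of dstfs.delete(path, true). No process is spawned.
    static void deleteRecursively(Path root) throws IOException {
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                Files.delete(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc)
                    throws IOException {
                if (exc != null) {
                    throw exc;
                }
                Files.delete(dir); // directory is empty by now
                return FileVisitResult.CONTINUE;
            }
        });
    }

    // Deletes every serialized path that still exists; returns how many
    // top-level entries were removed.
    static int deleteAll(byte[] serialized) throws IOException {
        int deleted = 0;
        for (Path p : readPaths(serialized)) {
            if (Files.exists(p)) {
                deleteRecursively(p);
                deleted++;
            }
        }
        return deleted;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("distcp-delete-sketch");
        Path nested = Files.createDirectories(tmp.resolve("stale-dir/nested"));
        Files.createFile(nested.resolve("a.txt"));
        Path staleFile = Files.createFile(tmp.resolve("stale-file.txt"));

        List<Path> toDelete = new ArrayList<>();
        toDelete.add(tmp.resolve("stale-dir"));
        toDelete.add(staleFile);

        int removed = deleteAll(writePaths(toDelete));
        System.out.println("deleted " + removed + " entries");
        Files.delete(tmp); // succeeds only if the directory is now empty
    }
}
```

In DistCp itself the corresponding call is the one named above, dstfs.delete(path, true), where the second argument requests a recursive delete.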