[ https://issues.apache.org/jira/browse/AMBARI-9990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alejandro Fernandez updated AMBARI-9990:
----------------------------------------
Attachment: AMBARI-9990.patch
> CopyFromLocal failed to copy Tez tarball to HDFS because multiple
> processes tried to copy to the same destination simultaneously
> ---------------------------------------------------------------------------------------------------------------------------------------
>
> Key: AMBARI-9990
> URL: https://issues.apache.org/jira/browse/AMBARI-9990
> Project: Ambari
> Issue Type: Bug
> Components: ambari-server
> Affects Versions: 2.0.0
> Reporter: Alejandro Fernandez
> Assignee: Alejandro Fernandez
> Fix For: 2.0.0
>
> Attachments: AMBARI-9990.patch,
> hadoop-hdfs-datanode-c6408.ambari.apache.org.log,
> hadoop-hdfs-datanode-c6410.ambari.apache.org.log,
> hadoop-hdfs-namenode-c6408.ambari.apache.org.log, hdfs-audit.log
>
>
> Pig Service Check and Hive Server 2 START ran on two different machines during
> the stack installation and failed to copy the Tez tarball to HDFS.
> I was able to reproduce this locally by calling copyFromLocal from two
> clients simultaneously. See the HDFS audit log, datanode logs on c6408 &
> c6410, and namenode log on c6408.
> The copyFromLocal command behaves as follows:
> * Try to create a temporary file <filename>._COPYING_ and write the real data
> there
> * If any exception is hit, delete the file named <filename>._COPYING_
> Thus we have the following race condition in this test:
> * Process P1 created the file "tez.tar.gz._COPYING_" and wrote data to it.
> * Process P2 fired the same copyFromLocal command and hit an exception because
> it could not get the lease.
> * P2 then deleted the file "tez.tar.gz._COPYING_".
> * P1 could not close the file "tez.tar.gz._COPYING_" since it had been deleted
> by P2. The exception would say "could not find lease for file..."
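The interleaving above can be replayed with a toy in-memory model. This is only a sketch to make the ordering concrete: FakeNameNode, LeaseError and replay_race are hypothetical names invented for illustration, not part of HDFS or Ambari, and real lease handling is considerably more involved.

```python
# Toy in-memory model of the P1/P2 interleaving described above.
# FakeNameNode and LeaseError are hypothetical; this is not the HDFS API.


class LeaseError(Exception):
    """Raised when a client cannot obtain or find the lease on a path."""


class FakeNameNode:
    def __init__(self):
        self.files = {}  # path -> name of the client holding the lease

    def create(self, path, client):
        # Only one writer may hold the lease on a given path.
        if path in self.files:
            raise LeaseError("%s: could not get the lease for %s" % (client, path))
        self.files[path] = client

    def delete(self, path):
        # Deletion does not check who holds the lease; that is the problem.
        self.files.pop(path, None)

    def close(self, path, client):
        if self.files.get(path) != client:
            raise LeaseError("%s: could not find lease for file %s" % (client, path))


def replay_race():
    """Replay the four steps in order and return a list of event strings."""
    nn = FakeNameNode()
    tmp = "tez.tar.gz._COPYING_"
    events = []

    nn.create(tmp, "P1")
    events.append("P1 created " + tmp)

    try:
        nn.create(tmp, "P2")
    except LeaseError as err:
        events.append(str(err))
        nn.delete(tmp)  # P2's cleanup removes P1's in-flight file
        events.append("P2 deleted " + tmp)

    try:
        nn.close(tmp, "P1")
        events.append("P1 closed " + tmp)
    except LeaseError as err:
        events.append(str(err))
    return events


if __name__ == "__main__":
    for line in replay_race():
        print(line)
```

The point of the model is that P2's cleanup targets a name it shares with P1, so P2's failure path destroys P1's work.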
> In general, the copyFromLocal command does not give us the synchronization
> guarantee we need.
> One solution is for each process to copy to a unique destination file name
> and then rename (mv) it to the final name. Because the mv command is
> synchronized by the namenode, at least one of them will succeed in naming
> the file.
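A minimal sketch of the unique-name idea, assuming the `hadoop` CLI is on the PATH. It is not necessarily what the attached patch does. The helper names (unique_tmp_path, copy_via_unique_name), the temporary-name scheme, and the example paths are assumptions made for illustration. Only the `hadoop fs -copyFromLocal`, `-mv`, `-rm -f` and `-test -e` subcommands are taken from the standard FileSystem shell.

```python
import os
import socket
import subprocess
import uuid


def unique_tmp_path(dest):
    """Return a per-process temporary HDFS path next to dest.

    The naming scheme (host, pid, random hex) is an assumption for this sketch.
    """
    token = "%s.%d.%s" % (socket.gethostname(), os.getpid(), uuid.uuid4().hex)
    return "%s.%s.tmp" % (dest, token)


def copy_via_unique_name(local_src, dest, run=subprocess.call):
    """Copy local_src to a unique HDFS path, then rename it to dest.

    `run` takes an argv list and returns an exit code; it can be replaced
    with a stub for dry runs. Returns True if dest exists afterwards.
    """
    tmp = unique_tmp_path(dest)

    # Each process writes to its own name, so no two writers share a
    # ._COPYING_ file and a failed writer only cleans up its own data.
    if run(["hadoop", "fs", "-copyFromLocal", local_src, tmp]) != 0:
        run(["hadoop", "fs", "-rm", "-f", tmp])
        return False

    # The rename is arbitrated by the namenode; the first caller wins.
    if run(["hadoop", "fs", "-mv", tmp, dest]) == 0:
        return True

    # The rename failed, most likely because another process got there first.
    # Remove our own temporary file and report whether dest is in place.
    run(["hadoop", "fs", "-rm", "-f", tmp])
    return run(["hadoop", "fs", "-test", "-e", dest]) == 0
```

A caller might use it as `copy_via_unique_name("/tmp/tez.tar.gz", "/apps/tez/tez.tar.gz")`, where both paths are placeholders. A process that loses the rename still returns True as long as the destination exists, since another process has already put the tarball in place.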