Alejandro Fernandez created AMBARI-9990:
-------------------------------------------
Summary: CopyFromLocal failed to copy Tez tarball to HDFS
because multiple processes tried to copy to the same destination simultaneously
Key: AMBARI-9990
URL: https://issues.apache.org/jira/browse/AMBARI-9990
Project: Ambari
Issue Type: Bug
Components: ambari-server
Affects Versions: 2.0.0
Reporter: Alejandro Fernandez
Assignee: Alejandro Fernandez
Fix For: 2.0.0
Pig Service Check and Hive Server 2 START ran on 2 different machines during
the stack installation and failed to copy the Tez tarball to HDFS.
I was able to reproduce this locally by calling CopyFromLocal from two clients
simultaneously. See the HDFS audit log, datanode logs on c6408 & c6410, and
namenode log on c6410.
The copyFromLocal command's behavior is:
* Try to create a temporary file <filename>._COPYING_ and write the real data
there
* If it hits any exception, delete the file named <filename>._COPYING_
Thus we have the following race condition in this test:
* Process P1 created the file "tez.tar.gz._COPYING_" and wrote data to it.
* Process P2 issued the same copyFromLocal command and hit an exception because
it could not get the lease.
* P2 then deleted the file "tez.tar.gz._COPYING_".
* P1 could not close the file "tez.tar.gz._COPYING_" because P2 had deleted it.
The exception would say "could not find lease for file..."
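
The sequence above can be replayed with a small toy model. This is only an
illustrative sketch, not HDFS code: ToyNamespace, LeaseError and replay_race
are hypothetical names, and the model assumes that a delete does not check who
holds the lease on the file.

```python
class LeaseError(Exception):
    """Raised when a writer cannot obtain, or no longer holds, a lease."""


class ToyNamespace:
    """Hypothetical in-memory stand-in for the namenode: path -> lease holder."""

    def __init__(self):
        self.leases = {}

    def create(self, path, client):
        # Only one writer may hold the lease on a path at a time.
        if path in self.leases:
            raise LeaseError(f"{client} could not get the lease for {path}")
        self.leases[path] = client

    def delete(self, path):
        # Assumption: delete ignores the lease holder, which enables the race.
        self.leases.pop(path, None)

    def close(self, path, client):
        if self.leases.get(path) != client:
            raise LeaseError(f"could not find lease for file {path}")
        del self.leases[path]


def replay_race():
    """Replay the P1/P2 interleaving and return the observed events."""
    ns = ToyNamespace()
    tmp = "tez.tar.gz._COPYING_"
    events = []

    ns.create(tmp, "P1")
    events.append("P1 created the temporary file")

    try:
        ns.create(tmp, "P2")
    except LeaseError as exc:
        events.append(f"P2 failed: {exc}")
        ns.delete(tmp)  # P2 cleans up a file it never owned

    try:
        ns.close(tmp, "P1")
    except LeaseError as exc:
        events.append(f"P1 failed: {exc}")

    return events


if __name__ == "__main__":
    for event in replay_race():
        print(event)
```

In this model both processes fail: P2 because it cannot get the lease, and P1
because its temporary file disappears before it can close it.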
In general, the "copyFromLocal" command gives no synchronization guarantee
when several clients copy to the same destination at the same time.
One solution is for each process to copy to a unique destination file name and
then rename the file to the final name with mv. Because the mv command is
synchronized by the namenode, at least one of them will succeed in naming the
file.
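
A minimal sketch of that idea follows. It is not the actual Ambari fix: the
helper names, the ".tmp" naming scheme and the example paths are assumptions
made for illustration. It shells out to the standard "hdfs dfs" subcommands
-copyFromLocal, -mv, -rm and -test, and it assumes that -mv returns non-zero
when the destination already exists.

```python
import subprocess
import uuid


def build_commands(local_path, dest_path, token=None):
    """Return (temp_path, copy_cmd, move_cmd) for a copy via a unique name."""
    token = token or uuid.uuid4().hex  # unique per invocation
    temp_path = f"{dest_path}.{token}.tmp"  # hypothetical naming scheme
    copy_cmd = ["hdfs", "dfs", "-copyFromLocal", local_path, temp_path]
    move_cmd = ["hdfs", "dfs", "-mv", temp_path, dest_path]
    return temp_path, copy_cmd, move_cmd


def copy_via_unique_name(local_path, dest_path, run=subprocess.call, token=None):
    """Copy to a unique temporary name, then rename to the final name.

    `run` takes an argument list and returns an exit code; it is injectable
    so the logic can be exercised without a cluster.
    """
    temp_path, copy_cmd, move_cmd = build_commands(local_path, dest_path, token)

    # Each process writes to its own file, so the copies cannot collide.
    if run(copy_cmd) != 0:
        return False

    # The rename is serialized by the namenode; the first caller wins.
    if run(move_cmd) == 0:
        return True

    # The rename failed, most likely because another process won. Remove our
    # temporary file and report success only if the destination exists.
    run(["hdfs", "dfs", "-rm", "-f", temp_path])
    return run(["hdfs", "dfs", "-test", "-e", dest_path]) == 0


if __name__ == "__main__":
    # Example paths are hypothetical.
    ok = copy_via_unique_name("/tmp/tez.tar.gz", "/apps/tez/tez.tar.gz")
    print("copied" if ok else "failed")
```

A process that loses the rename still reports success as long as the
destination file exists, since another process has already put the tarball in
place.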