Re: Application failure in yarn-cluster mode
Hi,

I have been able to reproduce this problem in our dev environment, and I am now fairly sure that it is indeed a bug. I have therefore created a JIRA issue, SPARK-3967 (https://issues.apache.org/jira/browse/SPARK-3967). The bug is triggered when yarn.nodemanager.local-dirs (not hadoop.tmp.dir, as I said below) is set to a comma-separated list of directories which are located on different disks/partitions.

Regards,
Christophe.

On 14/10/2014 09:37, Christophe Préaud wrote:
> [full message below]
>
> On 10/10/2014 18:24, Christophe Préaud wrote:
>> [full message below]
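As an aside, a quick way to check whether two of the configured directories really live on different partitions is to compare the file stores that java.nio reports for them. This is only an illustrative sketch, not part of Spark or of the patch; the helper name and example paths are mine, it assumes Java 7+, and both paths must exist:

import java.nio.file.{Files, Paths}

// Two directories are on the same partition iff java.nio reports the same
// underlying FileStore for both (on Linux the FileStore wraps the device).
def samePartition(a: String, b: String): Boolean =
  Files.getFileStore(Paths.get(a)) == Files.getFileStore(Paths.get(b))

// On the cluster described above, this should print "false":
// println(samePartition("/d1/yarn/local", "/d2/yarn/local"))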
Re: Application failure in yarn-cluster mode
Hi,

Sorry to insist, but I really feel that the problem described below is a bug in Spark. Can anybody confirm whether it is a bug, or a (configuration?) problem on my side?

Thanks,
Christophe.

On 10/10/2014 18:24, Christophe Préaud wrote:
> [full message below]
Application failure in yarn-cluster mode
Hi,

After updating from spark-1.0.0 to spark-1.1.0, my Spark applications fail most of the time (but not always) in yarn-cluster mode (but not in yarn-client mode).

Here is my configuration:
* spark-1.1.0
* hadoop-2.2.0

And the hadoop.tmp.dir definition in the Hadoop core-site.xml file (each directory is on its own partition):

<property>
  <name>hadoop.tmp.dir</name>
  <value>file:/d1/yarn/local,file:/d2/yarn/local,file:/d3/yarn/local,file:/d4/yarn/local,file:/d5/yarn/local,file:/d6/yarn/local,file:/d7/yarn/local</value>
</property>

After investigating, it turns out that the problem occurs when the executor fetches a jar file: the jar is downloaded into a temporary file, always in /d1/yarn/local (see the hadoop.tmp.dir definition above), and then moved to one of the temporary directories defined in hadoop.tmp.dir:

* if that directory is the same as the one holding the temporary file (i.e. /d1/yarn/local), the application continues normally;
* if it is another one (i.e. /d2/yarn/local, /d3/yarn/local, ...), it fails with the following error:

14/10/10 14:33:51 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 0)
java.io.FileNotFoundException: ./logReader-1.0.10.jar (Permission denied)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
        at com.google.common.io.Files$FileByteSink.openStream(Files.java:223)
        at com.google.common.io.Files$FileByteSink.openStream(Files.java:211)
        at com.google.common.io.ByteSource.copyTo(ByteSource.java:203)
        at com.google.common.io.Files.copy(Files.java:436)
        at com.google.common.io.Files.move(Files.java:651)
        at org.apache.spark.util.Utils$.fetchFile(Utils.scala:440)
        at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:325)
        at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:323)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
        at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
        at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
        at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
        at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:323)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:158)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

I have no idea why the move fails when the source and target files are not on the same partition; for the moment I have worked around the problem with the attached patch (i.e. I ensure that the temp file and the moved file are always on the same partition).

Any thoughts about this problem?

Thanks!
Christophe.
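Note that the stack trace itself hints at why only the cross-partition case fails. In essence, Guava's Files.move behaves like the sketch below (simplified from my reading of the Guava source, so treat it as an approximation): a plain rename is attempted first, and on Linux rename(2) cannot cross a partition boundary (it fails with EXDEV), so Guava falls back to copy-and-delete. The copy opens the target with a FileOutputStream, which is exactly the frame that fails with "Permission denied" above.

import java.io.{File, IOException}
import com.google.common.io.{Files => GuavaFiles}

// Approximate behaviour of com.google.common.io.Files.move(from, to):
def moveLikeGuava(from: File, to: File): Unit = {
  if (!from.renameTo(to)) {     // same-partition case: rename succeeds, done
    GuavaFiles.copy(from, to)   // cross-partition fallback: opens `to` for
                                // writing -> "Permission denied" happens here
    if (!from.delete()) {
      throw new IOException("Unable to delete " + from)
    }
  }
}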
--- core/src/main/scala/org/apache/spark/util/Utils.scala.orig	2014-09-03 08:00:33.000000000 +0200
+++ core/src/main/scala/org/apache/spark/util/Utils.scala	2014-10-10 17:51:59.000000000 +0200
@@ -349,8 +349,7 @@
    */
   def fetchFile(url: String, targetDir: File, conf: SparkConf, securityMgr: SecurityManager) {
     val filename = url.split("/").last
-    val tempDir = getLocalDir(conf)
-    val tempFile = File.createTempFile("fetchFileTemp", null, new File(tempDir))
+    val tempFile = File.createTempFile("fetchFileTemp", null, new File(targetDir.getAbsolutePath))
     val targetFile = new File(targetDir, filename)
     val uri = new URI(url)
     val fileOverwrite = conf.getBoolean("spark.files.overwrite", false)
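The patch sidesteps the issue by creating the temporary file directly in targetDir, so the final move is always a same-partition rename. On Java 7+, an alternative idea (a sketch only, not what the patch above does) would be to keep a shared temp directory but perform the move with java.nio, falling back to an explicit copy-then-delete when an atomic move is impossible across file stores. Spark at the time still supported Java 6, which may explain the Guava-based implementation.

import java.nio.file.{AtomicMoveNotSupportedException, Files, Path, StandardCopyOption}

// Sketch only: move `from` to `to`, handling the cross-partition case.
def moveAcrossPartitions(from: Path, to: Path): Unit =
  try {
    // Succeeds as a cheap rename when both paths share a file store.
    Files.move(from, to, StandardCopyOption.ATOMIC_MOVE)
  } catch {
    case _: AtomicMoveNotSupportedException =>
      // Different file stores: copy the bytes, then remove the source.
      Files.copy(from, to, StandardCopyOption.REPLACE_EXISTING)
      Files.delete(from)
  }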