[
https://issues.apache.org/jira/browse/SPARK-14638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Younos Aboulnaga updated SPARK-14638:
-------------------------------------
Description:
[Edit]
Editing the description because I have not been able to comment on JIRA since
yesterday due to the anti-spam measure. We got to the bottom of the problem, and
it is not a Spark issue. tl;dr: it is the kind of problem you would want to use
containers to avoid, and we are going to be using Mesos moving forward.
Explaining the details is going to be tough because there were too many moving
parts. There is only one lesson I can share about Spark: when you build Spark
with -Phive it [makes a copy of the Hadoop configurations into the Spark
configuration folder|https://github.com/apache/spark/blob/v1.6.1/docs/sql-programming-guide.md#hive-tables],
but those configurations are overridden by the configurations [inherited through
HADOOP_CONF_DIR|https://github.com/apache/spark/blob/v1.6.1/docs/configuration.md#inheriting-hadoop-cluster-configuration].
I might have gotten this the other way around, but make sure you use one or the
other, and do so consistently across your cluster.
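As an illustration, here is a small diagnostic one could run (a hypothetical
sketch, not something taken from our cluster; run it with the same classpath the
Spark processes use, e.g. via spark-submit): it lists every copy of the common
Hadoop/Hive configuration files visible to the class loader, so you can tell
whether both the copy under the Spark configuration folder and the one under
HADOOP_CONF_DIR are being picked up.
{code}
// Hypothetical diagnostic: list every copy of the common Hadoop/Hive config
// files that the class loader can see, to detect the duplication described above.
import scala.collection.JavaConverters._

object FindDuplicateConfigs {
  def main(args: Array[String]): Unit = {
    val names = Seq("core-site.xml", "hdfs-site.xml", "hive-site.xml")
    val loader = Thread.currentThread().getContextClassLoader
    for (name <- names) {
      val copies = loader.getResources(name).asScala.toList
      println(s"$name: found ${copies.size} copy/copies")
      copies.foreach(url => println(s"  $url"))
    }
  }
}
{code}
If this prints more than one location for the same file, make sure the copies
agree or drop one of them.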
I am going to close the issue as Information Given, because Sean has been very
generous with his time and his comments helped us pinpoint the root cause of the
problem. Thank you very much, Sean.
[End Edit]
We have started to frequently see Spark apps fail with a NoClassDefFoundError
even though the dependency had been added to the class loader just before the
error was thrown. The [Executor.run method adds the
JAR|https://github.com/apache/spark/blob/v1.6.1/core/src/main/scala/org/apache/spark/executor/Executor.scala#L193]
containing the class, but a NoClassDefFoundError is thrown shortly afterwards.
We see log messages from
[updateDependencies|https://github.com/apache/spark/blob/v1.6.1/core/src/main/scala/org/apache/spark/executor/Executor.scala#L386]
indicating that the JAR is fetched and added to the class loader. Upon
inspection of the worker dir, the JAR is there, it is not corrupted, and it
contains the class that could not be found in the class loader.
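For context, here is a minimal standalone sketch of the mechanism involved
(simplified, and not the actual Executor code; the object and argument names are
just for illustration): a JAR's URL is added to a URLClassLoader and a class is
then resolved from it. In our failing apps the equivalent lookup throws
NoClassDefFoundError even though the JAR had just been added.
{code}
// Simplified sketch of the mechanism, not the actual Spark implementation:
// add a freshly fetched JAR to a URL class loader, then resolve a class from it.
import java.io.File
import java.net.URLClassLoader

object AddJarSketch {
  def main(args: Array[String]): Unit = {
    val jarPath = args(0)    // path to a fetched dependency JAR
    val className = args(1)  // a class contained in that JAR
    val loader = new URLClassLoader(
      Array(new File(jarPath).toURI.toURL), getClass.getClassLoader)
    // Executor#updateDependencies does the equivalent of this addURL step;
    // in our case the subsequent lookup still fails with NoClassDefFoundError.
    val clazz = Class.forName(className, true, loader)
    println(s"Loaded ${clazz.getName} from $jarPath")
  }
}
{code}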
We first saw this when we started writing streaming apps, and we thought it was
something specific to streaming apps. However, that was wrong, as the same
problem happened with several batch apps.
We first saw this on a Standalone cluster, and we thought it might be caused by
the lack of a resource manager. We have since installed Mesos, and the problem
still happens.
I tried to create a POC Spark app that demonstrates the problem, but I couldn't
reliably reproduce it. The problem would still happen in other apps, but it
didn't happen in the POC app even though I made it structurally the same as any
other app we run. The problem seems to be environmental, especially because we
found a workaround for it.
The workaround we found is setting SPARK_CLASSPATH *on the executor nodes* to a
local copy of the dependency. The problem still happens if we set
'spark.executor.extraClassPath' or 'spark.driver.extraClassPath', or set
SPARK_CLASSPATH on the driver node. However, if SPARK_CLASSPATH is set on the
executor nodes, the problem doesn't happen because the JAR doesn't need to be
added to the class loader by Executor#updateDependencies.
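To see on which hosts the class is and is not visible, we run a small probe job
like the following (only a sketch; the object name is made up, and the class name
is the one from the failure shown further down). It checks visibility from the
executor thread's context class loader without triggering static initialization.
{code}
// Hypothetical probe: from inside tasks, check whether the class is visible to
// the executor thread's context class loader on each host.
import scala.util.Try
import org.apache.spark.{SparkConf, SparkContext}

object ClassVisibilityProbe {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("class-visibility-probe"))
    val className = "org.apache.hadoop.hbase.protobuf.ProtobufUtil"
    val report = sc.parallelize(1 to 100, 100).map { _ =>
      val loader = Thread.currentThread().getContextClassLoader
      // initialize = false, so we only test visibility, not static initializers
      val visible = Try(Class.forName(className, false, loader)).isSuccess
      (java.net.InetAddress.getLocalHost.getHostName, visible)
    }.distinct().collect()
    report.foreach { case (host, ok) => println(s"$host -> class visible: $ok") }
    sc.stop()
  }
}
{code}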
Other symptoms of the problem are the following:
1) Even though there is a 'log4j.properties' on the
'spark.executor.extraClassPath', the first line of the worker's stderr says
"Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties". The log4j.properties file that is
shipped with the job is completely ignored (see the sketch after this list).
2) Any configuration files on 'spark.executor.extraClassPath' are ignored as
well. I am mentioning this because log4j.properties is loaded very early on, in
a static call, which might sway the troubleshooting in the wrong direction.
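One way to confirm symptom 1) is to ask the executor-side class loader, from
inside a task, which log4j.properties it would actually serve (again only a
sketch with a made-up object name):
{code}
// Hypothetical check for symptom 1): resolve log4j.properties from inside tasks.
// If this prints "null" (or a location other than the file shipped with the job),
// the shipped log4j.properties is not reachable and Spark falls back to
// org/apache/spark/log4j-defaults.properties.
import org.apache.spark.{SparkConf, SparkContext}

object WhichLog4j {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("which-log4j"))
    val locations = sc.parallelize(1 to 10, 10).map { _ =>
      val loader = Thread.currentThread().getContextClassLoader
      String.valueOf(loader.getResource("log4j.properties"))
    }.distinct().collect()
    locations.foreach(println)
    sc.stop()
  }
}
{code}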
Here is the specific example in our case:
> grep NoClassDef workers/app-20160414111328-0043/0/stderr
Caused by: java.lang.NoClassDefFoundError: Could not initialize class
org.apache.hadoop.hbase.protobuf.ProtobufUtil
Caused by: java.lang.NoClassDefFoundError: Could not initialize class
org.apache.hadoop.hbase.protobuf.ProtobufUtil
.. SEVERAL ATTEMPTS ...
Caused by: java.lang.NoClassDefFoundError: Could not initialize class
org.apache.hadoop.hbase.protobuf.ProtobufUtil
Yet, in the same application's worker dir:
> for j in workers/app-20160414111328-0043/0/*.jar ; do jar tf $j | grep ProtobufUtil ; done;
org/apache/hadoop/hbase/protobuf/ProtobufUtil$1.class
org/apache/hadoop/hbase/protobuf/ProtobufUtil.class
There are other examples, especially of configurations not being found. I think
SPARK-12279 may also have the same root cause.
We have been seeing this in several of our clusters, and several engineers have
spent days looking into why their applications suffer from it. We rebuilt our
infrastructure (always on AWS EC2 nodes) and tested many hypotheses, including
some that were nonsensical, and we still can't find anything that reliably
reproduces the problem. The only reliable piece of information is that setting
SPARK_CLASSPATH *on the executor nodes* prevents the problem from happening,
because then the dependencies are included in the -cp parameter of the java
command running the CoarseGrainedExecutorBackend.
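To verify that the workaround really works through the launch classpath, the
executor JVM's -cp can be dumped from inside a task (a sketch along the same
lines as the probes above, with a made-up object name):
{code}
// Hypothetical check: print the classpath of the executor JVMs. With
// SPARK_CLASSPATH set on the executor nodes, the dependency JAR appears here.
import org.apache.spark.{SparkConf, SparkContext}

object ExecutorClasspathDump {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("executor-classpath-dump"))
    val classpaths = sc.parallelize(1 to 10, 10).map { _ =>
      System.getProperty("java.class.path")
    }.distinct().collect()
    classpaths.foreach(println)
    sc.stop()
  }
}
{code}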
We would appreciate it if someone more knowledgeable about Spark internals could
take a look; we can help by providing as many details as possible.
> Spark task does not have access to a dependency in the classloader of the
> executor thread
> -----------------------------------------------------------------------------------------
>
> Key: SPARK-14638
> URL: https://issues.apache.org/jira/browse/SPARK-14638
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.2.1, 1.4.1, 1.6.0, 1.6.1
> Environment: > uname -a
> Linux HOSTNAME 3.13.0-74-generic #118-Ubuntu SMP Thu Dec 17 22:52:10 UTC 2015
> x86_64 x86_64 x86_64 GNU/Linux
> > java -version
> java version "1.8.0_77"
> Java(TM) SE Runtime Environment (build 1.8.0_77-b03)
> Java HotSpot(TM) 64-Bit Server VM (build 25.77-b03, mixed mode)
> Reporter: Younos Aboulnaga