[
https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14021465#comment-14021465
]
Patrick Wendell edited comment on SPARK-2075 at 6/8/14 10:13 PM:
-----------------------------------------------------------------
Okay I did some more digging. I think the issue is that the anonymous classes
used by saveAsTextFile are not guaranteed to be compiled to the same names every
time the Scala source is compiled. In the Hadoop 1 build they end up with
shortened (numbered) names, whereas in the Hadoop 2 build they use the longer,
method-derived names. Strangely, saveAsTextFile seems to be the only affected
function. I confirmed this by diffing the contents of the Hadoop 1 and Hadoop 2 jars:
{code}
$ jar tvf spark-1.0.0-bin-hadoop1/lib/spark-assembly-1.0.0-hadoop*.jar | grep "rdd\/RDD\\$" | awk '{ print $8;}' | sort > hadoop1
$ jar tvf spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop*.jar | grep "rdd\/RDD\\$" | awk '{ print $8;}' | sort > hadoop2
$ diff hadoop1 hadoop2
23a24
> org/apache/spark/rdd/RDD$$anonfun$28$$anonfun$apply$13.class
27,29d27
< org/apache/spark/rdd/RDD$$anonfun$30$$anonfun$apply$13.class
< org/apache/spark/rdd/RDD$$anonfun$30.class
< org/apache/spark/rdd/RDD$$anonfun$31.class
90a89,90
> org/apache/spark/rdd/RDD$$anonfun$saveAsTextFile$1.class
> org/apache/spark/rdd/RDD$$anonfun$saveAsTextFile$2.class
{code}
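The comparison step can also be scripted end to end. Here is a minimal, self-contained sketch of the diffing logic, with illustrative file contents standing in for the real `jar tvf ... | awk | sort` listings:

```shell
# Sketch of the comparison above: comm -3 prints lines unique to either file,
# i.e. classes present in only one build. The printf contents below stand in
# for the real jar listings and are illustrative only.
printf '%s\n' 'org/apache/spark/rdd/RDD$$anonfun$30.class' \
              'org/apache/spark/rdd/RDD$$anonfun$31.class' | sort > hadoop1
printf '%s\n' 'org/apache/spark/rdd/RDD$$anonfun$saveAsTextFile$1.class' \
              'org/apache/spark/rdd/RDD$$anonfun$saveAsTextFile$2.class' | sort > hadoop2
comm -3 hadoop1 hadoop2   # column 1: only in hadoop1, column 2: only in hadoop2
```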
I'm still a bit confused, though, because I didn't think these anonymous classes
would show up in the bytecode of the user application, so the renaming shouldn't
matter (which is presumably why Scala allows it). The only entries in the public
binary interface are the saveAsTextFile methods themselves:
{code}
javap RDD | grep saveAsText
  public void saveAsTextFile(java.lang.String);
  public void saveAsTextFile(java.lang.String, java.lang.Class<? extends org.apache.hadoop.io.compress.CompressionCodec>);
{code}
[~paulrbrown] could you explain how you are bundling and submitting your
application to the Spark cluster?
> Hadoop1 distribution of 1.0.0 does not contain classes expected by the Maven
> 1.0.0 artifact
> -------------------------------------------------------------------------------------------
>
> Key: SPARK-2075
> URL: https://issues.apache.org/jira/browse/SPARK-2075
> Project: Spark
> Issue Type: Bug
> Components: Build, Spark Core
> Affects Versions: 1.0.0
> Reporter: Paul R. Brown
>
> Running a job built against the Maven dep for 1.0.0 and the hadoop1
> distribution produces:
> {code}
> java.lang.ClassNotFoundException:
> org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1
> {code}
> Here's what's in the Maven dep as of 1.0.0:
> {code}
> jar tvf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar | grep 'rdd/RDD' | grep 'saveAs'
>   1519 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>   1560 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
> {code}
> And here's what's in the hadoop1 distribution:
> {code}
> jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar | grep 'rdd/RDD' | grep 'saveAs'
> {code}
> I.e., it's not there. It is in the hadoop2 distribution:
> {code}
> jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar | grep 'rdd/RDD' | grep 'saveAs'
>   1519 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class
>   1560 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class
> {code}
--
This message was sent by Atlassian JIRA
(v6.2#6252)