Github user jkleckner commented on the pull request:

    https://github.com/apache/spark/pull/4780#issuecomment-76767297
  
    > You're definitely sure CA is in your app JAR? I ask just because you 
mention it was marked provided above, though also in your assembly. Worth 
double-checking.
    
    Oh yes, ```jar tf``` on the assembly shows it is completely there (sbt doesn't minimize), along with all of fastutil...
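    
    For reference, here is roughly that check as a tiny Scala program instead of eyeballing the ```jar tf``` output (the assembly path is hypothetical; the class name is the one from the error below):
    
    ```scala
    import java.util.jar.JarFile
    import scala.collection.JavaConverters._

    // Programmatic equivalent of `jar tf assembly.jar | grep Long2LongOpenHashMap`:
    // confirm the fastutil class really is packaged in our assembly.
    object CheckAssembly extends App {
      val jar = new JarFile("/path/to/our-assembly.jar") // hypothetical path
      val wanted = "it/unimi/dsi/fastutil/longs/Long2LongOpenHashMap.class"
      println(jar.entries().asScala.exists(_.getName == wanted))
    }
    ```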
    
    > Are you using Spark 1.3? because it now occurs to me that 
spark.driver.userClassPathFirst and spark.executor.userClassPathFirst don't 
exist before 1.3. The equivalents are the less-well-named 
spark.files.userClassPathFirst and spark.yarn.user.classpath.first (for YARN 
mode). Can you try those instead if you're using <= 1.2?
    
    Ouch.  I got bitten by the inconsistencies and bugs among 1.2.0 local vs. YARN and 1.3.  I am using 1.2.0a on Amazon EMR.  My inner-loop testing was in local mode, and this bug https://issues.apache.org/jira/browse/SPARK-4739 indicates that the option won't work in local mode.  On the presumption that unused options can't hurt, and to cover the bases on both the old and new option names, I changed my script to use:
    ```bash
     --conf spark.files.userClassPathFirst=true \
     --conf spark.driver.userClassPathFirst=true \
     --conf spark.executor.userClassPathFirst=true \
     --conf spark.yarn.user.classpath.first=true \
     --conf spark.files.user.classpath.first=true \
    ```
    
    So I retried the case where CA and fastutil are complete (and not renamed or shaded) in our user assembly.  I can confirm that both local and YARN get the class-not-found exception ```java.lang.NoClassDefFoundError: it/unimi/dsi/fastutil/longs/Long2LongOpenHashMap``` even with the user-class-path-first settings.  Can you confirm that the correct behavior would be for the CA and fastutil classes to be found in our assembly when using user-jar-first, even if some of CA is contained in the Spark jar?  Or does this require shading?
    
    > So, I would close this PR in the sense that this isn't the fix. I would 
leave SPARK-6029 open though until there's a resolution to the issue one way or 
the other.
    
    First, please comment on the following.  Philosophically, it is not clear that the ```minimizeJar``` option used to create the Spark assembly is consistent with expecting people to mark Spark as ```provided```.  If, as in this case, a downstream jar requires a broader set of classes, that creates the sort of conflict we are seeing.  In some sense, the dependent package is in fact not ```provided```, at least not fully.  I'm surprised this hasn't come up in more cases.
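    
    For concreteness, the setup in question looks roughly like this sbt fragment (coordinates are real, versions are illustrative rather than our exact build):
    
    ```scala
    // Illustrative build.sbt fragment, not our exact build.
    // Spark is marked "provided" and left out of our assembly, while CA
    // (stream-lib) and fastutil are bundled, on the assumption that the
    // cluster-side Spark assembly "provides" everything Spark itself pulls in.
    libraryDependencies ++= Seq(
      "org.apache.spark"          %% "spark-core" % "1.2.0" % "provided",
      "com.clearspring.analytics" %  "stream"     % "2.7.0",
      "it.unimi.dsi"              %  "fastutil"   % "6.5.7"
    )
    ```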
    
    > You're effectively shading CA (and not fastutil); you should also be able 
to achieve that through your build rather than bother with source, though, I 
don't know how that works in SBT. (minimizeJar is a function of Maven's shading 
plugin.)
    > 
    > fastutil-in-Spark isn't the issue per se, since indeed Spark doesn't have 
it! what it does have is CA.
    
    Yes, there is a rename feature in sbt which should accomplish the shading without manually renaming packages in source; there does not appear to be a ```minimizeJar``` equivalent.  I will give that renaming a try.  Do you think that Spark should be the one to rename CA if it is cherry-picking only the HyperLogLog class and nothing else?  If Spark did the renaming, other dependent analytics packages could declare a dependence on CA without conflicts, instead of each of them having to work around the collision.
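    
    Concretely, I plan to try something along the lines of the following shade rule (a sketch only, assuming an sbt-assembly version with shading support; the rename target package is arbitrary):
    
    ```scala
    // build.sbt sketch, assuming the sbt-assembly plugin with shading support.
    // Rename the CA packages inside our assembly so they can no longer collide
    // with the copy of CA carried by the Spark assembly.
    assemblyShadeRules in assembly := Seq(
      ShadeRule.rename("com.clearspring.analytics.**" -> "shaded.clearspring.analytics.@1").inAll
    )
    ```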
    
    
    Your pointers to the other JIRA bugs do seem relevant and appear to share some mechanisms in common:
    
    
http://apache-spark-user-list.1001560.n3.nabble.com/java-serialization-errors-with-spark-files-userClassPathFirst-true-tp5832p5879.html
    
    > * for a class that exists in both my jar and spark kernel it tries to use 
userClassLoader and ends up with a NoClassDefFoundError. the class is 
org.apache.avro.mapred.AvroInputFormat and the NoClassDefFoundError is for 
org.apache.hadoop.mapred.FileInputFormat (which the parentClassLoader is 
responsible for since it is not in my jar). i currently catch this 
NoClassDefFoundError and call parentClassLoader.loadClass but thats clearly not 
a solution since it loads the wrong version.
    
    
    
http://apache-spark-user-list.1001560.n3.nabble.com/java-serialization-errors-with-spark-files-userClassPathFirst-true-tp5832p5875.html
    > ok i think the issue is visibility: a classloader can see all classes 
loaded by its parent classloader. but userClassLoader does not have a parent 
classloader, so its not able to "see" any classes that parentLoader is 
responsible for. in my case userClassLoader is trying to get AvroInputFormat 
which probably somewhere statically references FileInputFormat, which is 
invisible to userClassLoader.
    
    These two may explain why putting our jar first still doesn't work without 
shading.  I confess that I am not an expert on class loaders...
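    
    To check my understanding of that visibility argument, here is a small sketch of what I believe is happening (class names follow the quoted posts; the jar path is hypothetical):
    
    ```scala
    import java.net.{URL, URLClassLoader}

    object VisibilityDemo extends App {
      // A user class loader with no application parent (only the bootstrap
      // loader behind it): it sees JDK classes and the user jar, but nothing
      // that the normal parent loader is responsible for.
      val userJar = new URL("file:/path/to/user-assembly.jar") // hypothetical path
      val userClassLoader = new URLClassLoader(Array(userJar), null)

      try {
        // AvroInputFormat lives in the user jar, but defining it forces
        // resolution of its superclass org.apache.hadoop.mapred.FileInputFormat,
        // which only the parent loader can see, so linkage fails with
        // NoClassDefFoundError even though the requested class is present.
        userClassLoader.loadClass("org.apache.avro.mapred.AvroInputFormat")
      } catch {
        case e: NoClassDefFoundError =>
          println(s"invisible to userClassLoader: ${e.getMessage}")
      }
    }
    ```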


