Github user jkleckner commented on the pull request:
https://github.com/apache/spark/pull/4780#issuecomment-76767297
> You're definitely sure CA is in your app JAR? I ask just because you
mention it was marked provided above, though also in your assembly. Worth
double-checking.
Oh yes, ```jar tf``` on the assembly shows that it is completely there (sbt
doesn't minimize), along with all of fastutil.
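A check like this can be scripted rather than eyeballed. The following is only a minimal sketch: the function name `jar_listing_has_class` is hypothetical, and it assumes the jar listing (for example the output of `jar tf`) is piped in on stdin, one entry per line.

```bash
# Hypothetical helper: succeed if the fully qualified class name in $1
# appears as an entry in a jar listing read from stdin.
jar_listing_has_class() {
  # Turn dots into slashes and append ".class" to get the jar entry path,
  # then require an exact, whole-line, fixed-string match.
  grep -qxF -- "$(printf '%s' "$1" | tr '.' '/').class"
}

# Usage (the assembly path is a placeholder):
#   jar tf path/to/assembly.jar \
#     | jar_listing_has_class it.unimi.dsi.fastutil.longs.Long2LongOpenHashMap \
#     && echo "present in assembly"
```

This only shows that the class file is packaged in the assembly. It says nothing about which class loader will serve it at runtime.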
> Are you using Spark 1.3? because it now occurs to me that
spark.driver.userClassPathFirst and spark.executor.userClassPathFirst don't
exist before 1.3. The equivalents are the less-well-named
spark.files.userClassPathFirst and spark.yarn.user.classpath.first (for YARN
mode). Can you try those instead if you're using <= 1.2?
Ouch. I got bitten by the inconsistencies and bugs among 1.2.0 (local vs YARN)
and 1.3. I am using 1.2.0a on Amazon EMR. My inner-loop testing was in local
mode, and https://issues.apache.org/jira/browse/SPARK-4739 indicates that the
setting won't work in local mode. On the presumption that unused options can't
hurt, and to cover the bases on both old and new option names, I changed my
script to use:
```bash
--conf spark.files.userClassPathFirst=true \
--conf spark.driver.userClassPathFirst=true \
--conf spark.executor.userClassPathFirst=true \
--conf spark.yarn.user.classpath.first=true \
--conf spark.files.user.classpath.first=true \
```
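An alternative to passing every variant is to choose the option names by Spark version, following the split described in the quoted comment (the `spark.driver`/`spark.executor` names from 1.3 on, the `spark.files`/`spark.yarn` names for 1.2 and earlier). This is a rough sketch under that assumption only: the function name `user_classpath_first_flags` is hypothetical, and it makes no claim about whether the options actually take effect in a given deploy mode.

```bash
# Hypothetical helper: print the user-classpath-first --conf flags for a
# Spark version given as MAJOR.MINOR[.PATCH] (e.g. 1.2.0 or 1.3.0).
user_classpath_first_flags() {
  ucpf_major="${1%%.*}"        # text before the first dot
  ucpf_rest="${1#*.}"          # text after the first dot
  ucpf_minor="${ucpf_rest%%.*}"
  if [ "$ucpf_major" -gt 1 ] || { [ "$ucpf_major" -eq 1 ] && [ "$ucpf_minor" -ge 3 ]; }; then
    # 1.3 and later: separate driver and executor options.
    echo "--conf spark.driver.userClassPathFirst=true --conf spark.executor.userClassPathFirst=true"
  else
    # 1.2 and earlier: the older names; the second one is for YARN mode.
    echo "--conf spark.files.userClassPathFirst=true --conf spark.yarn.user.classpath.first=true"
  fi
}

# Usage (left unquoted on purpose so the flags split into separate words):
#   spark-submit $(user_classpath_first_flags 1.2.0) ...
```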
So I retried the case where CA and fastutil are complete (and not renamed
or shaded) in our user assembly. I can confirm that both local and YARN get
the class not found exception ```java.lang.NoClassDefFoundError:
it/unimi/dsi/fastutil/longs/Long2LongOpenHashMap``` even with the user class
path first settings. Can you confirm that the correct behavior would be for
the CA and fastutil classes to be found in our assembly when using user jar
first even if some of CA is contained in the Spark jar? Or does this
require shading?
> So, I would close this PR in the sense that this isn't the fix. I would
leave SPARK-6029 open though until there's a resolution to the issue one way or
the other.
First, please comment on the following. Philosophically, it is not clear
that the ```minimizeJar``` option used to create the Spark assembly is
consistent with expecting people to mark it as ```provided```. If, as in this
case, a broader set of classes is required by a downstream jar, it creates the
sort of conflict we are seeing. In some sense, the dependent package is, in
fact, not ```provided```, at least not fully. I'm surprised this hasn't
happened in more cases.
> You're effectively shading CA (and not fastutil); you should also be able
to achieve that through your build rather than bother with source, though, I
don't know how that works in SBT. (minimizeJar is a function of Maven's shading
plugin.)
>
> fastutil-in-Spark isn't the issue per se, since indeed Spark doesn't have
it! what it does have is CA.
Yes, there is a rename feature in sbt which should accomplish the shading
without manually renaming. There does not appear to be a ```minimizeJar```
equivalent. I will give that renaming a try. Do you think that Spark should
be the one to rename CA if it is cherry picking the HyperLogLog class and
nothing else? It creates headaches for other dependent analytics packages
which could declare a dependence on CA without conflicts if it did.
Your pointers to the other JIRA bugs do seem relevant and to have some
mechanisms in common.
http://apache-spark-user-list.1001560.n3.nabble.com/java-serialization-errors-with-spark-files-userClassPathFirst-true-tp5832p5879.html
> * for a class that exists in both my jar and spark kernel it tries to use
userClassLoader and ends up with a NoClassDefFoundError. the class is
org.apache.avro.mapred.AvroInputFormat and the NoClassDefFoundError is for
org.apache.hadoop.mapred.FileInputFormat (which the parentClassLoader is
responsible for since it is not in my jar). i currently catch this
NoClassDefFoundError and call parentClassLoader.loadClass but thats clearly not
a solution since it loads the wrong version.
http://apache-spark-user-list.1001560.n3.nabble.com/java-serialization-errors-with-spark-files-userClassPathFirst-true-tp5832p5875.html
> ok i think the issue is visibility: a classloader can see all classes
loaded by its parent classloader. but userClassLoader does not have a parent
classloader, so its not able to "see" any classes that parentLoader is
responsible for. in my case userClassLoader is trying to get AvroInputFormat
which probably somewhere statically references FileInputFormat, which is
invisible to userClassLoader.
These two may explain why putting our jar first still doesn't work without
shading. I confess that I am not an expert on class loaders...