[
https://issues.apache.org/jira/browse/LIVY-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rajat Khandelwal updated LIVY-880:
----------------------------------
Description:
Livy's mechanism for loading third-party jars into the interpreter is incorrect,
especially when the third-party jar contains a conflicting class.
By third-party jars, I mean the jars you supply while creating a session:
{"name":"session-name", "kind":"spark", "jars":["hdfs://path/to/jar/1.jar"]}
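For illustration, the same session-creation call can be made against Livy's REST endpoint. A minimal Python sketch (the host and port are placeholders, adjust for your deployment):

```python
import json
import urllib.request

# Hypothetical endpoint -- substitute your Livy host and port.
LIVY_URL = "http://livy-host:8998/sessions"

def build_session_request(name, jars):
    """Build the POST request that creates an interactive Spark session."""
    payload = {"name": name, "kind": "spark", "jars": jars}
    return urllib.request.Request(
        LIVY_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_session_request("session-name", ["hdfs://path/to/jar/1.jar"])
# urllib.request.urlopen(req) would submit it; Livy replies with a session
# id that you then poll at /sessions/{id} until its state becomes "idle".
print(json.loads(req.data.decode("utf-8")))
```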
Now, when there is a conflict (the scenario where the jar is a fat jar bundling,
e.g., older hadoop or jackson libs), we run into a problem, and the problem
manifests in a weird way.
Let's say your jar has a class named `a.b.c.SomeClass`.
This is what I have observed:
* create session goes through
* you are not able to import anything from your jar: running
`import a.b.c.SomeClass` fails with error: object b is not a member of
package a
* but you are able to load classes from the jar via reflection, e.g.
`Thread.currentThread.getContextClassLoader.loadClass("a.b.c.SomeClass")`
Essentially, the classloaders are messed up: you can load a class by reflection,
but the REPL has no idea that the class is on its classpath.
I have seen more reports of this problem on JIRA, Google Groups,
Stack Overflow, etc. Mentioning a few:
* https://stackoverflow.com/questions/65654752/getting-import-error-while-executing-statements-via-livy-sessions-with-emr
* https://community.cloudera.com/t5/Support-Questions/How-to-import-External-Libraries-for-Livy-Interpreter-using/td-p/171812
* https://community.cloudera.com/t5/Support-Questions/Livy-Spark-Rest-Jar-submission-interactive-session/td-p/302924
* https://groups.google.com/a/cloudera.org/g/hue-user/c/wR6d7gR_Avs
* https://community.cloudera.com/t5/Community-Articles/Added-external-package-to-livy-causes-quot-console-25-quot/ta-p/245802
* https://issues.apache.org/jira/browse/LIVY-857
There is no definitive answer in any of them. People have suggested these things:
1. Adding the jar to the Livy installation, inside repl-jars
2. Adding the jar to Livy's rsc-jars
3. Adding the jar to the hadoop installation on all nodes and using spark
4. Using packages (group:artifact:version) instead of jars
We tried all of these: the first two didn't work for us, the third did. But the
third mechanism is not ideal, because it treats a third-party jar as a
library jar (equivalent to the hadoop/spark jars), and that is not always
feasible on prod systems.
The fourth mechanism is not always feasible either, as Livy only lets you
specify packages and not their repository locations.
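For completeness, a hedged sketch of what suggestion 4 looks like on the wire: Livy forwards `conf` entries to Spark, so Maven coordinates can be passed through `spark.jars.packages` and fetched by Spark's own resolver. The coordinates below are hypothetical, and note there is still no first-class Livy field for the repository location:

```python
import json

# Hedged sketch of suggestion 4: instead of shipping a jar, let Spark's
# resolver fetch Maven coordinates via spark.jars.packages. The
# group:artifact:version below is hypothetical.
payload = {
    "name": "session-name",
    "kind": "spark",
    "conf": {"spark.jars.packages": "com.example:my-lib:1.0.0"},
}
print(json.dumps(payload, indent=2))
```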
Now, digging deeper, we figured out the cause and a potential solution.
Livy uses the Scala interpreter under the hood. The relevant classes are
[ILoop](https://github.com/scala/scala/blob/a05d71a1ea33b265015794f71d12020d3f7ddd1f/src/repl/scala/tools/nsc/interpreter/ILoop.scala#L646-L701)
and
[IMain](https://github.com/scala/scala/blob/a05d71a1ea33b265015794f71d12020d3f7ddd1f/src/repl/scala/tools/nsc/interpreter/IMain.scala#L251).
If you look at the first link, you'll see there are two methods in `ILoop`, both
of which are wrappers around `intp.addUrlsToClassPath`. The first wrapper,
`addClasspath`, is deprecated; the second, `require`, is recommended.
The `require` method does extra checks on the jar before actually calling
`intp.addUrlsToClassPath`. The checks are purely for class conflicts: if any
class in the required jars conflicts with already loaded classes, the jar
won't be loaded. The Scala REPL's classpath is a bit fragile and does not
allow the same class to be defined in multiple jars, so the Scala developers
work around this by exposing the `require` interface on the command line. By
using `require`, the user learns what the conflict is and can take corrective
action. If we bypass `require` (which is what happens in Livy's REPL code), we
get into this state where you can load classes through reflection but you
can't import them.
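The conflict check that `require` performs can be sketched as follows. This is a Python analogue of the idea (enumerate the class entries in the incoming jar and intersect them with classes already on the classpath), not Livy's or Scala's actual code:

```python
import zipfile

def class_names(jar_path):
    """Fully-qualified class names bundled in a jar (skipping inner classes)."""
    with zipfile.ZipFile(jar_path) as jar:
        return {
            name[:-len(".class")].replace("/", ".")
            for name in jar.namelist()
            if name.endswith(".class") and "$" not in name
        }

def conflicts(new_jar, loaded_jars):
    """Classes in new_jar that already exist in the loaded jars.

    `require` refuses the jar when this set is non-empty; bypassing the
    check leaves the REPL unable to import the conflicting classes even
    though reflection can still load them.
    """
    already = set()
    for jar in loaded_jars:
        already |= class_names(jar)
    return class_names(new_jar) & already
```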
Now, for anyone looking for a workaround: clean up your jar and make sure it
has as few conflicts as possible with the hadoop/scala/spark libraries. If
your lib depends on these, mark them `provided` and don't bundle them in your
jar.
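Assuming an sbt build, the `provided` workaround looks like the fragment below (Maven's `<scope>provided</scope>` is the equivalent); the versions and the `com.example` dependency are hypothetical:

```scala
// build.sbt -- mark platform libraries Provided so they are compiled
// against but left out of the assembled fat jar.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.1.2" % Provided,
  "org.apache.hadoop" % "hadoop-client" % "3.2.1" % Provided,
  // only genuinely third-party dependencies stay at default scope
  "com.example" %% "my-lib" % "1.0.0"  // hypothetical
)
```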
was:Livy's class-loading is incorrect in some scenarios, especially when
there is a conflict in the
> Loading third-party jar in the interpreter is buggy
> ---------------------------------------------------
>
> Key: LIVY-880
> URL: https://issues.apache.org/jira/browse/LIVY-880
> Project: Livy
> Issue Type: Bug
> Reporter: Rajat Khandelwal
> Priority: Major
>
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)