Hi DB, I found it a little hard to implement the solution I mentioned:
> Do not send the primary jar and secondary jars to executors'
> distributed cache. Instead, add them to "spark.jars" in SparkSubmit
> and serve them via http by calling sc.addJar in SparkContext.

If you look at the ApplicationMaster code, which is the entry point in
yarn-cluster mode, it actually creates a thread to run the user class first
and waits for the user class to create a SparkContext. That means the user
class has to be on the classpath at that time. I think we need to add the
primary jar and secondary jars twice: once to the system classpath, and then
to the executor classloader.

Best,
Xiangrui

On Wed, May 21, 2014 at 3:50 PM, DB Tsai <dbt...@stanford.edu> wrote:
> @Xiangrui
> How about we send the primary jar and secondary jars into the distributed
> cache without adding them into the system classloader of the executors?
> Then we add them using a custom classloader, so we don't need to call
> secondary jars through reflection in the primary jar. This would be
> consistent with what we do in standalone mode, and would also solve the
> scalability issue of jar distribution.
>
> @Koert
> Yes, that's why I suggest we can ignore the parent classloader of the
> custom classloader to solve this, as you say. In this case, we need to add
> all the classpath entries of the system loader into our custom one (which
> has no parent) so we will not miss the default Java classes. This is how
> Tomcat works.
>
> @Patrick
> I agree that we should have the fix by Xiangrui first, since it solves
> most of the use cases. I don't know when people would use dynamic addJar
> on YARN, since it's mostly useful for interactive environments.
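[Editor's note: the Tomcat-style loader DB describes above (no parent delegation, seeded with the system classpath plus the app jars so the default Java classes still resolve) can be sketched roughly as below. The class and method names are hypothetical illustrations, not Spark code; with a null parent, `java.*` classes still resolve through the bootstrap loader.]

```java
import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

// Sketch of a parentless loader: carry over the system classpath entries,
// then append the primary and secondary app jars, so nothing delegates to
// the system classloader at lookup time.
public class IsolatedLoaderSketch {
    public static URLClassLoader build(List<URL> appJars) throws Exception {
        List<URL> urls = new ArrayList<>();
        for (String entry : System.getProperty("java.class.path").split(File.pathSeparator)) {
            urls.add(new File(entry).toURI().toURL()); // system classpath, copied in
        }
        urls.addAll(appJars); // hypothetical: URLs of the primary/secondary jars
        return new URLClassLoader(urls.toArray(new URL[0]), null); // null parent
    }

    public static void main(String[] args) throws Exception {
        URLClassLoader loader = build(new ArrayList<>());
        // Even with no parent, java.* classes come from the bootstrap loader.
        System.out.println(loader.loadClass("java.lang.String") == String.class);
    }
}
```

Note that classes found via the copied classpath URLs would be *re-defined* by this loader, distinct from the system loader's copies, which is exactly the isolation (and the hazard) of the Tomcat approach.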
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
> On Wed, May 21, 2014 at 2:57 PM, Koert Kuipers <ko...@tresata.com> wrote:
>> db tsai, i do not think userClassPathFirst is working, unless the classes
>> you load don't reference any classes already loaded by the parent
>> classloader (a mostly hypothetical situation)... i filed a jira for this
>> here:
>> https://issues.apache.org/jira/browse/SPARK-1863
>>
>> On Tue, May 20, 2014 at 1:04 AM, DB Tsai <dbt...@stanford.edu> wrote:
>>> In 1.0, there is a new option for users to choose which classloader has
>>> higher priority, via spark.files.userClassPathFirst. I decided to submit
>>> the PR for 0.9 first. We use this patch in our lab, and we can use the
>>> jars added by sc.addJar without reflection.
>>>
>>> https://github.com/apache/spark/pull/834
>>>
>>> Can anyone comment on whether it's a good approach?
>>>
>>> Thanks.
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> -------------------------------------------------------
>>> My Blog: https://www.dbtsai.com
>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>
>>> On Mon, May 19, 2014 at 7:42 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>> Good summary! We fixed it in branch 0.9 since our production is still
>>>> on 0.9. I'm porting it to 1.0 now, and hopefully will submit a PR for
>>>> 1.0 tonight.
>>>>
>>>> Sincerely,
>>>>
>>>> DB Tsai
>>>> -------------------------------------------------------
>>>> My Blog: https://www.dbtsai.com
>>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>>
>>>> On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>>>>> It just hit me why this problem is showing up on YARN and not on
>>>>> standalone.
>>>>>
>>>>> The relevant difference between YARN and standalone is that, on YARN,
>>>>> the app jar is loaded by the system classloader instead of Spark's
>>>>> custom URL classloader.
>>>>>
>>>>> On YARN, the system classloader knows about [the classes in the Spark
>>>>> jars, the classes in the primary app jar]. The custom classloader
>>>>> knows about [the classes in secondary app jars] and has the system
>>>>> classloader as its parent.
>>>>>
>>>>> A few relevant facts (mostly redundant with what Sean pointed out):
>>>>> * Every class has a classloader that loaded it.
>>>>> * When an object of class B is instantiated inside of class A, the
>>>>>   classloader used for loading B is the classloader that was used for
>>>>>   loading A.
>>>>> * When a classloader fails to load a class, it lets its parent
>>>>>   classloader try. If its parent succeeds, its parent becomes the
>>>>>   "classloader that loaded it".
>>>>>
>>>>> So suppose class B is in a secondary app jar and class A is in the
>>>>> primary app jar:
>>>>> 1. The custom classloader will try to load class A.
>>>>> 2. It will fail, because it only knows about the secondary jars.
>>>>> 3. It will delegate to its parent, the system classloader.
>>>>> 4. The system classloader will succeed, because it knows about the
>>>>>    primary app jar.
>>>>> 5. A's classloader will be the system classloader.
>>>>> 6. A tries to instantiate an instance of class B.
>>>>> 7. B will be loaded with A's classloader, which is the system
>>>>>    classloader.
>>>>> 8. Loading B will fail, because A's classloader, which is the system
>>>>>    classloader, doesn't know about the secondary app jars.
>>>>>
>>>>> In Spark standalone, A and B are both loaded by the custom
>>>>> classloader, so this issue doesn't come up.
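[Editor's note: the classloader facts Sandy lists are plain JVM behavior and can be checked outside Spark. A minimal sketch, with no jars involved; the class name is mine, not from the thread.]

```java
import java.net.URL;
import java.net.URLClassLoader;

public class DelegationDemo {
    public static void main(String[] args) throws Exception {
        // A child loader with no URLs of its own: every lookup falls
        // through to its parent.
        URLClassLoader child = new URLClassLoader(new URL[0],
                DelegationDemo.class.getClassLoader());

        // The child "loads" String, but the defining loader is whoever
        // actually found it -- the bootstrap loader, reported as null.
        Class<?> s = child.loadClass("java.lang.String");
        assert s.getClassLoader() == null;

        // This class itself lives on the parent's classpath, so loading it
        // through the child yields the parent's copy, not a second copy.
        Class<?> self = child.loadClass("DelegationDemo");
        assert self == DelegationDemo.class;

        System.out.println("delegation facts hold");
    }
}
```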
>>>>>
>>>>> -Sandy
>>>>>
>>>>> On Mon, May 19, 2014 at 7:07 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>>>>>> Having a user define a custom class inside of an added jar and
>>>>>> instantiate it directly inside of an executor is definitely supported
>>>>>> in Spark and has been for a really long time (several years). This is
>>>>>> something we do all the time in Spark.
>>>>>>
>>>>>> DB - I'd hold off on re-architecting this until we identify exactly
>>>>>> what is causing the bug you are running into.
>>>>>>
>>>>>> In a nutshell, when the bytecode "new Foo()" is run on the executor,
>>>>>> it will ask the driver for the class over HTTP using a custom
>>>>>> classloader. Something in that pipeline is breaking here, possibly
>>>>>> related to the YARN deployment stuff.
>>>>>>
>>>>>> On Mon, May 19, 2014 at 12:29 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>>>> I don't think a custom classloader is necessary.
>>>>>>>
>>>>>>> Well, it occurs to me that this is no new problem. Hadoop, Tomcat,
>>>>>>> etc. all run custom user code that creates new user objects without
>>>>>>> reflection. I should go see how that's done. Maybe it's totally
>>>>>>> valid to set the thread's context classloader for just this purpose,
>>>>>>> and I am not thinking clearly.
>>>>>>>
>>>>>>> On Mon, May 19, 2014 at 8:26 AM, Andrew Ash <and...@andrewash.com> wrote:
>>>>>>>> Sounds like the problem is that classloaders always look in their
>>>>>>>> parents before themselves, and Spark users want executors to pick
>>>>>>>> up classes from their custom code before the ones in Spark plus its
>>>>>>>> dependencies.
>>>>>>>>
>>>>>>>> Would a custom classloader that delegates to the parent after first
>>>>>>>> checking itself fix this up?
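[Editor's note: the "check itself before the parent" loader Andrew asks about is a well-known pattern (roughly what Tomcat's webapp loader does). A minimal sketch under that assumption, not Spark's actual implementation; real child-first loaders must also take care never to define `java.*` classes themselves.]

```java
import java.net.URL;
import java.net.URLClassLoader;

// Child-first delegation: try our own URLs, then fall back to the parent.
public class ChildFirstClassLoader extends URLClassLoader {
    public ChildFirstClassLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                try {
                    c = findClass(name);          // our own URLs first
                } catch (ClassNotFoundException e) {
                    c = super.loadClass(name, false); // then the parent
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }

    public static void main(String[] args) throws Exception {
        // With no URLs of its own, everything falls back to the parent,
        // so JDK classes still resolve normally.
        ChildFirstClassLoader cl =
                new ChildFirstClassLoader(new URL[0], ClassLoader.getSystemClassLoader());
        System.out.println(cl.loadClass("java.lang.String") == String.class);
    }
}
```

Constructed with the secondary jars' URLs and the system loader as parent, classes found in those jars would shadow the parent's copies, which is the behavior users want from a userClassPathFirst-style option.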
>>>>>>>>
>>>>>>>> On Mon, May 19, 2014 at 12:17 AM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>>>>>> Hi Sean,
>>>>>>>>>
>>>>>>>>> It's true that the issue here is the classloader, and due to the
>>>>>>>>> classloader delegation model, users have to use reflection in the
>>>>>>>>> executors to pick up the classloader in order to use the classes
>>>>>>>>> added by the sc.addJar API. However, it's very inconvenient for
>>>>>>>>> users, and not documented in Spark.
>>>>>>>>>
>>>>>>>>> I'm working on a patch to solve it by calling the protected method
>>>>>>>>> addURL in URLClassLoader to update the current default
>>>>>>>>> classloader, so no custom classloader is needed anymore. I wonder
>>>>>>>>> if this is a good way to go.
>>>>>>>>>
>>>>>>>>>   private def addURL(url: URL, loader: URLClassLoader) {
>>>>>>>>>     try {
>>>>>>>>>       val method: Method =
>>>>>>>>>         classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
>>>>>>>>>       method.setAccessible(true)
>>>>>>>>>       method.invoke(loader, url)
>>>>>>>>>     } catch {
>>>>>>>>>       case t: Throwable =>
>>>>>>>>>         throw new IOException("Error, could not add URL to system classloader")
>>>>>>>>>     }
>>>>>>>>>   }
>>>>>>>>>
>>>>>>>>> Sincerely,
>>>>>>>>>
>>>>>>>>> DB Tsai
>>>>>>>>> -------------------------------------------------------
>>>>>>>>> My Blog: https://www.dbtsai.com
>>>>>>>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>>>>>>>
>>>>>>>>> On Sun, May 18, 2014 at 11:57 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>>>>>>> I might be stating the obvious for everyone, but the issue here
>>>>>>>>>> is not reflection or the source of the JAR, but the ClassLoader.
>>>>>>>>>> The basic rules are this.
>>>>>>>>>>
>>>>>>>>>> "new Foo" will use the ClassLoader that defines Foo. This is
>>>>>>>>>> usually the ClassLoader that loaded whatever it is that first
>>>>>>>>>> referenced Foo and caused it to be loaded -- usually the
>>>>>>>>>> ClassLoader holding your other app classes.
>>>>>>>>>>
>>>>>>>>>> ClassLoaders can have a parent-child relationship. ClassLoaders
>>>>>>>>>> always look in their parent before themselves.
>>>>>>>>>>
>>>>>>>>>> (Careful then -- in contexts like Hadoop or Tomcat where your app
>>>>>>>>>> is loaded in a child ClassLoader, and you reference a class that
>>>>>>>>>> Hadoop or Tomcat also has (like a lib class), you will get the
>>>>>>>>>> container's version!)
>>>>>>>>>>
>>>>>>>>>> When you load an external JAR it has a separate ClassLoader which
>>>>>>>>>> does not necessarily bear any relation to the one containing your
>>>>>>>>>> app classes, so yeah, it is not generally going to make "new Foo"
>>>>>>>>>> work.
>>>>>>>>>>
>>>>>>>>>> Reflection lets you pick the ClassLoader, yes.
>>>>>>>>>>
>>>>>>>>>> I would not call setContextClassLoader.
>>>>>>>>>>
>>>>>>>>>> On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>>>>>>>>>>> I spoke with DB offline about this a little while ago and he
>>>>>>>>>>> confirmed that he was able to access the jar from the driver.
>>>>>>>>>>>
>>>>>>>>>>> The issue appears to be a general Java issue: you can't directly
>>>>>>>>>>> instantiate a class from a dynamically loaded jar.
>>>>>>>>>>>
>>>>>>>>>>> I reproduced it locally outside of Spark with:
>>>>>>>>>>> ---
>>>>>>>>>>> URLClassLoader urlClassLoader = new URLClassLoader(new URL[] {
>>>>>>>>>>>     new File("myotherjar.jar").toURI().toURL() }, null);
>>>>>>>>>>> Thread.currentThread().setContextClassLoader(urlClassLoader);
>>>>>>>>>>> MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
>>>>>>>>>>> ---
>>>>>>>>>>>
>>>>>>>>>>> I was able to load the class with reflection.
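[Editor's note: for contrast, the reflective route Sandy says does work can be sketched as below. `java.util.ArrayList` stands in for the hypothetical `MyClassFromMyOtherJar` so the snippet runs without his jar; a real app would put the jar's URL in the loader instead.]

```java
import java.net.URL;
import java.net.URLClassLoader;

public class ReflectiveLoadDemo {
    public static void main(String[] args) throws Exception {
        // Same shape as Sandy's loader; an app would list its jar URL(s) here.
        URLClassLoader urlClassLoader =
                new URLClassLoader(new URL[0], ClassLoader.getSystemClassLoader());

        // "new" bytecode resolves against the *caller's* defining loader,
        // but Class.forName lets us name the loader explicitly.
        Object obj = Class.forName("java.util.ArrayList", true, urlClassLoader)
                          .getDeclaredConstructor()
                          .newInstance();
        System.out.println(obj.getClass().getName()); // prints "java.util.ArrayList"
    }
}
```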