Hi DB, I found it a little hard to implement the solution I mentioned:
> Do not send the primary jar and secondary jars to executors'
> distributed cache. Instead, add them to "spark.jars" in SparkSubmit
> and serve them via http by calling sc.addJar in SparkContext.

If you look at the ApplicationMaster code, which is the entry point in
yarn-cluster mode, it actually creates a thread to run the user class first
and waits for the user class to create a SparkContext. That means the user
class has to be on the classpath at that time. I think we need to add the
primary jar and secondary jars twice: once to the system classpath, and then
to the executor classloader.

Best,
Xiangrui

On Wed, May 21, 2014 at 3:50 PM, DB Tsai <dbt...@stanford.edu> wrote:
> @Xiangrui
> How about we send the primary jar and secondary jars into the distributed
> cache without adding them into the system classloader of the executors?
> Then we add them using a custom classloader, so we don't need to call
> secondary jars through reflection in the primary jar. This would be
> consistent with what we do in standalone mode, and would also solve the
> scalability issue of jar distribution.
>
> @Koert
> Yes, that's why I suggest we can ignore the parent classloader of the
> custom classloader to solve this, as you say. In this case, we need to add
> all the classpath entries of the system loader into our custom one (which
> has no parent) so we will not miss the default Java classes. This is how
> Tomcat works.
>
> @Patrick
> I agree that we should have the fix by Xiangrui first, since it solves
> most of the use cases. I don't know when people would use dynamic addJar
> on YARN, since it's mostly useful for interactive environments.
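[Editor's note: the Tomcat-style loader DB describes above (no parent delegation, seeded with the system classpath plus the app jars so the default Java classes still resolve) can be sketched roughly as below. The class and method names are hypothetical illustrations, not Spark code; with a null parent, `java.*` classes still resolve through the bootstrap loader.]

```java
import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

// Sketch of a parentless loader: carry over the system classpath entries,
// then append the primary and secondary app jars, so nothing delegates to
// the system classloader at lookup time.
public class IsolatedLoaderSketch {
    public static URLClassLoader build(List<URL> appJars) throws Exception {
        List<URL> urls = new ArrayList<>();
        for (String entry : System.getProperty("java.class.path").split(File.pathSeparator)) {
            urls.add(new File(entry).toURI().toURL()); // system classpath, copied in
        }
        urls.addAll(appJars); // hypothetical: URLs of the primary/secondary jars
        return new URLClassLoader(urls.toArray(new URL[0]), null); // null parent
    }

    public static void main(String[] args) throws Exception {
        URLClassLoader loader = build(new ArrayList<>());
        // Even with no parent, java.* classes come from the bootstrap loader.
        System.out.println(loader.loadClass("java.lang.String") == String.class);
    }
}
```

Note that classes found via the copied classpath URLs would be *re-defined* by this loader, distinct from the system loader's copies, which is exactly the isolation (and the hazard) of the Tomcat approach.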
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
> On Wed, May 21, 2014 at 2:57 PM, Koert Kuipers <ko...@tresata.com> wrote:
>> db tsai, i do not think userClassPathFirst is working, unless the classes
>> you load don't reference any classes already loaded by the parent
>> classloader (a mostly hypothetical situation)... i filed a jira for this
>> here:
>> https://issues.apache.org/jira/browse/SPARK-1863
>>
>> On Tue, May 20, 2014 at 1:04 AM, DB Tsai <dbt...@stanford.edu> wrote:
>>> In 1.0, there is a new option for users to choose which classloader has
>>> higher priority, via spark.files.userClassPathFirst. I decided to submit
>>> the PR for 0.9 first. We use this patch in our lab, and we can use the
>>> jars added by sc.addJar without reflection.
>>>
>>> https://github.com/apache/spark/pull/834
>>>
>>> Can anyone comment on whether it's a good approach?
>>>
>>> Thanks.
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> -------------------------------------------------------
>>> My Blog: https://www.dbtsai.com
>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>
>>> On Mon, May 19, 2014 at 7:42 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>> Good summary! We fixed it in branch 0.9 since our production is still
>>>> on 0.9. I'm porting it to 1.0 now, and hopefully will submit a PR for
>>>> 1.0 tonight.
>>>>
>>>> Sincerely,
>>>>
>>>> DB Tsai
>>>> -------------------------------------------------------
>>>> My Blog: https://www.dbtsai.com
>>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>>
>>>> On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>>>>> It just hit me why this problem is showing up on YARN and not on
>>>>> standalone.
>>>>>
>>>>> The relevant difference between YARN and standalone is that, on YARN,
>>>>> the app jar is loaded by the system classloader instead of Spark's
>>>>> custom URL classloader.
>>>>>
>>>>> On YARN, the system classloader knows about [the classes in the Spark
>>>>> jars, the classes in the primary app jar]. The custom classloader
>>>>> knows about [the classes in secondary app jars] and has the system
>>>>> classloader as its parent.
>>>>>
>>>>> A few relevant facts (mostly redundant with what Sean pointed out):
>>>>> * Every class has a classloader that loaded it.
>>>>> * When an object of class B is instantiated inside of class A, the
>>>>>   classloader used for loading B is the classloader that was used for
>>>>>   loading A.
>>>>> * When a classloader fails to load a class, it lets its parent
>>>>>   classloader try. If its parent succeeds, its parent becomes the
>>>>>   "classloader that loaded it".
>>>>>
>>>>> So suppose class B is in a secondary app jar and class A is in the
>>>>> primary app jar:
>>>>> 1. The custom classloader will try to load class A.
>>>>> 2. It will fail, because it only knows about the secondary jars.
>>>>> 3. It will delegate to its parent, the system classloader.
>>>>> 4. The system classloader will succeed, because it knows about the
>>>>>    primary app jar.
>>>>> 5. A's classloader will be the system classloader.
>>>>> 6. A tries to instantiate an instance of class B.
>>>>> 7. B will be loaded with A's classloader, which is the system
>>>>>    classloader.
>>>>> 8. Loading B will fail, because A's classloader, which is the system
>>>>>    classloader, doesn't know about the secondary app jars.
>>>>>
>>>>> In Spark standalone, A and B are both loaded by the custom
>>>>> classloader, so this issue doesn't come up.
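[Editor's note: the classloader facts Sandy lists are plain JVM behavior and can be checked outside Spark. A minimal sketch, with no jars involved; the class name is mine, not from the thread.]

```java
import java.net.URL;
import java.net.URLClassLoader;

public class DelegationDemo {
    public static void main(String[] args) throws Exception {
        // A child loader with no URLs of its own: every lookup falls
        // through to its parent.
        URLClassLoader child = new URLClassLoader(new URL[0],
                DelegationDemo.class.getClassLoader());

        // The child "loads" String, but the defining loader is whoever
        // actually found it -- the bootstrap loader, reported as null.
        Class<?> s = child.loadClass("java.lang.String");
        assert s.getClassLoader() == null;

        // This class itself lives on the parent's classpath, so loading it
        // through the child yields the parent's copy, not a second copy.
        Class<?> self = child.loadClass("DelegationDemo");
        assert self == DelegationDemo.class;

        System.out.println("delegation facts hold");
    }
}
```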
>>>>>
>>>>> -Sandy
>>>>>
>>>>> On Mon, May 19, 2014 at 7:07 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>>>>>> Having a user define a custom class inside of an added jar and
>>>>>> instantiate it directly inside of an executor is definitely supported
>>>>>> in Spark and has been for a really long time (several years). This is
>>>>>> something we do all the time in Spark.
>>>>>>
>>>>>> DB - I'd hold off on re-architecting this until we identify exactly
>>>>>> what is causing the bug you are running into.
>>>>>>
>>>>>> In a nutshell, when the bytecode "new Foo()" is run on the executor,
>>>>>> it will ask the driver for the class over HTTP using a custom
>>>>>> classloader. Something in that pipeline is breaking here, possibly
>>>>>> related to the YARN deployment stuff.
>>>>>>
>>>>>> On Mon, May 19, 2014 at 12:29 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>>>> I don't think a custom classloader is necessary.
>>>>>>>
>>>>>>> Well, it occurs to me that this is no new problem. Hadoop, Tomcat,
>>>>>>> etc. all run custom user code that creates new user objects without
>>>>>>> reflection. I should go see how that's done. Maybe it's totally
>>>>>>> valid to set the thread's context classloader for just this purpose,
>>>>>>> and I am not thinking clearly.
>>>>>>>
>>>>>>> On Mon, May 19, 2014 at 8:26 AM, Andrew Ash <and...@andrewash.com> wrote:
>>>>>>>> Sounds like the problem is that classloaders always look in their
>>>>>>>> parents before themselves, and Spark users want executors to pick
>>>>>>>> up classes from their custom code before the ones in Spark plus its
>>>>>>>> dependencies.
>>>>>>>>
>>>>>>>> Would a custom classloader that delegates to the parent after first
>>>>>>>> checking itself fix this up?
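[Editor's note: the "check itself before the parent" loader Andrew asks about is a well-known pattern (roughly what Tomcat's webapp loader does). A minimal sketch under that assumption, not Spark's actual implementation; real child-first loaders must also take care never to define `java.*` classes themselves.]

```java
import java.net.URL;
import java.net.URLClassLoader;

// Child-first delegation: try our own URLs, then fall back to the parent.
public class ChildFirstClassLoader extends URLClassLoader {
    public ChildFirstClassLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                try {
                    c = findClass(name);          // our own URLs first
                } catch (ClassNotFoundException e) {
                    c = super.loadClass(name, false); // then the parent
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }

    public static void main(String[] args) throws Exception {
        // With no URLs of its own, everything falls back to the parent,
        // so JDK classes still resolve normally.
        ChildFirstClassLoader cl =
                new ChildFirstClassLoader(new URL[0], ClassLoader.getSystemClassLoader());
        System.out.println(cl.loadClass("java.lang.String") == String.class);
    }
}
```

Constructed with the secondary jars' URLs and the system loader as parent, classes found in those jars would shadow the parent's copies, which is the behavior users want from a userClassPathFirst-style option.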
>>>>>>>>
>>>>>>>> On Mon, May 19, 2014 at 12:17 AM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>>>>>> Hi Sean,
>>>>>>>>>
>>>>>>>>> It's true that the issue here is the classloader, and due to the
>>>>>>>>> classloader delegation model, users have to use reflection in the
>>>>>>>>> executors to pick up the classloader in order to use the classes
>>>>>>>>> added by the sc.addJar API. However, it's very inconvenient for
>>>>>>>>> users, and not documented in Spark.
>>>>>>>>>
>>>>>>>>> I'm working on a patch to solve it by calling the protected method
>>>>>>>>> addURL in URLClassLoader to update the current default
>>>>>>>>> classloader, so no custom classloader is needed anymore. I wonder
>>>>>>>>> if this is a good way to go.
>>>>>>>>>
>>>>>>>>>   private def addURL(url: URL, loader: URLClassLoader) {
>>>>>>>>>     try {
>>>>>>>>>       val method: Method =
>>>>>>>>>         classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
>>>>>>>>>       method.setAccessible(true)
>>>>>>>>>       method.invoke(loader, url)
>>>>>>>>>     } catch {
>>>>>>>>>       case t: Throwable =>
>>>>>>>>>         throw new IOException("Error, could not add URL to system classloader")
>>>>>>>>>     }
>>>>>>>>>   }
>>>>>>>>>
>>>>>>>>> Sincerely,
>>>>>>>>>
>>>>>>>>> DB Tsai
>>>>>>>>> -------------------------------------------------------
>>>>>>>>> My Blog: https://www.dbtsai.com
>>>>>>>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>>>>>>>
>>>>>>>>> On Sun, May 18, 2014 at 11:57 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>>>>>>> I might be stating the obvious for everyone, but the issue here
>>>>>>>>>> is not reflection or the source of the JAR, but the ClassLoader.
>>>>>>>>>> The basic rules are this.
>>>>>>>>>>
>>>>>>>>>> "new Foo" will use the ClassLoader that defines Foo. This is
>>>>>>>>>> usually the ClassLoader that loaded whatever it is that first
>>>>>>>>>> referenced Foo and caused it to be loaded -- usually the
>>>>>>>>>> ClassLoader holding your other app classes.
>>>>>>>>>>
>>>>>>>>>> ClassLoaders can have a parent-child relationship. ClassLoaders
>>>>>>>>>> always look in their parent before themselves.
>>>>>>>>>>
>>>>>>>>>> (Careful then -- in contexts like Hadoop or Tomcat where your app
>>>>>>>>>> is loaded in a child ClassLoader, and you reference a class that
>>>>>>>>>> Hadoop or Tomcat also has (like a lib class), you will get the
>>>>>>>>>> container's version!)
>>>>>>>>>>
>>>>>>>>>> When you load an external JAR it has a separate ClassLoader which
>>>>>>>>>> does not necessarily bear any relation to the one containing your
>>>>>>>>>> app classes, so yeah, it is not generally going to make "new Foo"
>>>>>>>>>> work.
>>>>>>>>>>
>>>>>>>>>> Reflection lets you pick the ClassLoader, yes.
>>>>>>>>>>
>>>>>>>>>> I would not call setContextClassLoader.
>>>>>>>>>>
>>>>>>>>>> On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>>>>>>>>>>> I spoke with DB offline about this a little while ago and he
>>>>>>>>>>> confirmed that he was able to access the jar from the driver.
>>>>>>>>>>>
>>>>>>>>>>> The issue appears to be a general Java issue: you can't directly
>>>>>>>>>>> instantiate a class from a dynamically loaded jar.
>>>>>>>>>>>
>>>>>>>>>>> I reproduced it locally outside of Spark with:
>>>>>>>>>>> ---
>>>>>>>>>>> URLClassLoader urlClassLoader = new URLClassLoader(new URL[] {
>>>>>>>>>>>     new File("myotherjar.jar").toURI().toURL() }, null);
>>>>>>>>>>> Thread.currentThread().setContextClassLoader(urlClassLoader);
>>>>>>>>>>> MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
>>>>>>>>>>> ---
>>>>>>>>>>>
>>>>>>>>>>> I was able to load the class with reflection.
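[Editor's note: for contrast, the reflective route Sandy says does work can be sketched as below. `java.util.ArrayList` stands in for the hypothetical `MyClassFromMyOtherJar` so the snippet runs without his jar; a real app would put the jar's URL in the loader instead.]

```java
import java.net.URL;
import java.net.URLClassLoader;

public class ReflectiveLoadDemo {
    public static void main(String[] args) throws Exception {
        // Same shape as Sandy's loader; an app would list its jar URL(s) here.
        URLClassLoader urlClassLoader =
                new URLClassLoader(new URL[0], ClassLoader.getSystemClassLoader());

        // "new" bytecode resolves against the *caller's* defining loader,
        // but Class.forName lets us name the loader explicitly.
        Object obj = Class.forName("java.util.ArrayList", true, urlClassLoader)
                          .getDeclaredConstructor()
                          .newInstance();
        System.out.println(obj.getClass().getName()); // prints "java.util.ArrayList"
    }
}
```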