[jira] [Commented] (SPARK-15582) Support for Groovy closures

Steve Loughran (JIRA) Wed, 01 Jun 2016 03:35:41 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-15582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15310096#comment-15310096
 ]


Steve Loughran commented on SPARK-15582:
----------------------------------------

you are way ahead of me in your groovy skills, and I'm not in a position to 
offer more advice.

I'm not convinced that passing the groovy source around should be considered so 
bad. Yes, there's the cost of passing the source file, but the parsing is done 
in parallel on all machines. I guess it comes down to how much groovy code do 
you have to compile.

Regarding improving support for dynamic class compilation, if you could make it 
something more generic than just groovy (clojure, jython, javascript) then it 
may get support ... I'd discuss it with committers to see how much before 
investing too much effort. One thing to look at here is how ACCUMULO-708 added 
a classloader which can pull straight off HDFS. There's also whatever mechanism 
Hive uses for its Groovy UDF —I don't know the details there, I'm afraid

> Support for Groovy closures
> ---------------------------
>
>                 Key: SPARK-15582
>                 URL: https://issues.apache.org/jira/browse/SPARK-15582
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output, Java API
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>         Environment: 6 node Debian 8 based Spark cluster
>            Reporter: Catalin Alexandru Zamfir
>
> After fixing SPARK-13599 and running one of our jobs against this fix for 
> Groovy dependencies (which indeed it fixed), we see the Spark executors stuck 
> at a ClassNotFound exception when running as a Script (via 
> GroovyShell.evalute (scriptText)). It seems Spark cannot de-serialize the 
> closure, or the closure is not received by the executor.
> {noformat}
> sparkContext.binaryFiles (ourPath).flatMap ({ onePathEntry -> code-block } as 
> FlatMapFunction).count ();
> { onePathEntry -> code-block } denotes a Groovy closure.
> {noformat}
> There is a groovy-spark example @ 
> https://github.com/bunions1/groovy-spark-example ... However the above uses a 
> modified Groovy. If my understanding is correct, Groovy compiles to native 
> byte-code, which should be easy for Spark to pick-up and use closures.
> The above example code fails with this stack-trace:
> {noformat}
> Caused by: java.lang.ClassNotFoundException: Script1$_run_closure1
>       at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>       at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>       at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>       at java.lang.Class.forName0(Native Method)
>       at java.lang.Class.forName(Class.java:348)
>       at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
>       at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
>       at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
>       at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
>       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>       at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>       at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>       at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>       at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>       at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>       at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>       at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>       at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>       at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>       at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>       at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>       at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>       at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>       at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>       at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>       at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>       at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>       at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>       at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>       at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
>       at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
>       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>       at org.apache.spark.scheduler.Task.run(Task.scala:89)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214
> {noformat}
> Any ideas on how to tackle this, welcomed. I've tried Googling around for 
> similar issues, but nobody has found a solution.
> At least, point me on where to "hack" to make Spark support closures and I'd 
> share some of my time to make it work. There is SPARK-2171 arguing that 
> support for this is out of the box, but for projects of a relative complex 
> size where the driver application is contained/part-of a bigger application 
> and running on a cluster, things do not seem to work. I don't know if 
> SPARK-2171 has tried to run outside of a local[] cluster set-up where such 
> issues can arise.
> I saw a couple of people trying to make it to work, but again, they look to 
> work-arounds (eg. distribution of byte-code manually before needed, adding a 
> JAR with addJar and other work-arounds).
> Can this be done? Where can we look?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SPARK-15582) Support for Groovy closures

Reply via email to