[julia-users] Re: Julia and Spark

wildart Fri, 17 Apr 2015 08:55:02 -0700

Of course, Spark's data access infrastructure is unbeatable, thanks to mature 
JVM-based libraries for accessing various data sources and formats (Avro, 
Parquet, HDFS). That includes SQL support as well. But look at the Python and 
R bindings: they are just facades for JVM calls. MLlib is written in 
Scala, and so is the Streaming API; all of this is then called from Python 
or R, while all data transformations happen at the JVM level. It would be 
more efficient to write code in Scala than to use any non-JVM bindings. 
Think of the overhead of RPC and data serialization over the huge volumes of 
data that need to be processed, and you'll understand why Dpark exists. BTW, 
machine learning libraries on the JVM, good luck. It only works because of 
the large computational resources used, but even that has its limits.
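
To make the serialization point concrete, here is a minimal Python sketch (Python only because the thread has no code of its own). It is not how Py4J or SparkR are implemented; `cross_process_map` is a hypothetical helper that imitates a process boundary by pickling every batch on the way in and on the way out, and it counts the bytes that would travel over the wire.

```python
import pickle


def in_process_map(func, records):
    """Apply func to every record without leaving the process."""
    return [func(r) for r in records]


def cross_process_map(func, records, batch_size=1000):
    """Same map, but each batch is pickled and unpickled in both directions,
    as it would be when crossing a process boundary (hypothetical helper).

    Returns (results, bytes_moved).
    """
    results = []
    bytes_moved = 0
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        wire_in = pickle.dumps(batch)            # caller -> worker
        worker_input = pickle.loads(wire_in)
        worker_output = [func(r) for r in worker_input]
        wire_out = pickle.dumps(worker_output)   # worker -> caller
        results.extend(pickle.loads(wire_out))
        bytes_moved += len(wire_in) + len(wire_out)
    return results, bytes_moved


def square(x):
    return x * x


records = list(range(10_000))
direct = in_process_map(square, records)
shipped, bytes_moved = cross_process_map(square, records)
print(direct == shipped, bytes_moved)
```

Both paths give the same answer; the second one additionally copies every record twice, and that cost grows with the data volume.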


On Thursday, April 16, 2015 at 6:29:58 PM UTC-4, Andrei Zh wrote:
>
> Julia bindings for Spark would provide much more than just RDD, they will 
> give us access to multiple big data components for streaming, machine 
> learning, SQL capabilities and much more. 
>
> On Friday, April 17, 2015 at 12:54:32 AM UTC+3, [email protected] wrote:
>>
>> However, I wonder how hard it would be to implement RDD in Julia? 
>> Judging by the RDD paper 
>> <https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf>, the 
>> implementation looks straightforward. It is a robust abstraction that can 
>> be used in any parallel computation.
>>
>> On Thursday, April 16, 2015 at 3:32:32 AM UTC-4, Steven Sagaert wrote:
>>>
>>> Yes, that's a solid approach. For my personal Julia-Java integrations I 
>>> also run the JVM in a separate process.
>>>
>>> On Wednesday, April 15, 2015 at 9:30:28 PM UTC+2, [email protected] 
>>> wrote:
>>>>
>>>> 1) Simply wrap the Spark Java API via JavaCall. This is the low-level 
>>>>> approach. BTW I've experimented with JavaCall and found it was unstable & 
>>>>> also lacking functionality (e.g. there's no way to shut down the JVM or 
>>>>> create a pool of JVMs analogous to DB connections), so that might need some 
>>>>> work before trying the Spark integration.
>>>>>
>>>>
>>>> Using JavaCall is not an option, especially since the JVM became 
>>>> closed-source, see https://github.com/aviks/JavaCall.jl/issues/7.
>>>>
>>>> Python bindings are done through Py4J, which is RPC to the JVM. If you 
>>>> look at SparkR <https://github.com/apache/spark/tree/master/R>, it is 
>>>> done the same way. SparkR uses an RPC interface to communicate with a 
>>>> Netty-based Spark JVM backend that translates R calls into JVM calls, 
>>>> keeps the SparkContext on the JVM side, and ships serialized data to/from R.
>>>>
>>>> So it is just a matter of writing a Julia RPC layer to the JVM and 
>>>> wrapping the necessary Spark methods in a Julia-friendly way.
>>>>
>>>
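
For the question in the quoted thread about how hard an RDD would be to implement, the core idea (partitions plus a recorded lineage of transformations, evaluated lazily) can be sketched briefly. The following is an illustrative Python sketch, not Spark's implementation; the `RDD` class and its methods are assumptions made for the example, and it runs in one process with no scheduling, shuffling, or fault handling.

```python
class RDD:
    """Toy resilient distributed dataset: partitions plus lineage."""

    def __init__(self, partitions=None, parent=None, transform=None):
        self._partitions = partitions  # set only on a source RDD
        self._parent = parent          # lineage: the RDD this one derives from
        self._transform = transform    # function applied to a parent partition

    @classmethod
    def parallelize(cls, data, num_partitions=2):
        """Split an in-memory collection into roughly equal partitions."""
        data = list(data)
        size = max(1, -(-len(data) // num_partitions))  # ceiling division
        return cls([data[i:i + size] for i in range(0, len(data), size)])

    def num_partitions(self):
        if self._parent is None:
            return len(self._partitions)
        return self._parent.num_partitions()

    def compute(self, index):
        """Rebuild one partition from its lineage. A lost partition can be
        recomputed the same way, which is the 'resilient' part."""
        if self._parent is None:
            return list(self._partitions[index])
        return list(self._transform(self._parent.compute(index)))

    def map(self, func):
        # Lazy: only records the transformation.
        return RDD(parent=self, transform=lambda part: (func(x) for x in part))

    def filter(self, pred):
        return RDD(parent=self, transform=lambda part: (x for x in part if pred(x)))

    def collect(self):
        """Action: evaluate every partition and concatenate the results."""
        out = []
        for i in range(self.num_partitions()):
            out.extend(self.compute(i))
        return out


rdd = RDD.parallelize(range(10), 3).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
result = rdd.collect()
print(result)
```

Nothing is computed until `collect()` is called, and any single partition can be recomputed independently through `compute(index)`. A real implementation would add wide dependencies (shuffles), caching, and a scheduler that ships partitions to workers.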
