[julia-users] Re: Julia and Spark

Tanmay K. Mohapatra Sat, 18 Apr 2015 01:05:23 -0700

There was an attempt at a pure Julia RDD in Spark.jl 
(https://github.com/d9w/Spark.jl).
We also have DistributedArrays 
(https://github.com/JuliaParallel/DistributedArrays.jl), Blocks 
(https://github.com/JuliaParallel/Blocks.jl) and DataFrames 
(https://github.com/JuliaStats/DataFrames.jl).


I wonder if it is possible to leverage any of these for a pure Julia RDD.
And MachineLearning.jl <https://github.com/benhamner/MachineLearning.jl> or 
something similar could probably be the equivalent of MLlib.
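
For anyone who has not read the RDD paper, the core idea can be sketched in a 
few lines: an RDD is an immutable, partitioned dataset that records how each 
partition is derived (its lineage), so a lost partition can be recomputed 
instead of replicated. The sketch below is written in Python purely for 
illustration. The names RDD, parallelize, compute and collect are hypothetical 
and are not taken from Spark.jl or any other package mentioned here, and 
everything runs in a single process with no scheduling or fault handling.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass(frozen=True)
class RDD:
    """Hypothetical toy RDD: immutable, partitioned, defined by its lineage."""
    partitions: int
    # Recomputes partition i from scratch by replaying the lineage.
    compute: Callable[[int], Iterator]

    def map(self, f: Callable) -> "RDD":
        # Lazy: nothing runs until a partition is actually requested.
        return RDD(self.partitions, lambda i: (f(x) for x in self.compute(i)))

    def filter(self, pred: Callable) -> "RDD":
        return RDD(self.partitions,
                   lambda i: (x for x in self.compute(i) if pred(x)))

    def collect(self) -> List:
        # A real system would run partitions on different workers;
        # here they are simply evaluated one after another.
        out: List = []
        for i in range(self.partitions):
            out.extend(self.compute(i))
        return out


def parallelize(data: Iterable, partitions: int) -> RDD:
    """Split an in-memory collection into contiguous partitions."""
    items = list(data)
    n = len(items)

    def compute(i: int) -> Iterator:
        start = i * n // partitions
        end = (i + 1) * n // partitions
        return iter(items[start:end])

    return RDD(partitions, compute)


squares = parallelize(range(10), 3).map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)
result = evens.collect()
```

Because each RDD only stores a function from partition index to elements, 
calling evens.compute(1) again rebuilds that single partition from the source 
data, which is the recovery property the paper relies on.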


On Friday, April 17, 2015 at 9:24:03 PM UTC+5:30, [email protected] wrote:
>
> Of course, Spark's data access infrastructure is unbeatable, thanks to mature 
> JVM-based libraries for accessing various data sources and formats (Avro, 
> Parquet, HDFS). That includes SQL support as well. But look at the Python and 
> R bindings: these are just facades for JVM calls. MLlib is written in 
> Scala, and so is the Streaming API. When all this is called from Python or R, 
> every data transformation still happens at the JVM level. It would be more 
> efficient to write code in Scala than to use any non-JVM bindings. Think of 
> the overhead of RPC and data serialization over the huge volumes of data to 
> be processed and you'll understand why Dpark exists. BTW, machine learning 
> libraries on the JVM, good luck. It only works because of the large 
> computational resources used, but even that has its limits.
>
> On Thursday, April 16, 2015 at 6:29:58 PM UTC-4, Andrei Zh wrote:
>>
>> Julia bindings for Spark would provide much more than just RDDs: they would 
>> give us access to multiple big data components for streaming, machine 
>> learning, SQL and much more.
>>
>> On Friday, April 17, 2015 at 12:54:32 AM UTC+3, [email protected] wrote:
>>>
>>> However, I wonder how hard it would be to implement RDDs in Julia. The 
>>> RDD paper 
>>> <https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf> makes the 
>>> implementation look straightforward. It is a robust abstraction that can 
>>> be used in any parallel computation.
>>>
>>> On Thursday, April 16, 2015 at 3:32:32 AM UTC-4, Steven Sagaert wrote:
>>>>
>>>> Yes, that's a solid approach. For my personal Julia-Java integrations 
>>>> I also run the JVM in a separate process.
>>>>
>>>> On Wednesday, April 15, 2015 at 9:30:28 PM UTC+2, [email protected] 
>>>> wrote:
>>>>>
>>>>>> 1) simply wrap the Spark Java API via JavaCall. This is the low-level 
>>>>>> approach. BTW I've experimented with JavaCall and found it was unstable 
>>>>>> & also lacking functionality (e.g. there's no way to shut down the JVM 
>>>>>> or create a pool of JVMs analogous to DB connections), so that might 
>>>>>> need some work before trying the Spark integration.
>>>>>>
>>>>>
>>>>> Using JavaCall is not an option, especially since the JVM became 
>>>>> closed-source, see https://github.com/aviks/JavaCall.jl/issues/7.
>>>>>
>>>>> The Python bindings are done through Py4J, which is RPC to the JVM. If 
>>>>> you look at sparkR <https://github.com/apache/spark/tree/master/R>, 
>>>>> it is done the same way. sparkR uses an RPC interface to communicate 
>>>>> with a Netty-based Spark JVM backend that translates R calls into JVM 
>>>>> calls, keeps the SparkContext on the JVM side, and ships serialized data 
>>>>> to/from R.
>>>>>
>>>>> So it is just a matter of writing a Julia RPC client for the JVM and 
>>>>> wrapping the necessary Spark methods in a Julia-friendly way.
>>>>>
>>>>
