Re: [julia-users] Re: Julia and Spark

Jey Kottalam Sat, 31 Oct 2015 09:18:07 -0700

Could you please define "streams of RDDs"?

On Sat, Oct 31, 2015 at 12:59 AM, <[email protected]> wrote:


> Is there any implementation of streams of RDDs for Julia?
>
>
> On Monday, April 20, 2015 at 11:54:10 AM UTC-7, [email protected] wrote:
>>
>> Unfortunately, Spark.jl is an incorrect RDD implementation. Instead of
>> creating transformations as independent abstract operations with lazy
>> evaluation, the package executes every transformation immediately when
>> it is called. This completely undermines the whole purpose of an RDD as a
>> fault-tolerant parallel data structure.
>>
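
The lazy-evaluation point quoted above can be illustrated with a toy sketch. It is written in Python purely for illustration and is not the API of Spark or of Spark.jl; the class name LazyRDD, its single-partition design, and its method set are all assumptions made for this example. Transformations only record lineage, and work happens when an action (collect) is called, which is also what would allow a lost partition to be recomputed from its source.

```python
from typing import Callable, Iterable, Iterator, Optional


class LazyRDD:
    """Hypothetical single-partition RDD: transformations only record lineage."""

    def __init__(self,
                 source: Optional[Callable[[], Iterable]] = None,
                 parent: Optional["LazyRDD"] = None,
                 op: Optional[Callable[[Iterator], Iterator]] = None):
        self._source = source  # how to (re)load the base data
        self._parent = parent  # upstream RDD in the lineage
        self._op = op          # deferred transformation over the parent's items

    def map(self, f: Callable) -> "LazyRDD":
        # Nothing is executed here; the step is only recorded.
        return LazyRDD(parent=self, op=lambda it: (f(x) for x in it))

    def filter(self, pred: Callable) -> "LazyRDD":
        return LazyRDD(parent=self, op=lambda it: (x for x in it if pred(x)))

    def _compute(self) -> Iterator:
        # Replay the lineage starting from the source data.
        if self._parent is None:
            return iter(self._source())
        return self._op(self._parent._compute())

    def collect(self) -> list:
        # Action: the only place where the recorded work actually runs.
        return list(self._compute())
```

With this sketch, LazyRDD(source=lambda: range(5)).map(f).filter(g) calls neither f nor g until collect() is invoked, and calling collect() again replays the whole lineage from the source.
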
>> On Saturday, April 18, 2015 at 4:04:23 AM UTC-4, Tanmay K. Mohapatra
>> wrote:
>>>
>>> There was an attempt at a pure Julia RDD in Spark.jl (
>>> https://github.com/d9w/Spark.jl).
>>> We also have DistributedArrays (
>>> https://github.com/JuliaParallel/DistributedArrays.jl), Blocks (
>>> https://github.com/JuliaParallel/Blocks.jl) and DataFrames (
>>> https://github.com/JuliaStats/DataFrames.jl).
>>>
>>> I wonder if it is possible to leverage any of these for a pure Julia RDD.
>>> And MachineLearning.jl
>>> <https://github.com/benhamner/MachineLearning.jl> or
>>> something similar could probably be the equivalent of MLlib.
>>>
>>>
>>> On Friday, April 17, 2015 at 9:24:03 PM UTC+5:30, [email protected]
>>> wrote:
>>>>
>>>> Of course, Spark's data access infrastructure is unbeatable, thanks to
>>>> mature JVM-based libraries for accessing various data sources and formats
>>>> (Avro, Parquet, HDFS). That includes SQL support as well. But look at the
>>>> Python and R bindings: these are just facades for JVM calls. MLlib is
>>>> written in Scala, and so is the Streaming API; all of this is then called
>>>> from Python or R, and all data transformations happen at the JVM level. It
>>>> would be more efficient to write code in Scala than to use any non-JVM
>>>> bindings. Think of the RPC and data serialization overhead over the huge
>>>> volumes of data to be processed and you'll understand why Dpark exists.
>>>> BTW, machine learning libraries on the JVM: good luck. It only works
>>>> because of the large computational resources used, but even that has its
>>>> limits.
>>>>
>>>> On Thursday, April 16, 2015 at 6:29:58 PM UTC-4, Andrei Zh wrote:
>>>>>
>>>>> Julia bindings for Spark would provide much more than just RDD, they
>>>>> will give us access to multiple big data components for streaming, machine
>>>>> learning, SQL capabilities and much more.
>>>>>
>>>>> On Friday, April 17, 2015 at 12:54:32 AM UTC+3, [email protected]
>>>>> wrote:
>>>>>>
>>>>>> However, I wonder how hard it would be to implement RDD in Julia. The
>>>>>> RDD paper
>>>>>> <https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf> makes
>>>>>> the implementation look straightforward. It is a robust abstraction
>>>>>> that can be used in any parallel computation.
>>>>>>
>>>>>> On Thursday, April 16, 2015 at 3:32:32 AM UTC-4, Steven Sagaert wrote:
>>>>>>>
>>>>>>> Yes, that's a solid approach. For my personal Julia-Java
>>>>>>> integrations I also run the JVM in a separate process.
>>>>>>>
>>>>>>> On Wednesday, April 15, 2015 at 9:30:28 PM UTC+2, [email protected]
>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> 1) simply wrap the Spark Java API via JavaCall. This is the low
>>>>>>>>> level approach. BTW I've experimented with JavaCall and found it was
>>>>>>>>> unstable & also lacking functionality (e.g. there's no way to
>>>>>>>>> shut down the JVM or create a pool of JVMs analogous to DB
>>>>>>>>> connections), so that might need some work before trying the Spark
>>>>>>>>> integration.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Using JavaCall is not an option, especially since the JVM became
>>>>>>>> closed-source, see https://github.com/aviks/JavaCall.jl/issues/7.
>>>>>>>>
>>>>>>>> Python bindings are done through Py4J, which is RPC to the JVM. If you
>>>>>>>> look at SparkR <https://github.com/apache/spark/tree/master/R>,
>>>>>>>> it is done in the same way. SparkR uses an RPC interface to communicate
>>>>>>>> with a Netty-based Spark JVM backend that translates R calls into JVM
>>>>>>>> calls, keeps the SparkContext on the JVM side, and ships serialized data
>>>>>>>> to/from R.
>>>>>>>>
>>>>>>>> So it is just a matter of writing a Julia RPC layer to the JVM and
>>>>>>>> wrapping the necessary Spark methods in a Julia-friendly way.
>>>>>>>>
>>>>>>>
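
The RPC approach described at the bottom of the thread (a client stub that ships serialized calls to a backend which keeps the real objects on its own side) can be sketched roughly as follows. This is a minimal illustration in Python, not the Py4J or SparkR wire protocol: the JSON message layout, the 4-byte length prefix, and the names frame, unframe, ToyBackend and RemoteRef are all assumptions made for this example. The backend here is an in-process stand-in for the JVM side, so the framed bytes are handed over directly instead of travelling over a socket.

```python
import json
import struct


def frame(message: dict) -> bytes:
    """Serialize a message as JSON with a 4-byte big-endian length prefix."""
    payload = json.dumps(message).encode("utf-8")
    return struct.pack(">I", len(payload)) + payload


def unframe(data: bytes) -> dict:
    """Inverse of frame(); checks the length prefix before decoding."""
    (length,) = struct.unpack(">I", data[:4])
    payload = data[4:4 + length]
    if len(payload) != length:
        raise ValueError("truncated frame")
    return json.loads(payload.decode("utf-8"))


class ToyBackend:
    """Hypothetical stand-in for the JVM side: owns objects, hands out ids."""

    def __init__(self):
        self._objects = {}
        self._next_id = 0

    def register(self, obj) -> str:
        handle = f"obj-{self._next_id}"
        self._next_id += 1
        self._objects[handle] = obj
        return handle

    def handle(self, request: bytes) -> bytes:
        call = unframe(request)
        target = self._objects[call["target"]]
        result = getattr(target, call["method"])(*call["args"])
        # Plain values are shipped back; anything else stays here behind a handle.
        if isinstance(result, (int, float, str, bool, list, type(None))):
            return frame({"value": result})
        return frame({"handle": self.register(result)})


class RemoteRef:
    """Client-side stub: turns method calls into framed requests."""

    def __init__(self, backend: ToyBackend, handle: str):
        self._backend = backend
        self._handle = handle

    def call(self, method: str, *args):
        request = frame({"target": self._handle, "method": method,
                         "args": list(args)})
        reply = unframe(self._backend.handle(request))
        if "handle" in reply:
            return RemoteRef(self._backend, reply["handle"])
        return reply["value"]
```

A Julia client would play the role of RemoteRef: it would hold only handles, send framed calls to the JVM process, and wrap the returned handles in Julia-friendly types. Every call pays for serialization in both directions, which is the overhead raised earlier in the thread.
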
