Yes.
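To make the "streams of RDDs" question below concrete: in Spark Streaming, a DStream is essentially a sequence of micro-batches, each of which is an ordinary RDD, and a transformation on the stream is applied batch by batch. Here is a toy Python sketch of that idea; every class and method name (`MicroBatchStream`, `reduce_per_batch`) is invented for illustration and is not Spark's API.

```python
# Toy illustration of the "stream of RDDs" (DStream) idea: a stream is
# a sequence of micro-batches, and transforming the stream means
# transforming every micro-batch. Names here are made up; this is a
# sketch of the concept, not Spark's actual API.

class MicroBatchStream:
    def __init__(self, batches):
        self.batches = list(batches)   # each batch stands in for one RDD

    def map(self, f):
        # A stream transformation is just the per-batch transformation.
        return MicroBatchStream([[f(x) for x in batch] for batch in self.batches])

    def reduce_per_batch(self, f, init):
        # Produces one result per micro-batch interval.
        out = []
        for batch in self.batches:
            acc = init
            for x in batch:
                acc = f(acc, x)
            out.append(acc)
        return out

# Three "intervals" of incoming integers:
stream = MicroBatchStream([[1, 2], [3, 4, 5], [6]])
sums = stream.map(lambda x: x * 10).reduce_per_batch(lambda a, b: a + b, 0)
print(sums)  # [30, 120, 60]
```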
On Sunday, November 1, 2015 at 9:34:26 AM UTC-8, Jey Kottalam wrote:
> Are you asking about Spark Streaming support?
>
> On Sun, Nov 1, 2015 at 4:42 AM, Sisyphuss <[email protected]> wrote:
>> http://dl.acm.org/citation.cfm?id=2228301
>>
>> On Saturday, October 31, 2015 at 5:18:01 PM UTC+1, Jey Kottalam wrote:
>>> Could you please define "streams of RDDs"?
>>>
>>> On Sat, Oct 31, 2015 at 12:59 AM, <[email protected]> wrote:
>>>> Is there any implementation of streams of RDDs for Julia?
>>>>
>>>> On Monday, April 20, 2015 at 11:54:10 AM UTC-7, [email protected] wrote:
>>>>> Unfortunately, Spark.jl is an incorrect RDD implementation. Instead of representing transformations as independent, lazily evaluated operations, the package executes every transformation immediately when it is called. This completely undermines the whole purpose of the RDD as a fault-tolerant parallel data structure.
>>>>>
>>>>> On Saturday, April 18, 2015 at 4:04:23 AM UTC-4, Tanmay K. Mohapatra wrote:
>>>>>> There was some attempt made towards a pure Julia RDD in Spark.jl (https://github.com/d9w/Spark.jl). We also have DistributedArrays (https://github.com/JuliaParallel/DistributedArrays.jl), Blocks (https://github.com/JuliaParallel/Blocks.jl) and DataFrames (https://github.com/JuliaStats/DataFrames.jl).
>>>>>>
>>>>>> I wonder if it is possible to leverage any of these for a pure Julia RDD. And MachineLearning.jl (https://github.com/benhamner/MachineLearning.jl) or something similar could probably be the equivalent of MLlib.
>>>>>>
>>>>>> On Friday, April 17, 2015 at 9:24:03 PM UTC+5:30, [email protected] wrote:
>>>>>>> Of course, Spark's data access infrastructure is unbeatable, thanks to mature JVM-based libraries for accessing various data sources and formats (Avro, Parquet, HDFS). That includes SQL support as well. But look at the Python and R bindings: they are just facades for JVM calls. MLlib is written in Scala, as is the Streaming API, so when all of this is called from Python or R, every data transformation still happens at the JVM level. It would be more efficient to write the code in Scala than to use any non-JVM bindings. Think of the RPC and data serialization overhead across the huge volumes of data that need to be processed, and you'll understand why DPark exists. BTW, machine learning libraries on the JVM: good luck. They only work because of the large computational resources thrown at them, and even that has its limits.
>>>>>>>
>>>>>>> On Thursday, April 16, 2015 at 6:29:58 PM UTC-4, Andrei Zh wrote:
>>>>>>>> Julia bindings for Spark would provide much more than just the RDD: they would give us access to multiple big data components for streaming, machine learning, SQL capabilities and much more.
>>>>>>>>
>>>>>>>> On Friday, April 17, 2015 at 12:54:32 AM UTC+3, [email protected] wrote:
>>>>>>>>> However, I wonder how hard it would be to implement the RDD in Julia? It looks straightforward to implement from the RDD paper (https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf). It is a robust abstraction that can be used in any parallel computation.
>>>>>>>>>
>>>>>>>>> On Thursday, April 16, 2015 at 3:32:32 AM UTC-4, Steven Sagaert wrote:
>>>>>>>>>> Yes, that's a solid approach. For my personal Julia-Java integrations I also run the JVM in a separate process.
>>>>>>>>>>
>>>>>>>>>> On Wednesday, April 15, 2015 at 9:30:28 PM UTC+2, [email protected] wrote:
>>>>>>>>>>> > 1) simply wrap the Spark Java API via JavaCall. This is the low-level approach. BTW, I've experimented with JavaCall and found it was unstable and also lacking functionality (e.g. there's no way to shut down the JVM or to create a pool of JVMs analogous to DB connections), so that might need some work before trying the Spark integration.
>>>>>>>>>>>
>>>>>>>>>>> Using JavaCall is not an option, especially since the JVM became closed-source; see https://github.com/aviks/JavaCall.jl/issues/7.
>>>>>>>>>>>
>>>>>>>>>>> The Python bindings are done through Py4J, which is RPC to the JVM. If you look at sparkR (https://github.com/apache/spark/tree/master/R), it is done the same way: sparkR uses an RPC interface to communicate with a Netty-based Spark JVM backend that translates R calls into JVM calls, keeps the SparkContext on the JVM side, and ships serialized data to/from R.
>>>>>>>>>>>
>>>>>>>>>>> So it is just a matter of writing Julia RPC to the JVM and wrapping the necessary Spark methods in a Julia-friendly way.
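
The sparkR-style backend described in the quoted thread boils down to: serialize a (handle, method, args) request on the client side, dispatch it to a live object on the JVM side, and ship a serialized result back, so no JVM object ever lives in the client process. A rough Python sketch of that call pattern follows; the wire format, the `rpc_call`/`backend_dispatch` names, and the fake object table are all invented for illustration (Py4J and the sparkR backend each define their own protocols), and the "backend" here is just an in-process function standing in for a socket to the JVM.

```python
import json

# Stand-in for the JVM side: the real backend holds live JVM objects
# (e.g. the SparkContext) and looks them up by handle. The lambda here
# fakes a remote method for the sake of a self-contained example.
_objects = {"sc": {"parallelize": lambda data: sorted(data)}}

def backend_dispatch(wire_msg):
    # What a Netty/Py4J-style backend does: decode the request, find the
    # target object and method by name, invoke it, encode the reply.
    req = json.loads(wire_msg)
    target = _objects[req["handle"]]
    result = target[req["method"]](*req["args"])
    return json.dumps({"status": "ok", "value": result})

def rpc_call(handle, method, *args):
    # Client side: everything crossing the boundary is serialized, which
    # is exactly the overhead the thread discusses for large data volumes.
    wire_msg = json.dumps({"handle": handle, "method": method, "args": list(args)})
    reply = json.loads(backend_dispatch(wire_msg))
    return reply["value"]

print(rpc_call("sc", "parallelize", [3, 1, 2]))  # [1, 2, 3]
```

A Julia binding built this way would keep the SparkContext behind such handles and only exchange serialized messages, just as sparkR does.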
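
On the lazy-evaluation criticism of Spark.jl above: the point is that a transformation such as `map` should only record lineage (parent RDD plus function), and nothing should execute until an action like `collect` forces it; the recorded lineage is also what allows a lost partition to be recomputed. A minimal Python sketch of the distinction (the `LazyRDD` class and its methods are illustrative only, taken from none of the packages mentioned):

```python
# Minimal sketch of why RDD transformations must be lazy: a transformation
# records its lineage (parent + function) instead of computing anything,
# and only an action such as collect() triggers evaluation. Re-running
# collect() after a failure recomputes the data from lineage alone, which
# is the fault-tolerance property the thread refers to.

class LazyRDD:
    def __init__(self, source, parent=None, fn=None):
        self.source = source      # concrete data, for the root RDD only
        self.parent = parent      # lineage: which RDD this was derived from
        self.fn = fn              # lineage: how it was derived

    def map(self, f):
        # Transformation: O(1), nothing is computed here.
        return LazyRDD(None, parent=self, fn=f)

    def collect(self):
        # Action: walk the lineage back to the source, then apply the
        # recorded functions in order.
        chain = []
        node = self
        while node.parent is not None:
            chain.append(node.fn)
            node = node.parent
        data = list(node.source)
        for f in reversed(chain):
            data = [f(x) for x in data]
        return data

rdd = LazyRDD(range(4)).map(lambda x: x + 1).map(lambda x: x * x)
print(rdd.collect())  # [1, 4, 9, 16]
```

An implementation that instead materialized the data inside `map` (as the thread says Spark.jl did) would lose both the cheap transformation chaining and the recompute-from-lineage fault tolerance.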
