Unfortunately, Spark.jl is an incorrect RDD implementation. Instead of representing transformations as independent, lazily evaluated operations, the package executes every transformation immediately when it is called. This completely undermines the whole purpose of an RDD as a fault-tolerant parallel data structure.
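To make the problem concrete, here is a minimal sketch of lazy evaluation (the types below are hypothetical, not Spark.jl's actual API): transformations only record their parent and the function to apply, and nothing runs until an action such as collect forces the computation. That recorded lineage is exactly what lets an RDD recompute a lost partition.

# Minimal sketch of a lazily evaluated RDD-like structure.
# Hypothetical types for illustration, not the actual Spark.jl API.

abstract type RDD end

struct ParallelCollection <: RDD
    data::Vector
end

struct MappedRDD <: RDD
    parent::RDD    # lineage: enough to recompute a lost partition
    f::Function
end

struct FilteredRDD <: RDD
    parent::RDD
    pred::Function
end

# Transformations: build the lineage graph, do no work.
Base.map(f::Function, rdd::RDD) = MappedRDD(rdd, f)
Base.filter(pred::Function, rdd::RDD) = FilteredRDD(rdd, pred)

# Action: walk the lineage and actually compute.
collect_rdd(r::ParallelCollection) = r.data
collect_rdd(r::MappedRDD) = map(r.f, collect_rdd(r.parent))
collect_rdd(r::FilteredRDD) = filter(r.pred, collect_rdd(r.parent))

rdd = ParallelCollection(collect(1:10))
squares = map(x -> x^2, filter(iseven, rdd))   # nothing is computed yet
collect_rdd(squares)                           # => [4, 16, 36, 64, 100]

In an eager implementation, map and filter would compute their results on the spot, and the lineage needed for recomputation would be gone.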
On Saturday, April 18, 2015 at 4:04:23 AM UTC-4, Tanmay K. Mohapatra wrote:
>
> There was some attempt made towards a pure Julia RDD in Spark.jl
> (https://github.com/d9w/Spark.jl). We also have DistributedArrays
> (https://github.com/JuliaParallel/DistributedArrays.jl), Blocks
> (https://github.com/JuliaParallel/Blocks.jl) and DataFrames
> (https://github.com/JuliaStats/DataFrames.jl).
>
> I wonder if it is possible to leverage any of these for a pure Julia RDD.
> And MachineLearning.jl (https://github.com/benhamner/MachineLearning.jl)
> or something similar could probably be the equivalent of MLlib.
>
> On Friday, April 17, 2015 at 9:24:03 PM UTC+5:30, wil...@gmail.com wrote:
>>
>> Of course, Spark's data access infrastructure is unbeatable, thanks to
>> mature JVM-based libraries for accessing various data sources and
>> formats (Avro, Parquet, HDFS). That includes SQL support as well. But
>> look at the Python and R bindings: they are just facades for JVM calls.
>> MLlib is written in Scala, the Streaming API as well, and all of this is
>> called from Python or R, so all data transformations happen at the JVM
>> level. It would be more efficient to write code in Scala than to use any
>> non-JVM bindings. Think of the RPC and data serialization overhead for
>> the huge volumes of data that need to be processed and you'll understand
>> why DPark exists. As for machine learning libraries on the JVM, good
>> luck. They only work because of the large computational resources thrown
>> at them, and even that has its limits.
>>
>> On Thursday, April 16, 2015 at 6:29:58 PM UTC-4, Andrei Zh wrote:
>>>
>>> Julia bindings for Spark would provide much more than just the RDD;
>>> they would give us access to multiple big data components for
>>> streaming, machine learning, SQL capabilities and much more.
>>>
>>> On Friday, April 17, 2015 at 12:54:32 AM UTC+3, wil...@gmail.com wrote:
>>>>
>>>> However, I wonder how hard it would be to implement an RDD in Julia?
>>>> It looks straightforward from the RDD paper
>>>> (https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf) how to
>>>> implement it. It is a robust abstraction that can be used in any
>>>> parallel computation.
>>>>
>>>> On Thursday, April 16, 2015 at 3:32:32 AM UTC-4, Steven Sagaert wrote:
>>>>>
>>>>> Yes, that's a solid approach. For my personal Julia-Java integrations
>>>>> I also run the JVM in a separate process.
>>>>>
>>>>> On Wednesday, April 15, 2015 at 9:30:28 PM UTC+2, wil...@gmail.com
>>>>> wrote:
>>>>>>
>>>>>>> 1) simply wrap the Spark Java API via JavaCall. This is the low
>>>>>>> level approach. BTW, I've experimented with JavaCall and found it
>>>>>>> was unstable and also lacking functionality (e.g. there's no way to
>>>>>>> shut down the JVM or create a pool of JVMs analogous to DB
>>>>>>> connections), so that might need some work before trying the Spark
>>>>>>> integration.
>>>>>>
>>>>>> Using JavaCall is not an option, especially since the JVM became
>>>>>> closed-source; see https://github.com/aviks/JavaCall.jl/issues/7.
>>>>>>
>>>>>> The Python bindings are done through Py4J, which is RPC to the JVM.
>>>>>> If you look at sparkR (https://github.com/apache/spark/tree/master/R),
>>>>>> it is done in the same way: sparkR uses an RPC interface to
>>>>>> communicate with a Netty-based Spark JVM backend that translates R
>>>>>> calls into JVM calls, keeps the SparkContext on the JVM side, and
>>>>>> ships serialized data to/from R.
>>>>>>
>>>>>> So it is just a matter of writing a Julia RPC layer to the JVM and
>>>>>> wrapping the necessary Spark methods in a Julia-friendly way.
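If anyone wants to experiment with that route, the Julia-side facade could stay quite thin. Below is a minimal sketch under stated assumptions: JVMBackend, jcall_remote, the wire format, and the method names are all made up for illustration, and a real implementation would need a Netty-style backend running on the JVM side, like sparkR's.

# Hypothetical sketch of a Julia facade over a JVM-side RPC backend.
# All names and the wire format here are illustrative; a real backend
# must agree on a serialization format both sides understand (Julia's
# native serializer is used below only as a stand-in).

using Sockets, Serialization

struct JVMBackend
    sock::TCPSocket
end

connect_backend(host = "127.0.0.1", port = 12345) =
    JVMBackend(Sockets.connect(host, port))

# Ship an (object id, method, args) triple to the JVM and read the reply.
function jcall_remote(backend::JVMBackend, objid::String, method::String, args...)
    serialize(backend.sock, (objid, method, collect(args)))
    return deserialize(backend.sock)
end

# Julia-friendly wrappers: the SparkContext itself lives on the JVM side,
# and Julia only ever holds opaque handles to JVM objects.
struct SparkContext
    backend::JVMBackend
    objid::String
end

text_file(sc::SparkContext, path::String) =
    jcall_remote(sc.backend, sc.objid, "textFile", path)

This mirrors the sparkR design described above: the data and the computation stay on the JVM, and only small control messages and serialized results cross the process boundary.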