I've been contemplating writing a high-level wrapper for Spark myself, since I'm interested in both Julia and Spark, but I was waiting for Julia 0.4 to be finalized before even starting. The integration can be done on several levels:

1) Simply wrap the Spark Java API via JavaCall. This is the low-level approach. BTW, I've experimented with JavaCall and found it unstable and lacking in functionality (e.g. there's no way to shut down the JVM or to create a pool of JVMs analogous to a pool of DB connections), so it might need some work before trying the Spark integration.

2) Spark 1.3 now has new high-level interfaces: the dataframe API for accessing data in the form of distributed dataframes, and the pipeline API for composing algorithms via a pipeline framework. By wrapping the Spark dataframe with a Julia dataframe you would quickly have a high-level (data scientist level) interface to Spark. BTW, Spark dataframes can actually also be FASTER than lower-level approaches such as direct Java/Scala method calls, because Spark itself can do more optimizations on a declaratively expressed query (this is similar to how PyData Blaze works). By wrapping the pipeline API one could quickly compose existing Spark algorithms into new ones.

3) As an intermediate approach: wrap the Spark SQL API and use SQL to query the system.

Personally I would start with the dataframe and pipeline APIs. Maybe later on, if needed, add the Spark SQL API, and only do the low-level stuff last, if at all. But before interfacing Spark dataframes with Julia ones, the Julia dataframe should become more powerful: at the very least, && and || should be allowed in indexing, for richer "querying" like in R dataframes.

On Wednesday, April 15, 2015 at 11:37:50 AM UTC+2, Tanmay K. Mohapatra wrote:
>
> This thread is to discuss Julia - Spark integration further.
>
> This is a continuation of discussions from
> https://groups.google.com/forum/#!topic/julia-users/LeCnTmOvUbw (the
> thread topic was misleading and we could not change it).
>
> To summarize briefly, here are a few interesting packages:
> - https://github.com/d9w/Spark.jl
> - https://github.com/jey/Spock.jl
> - https://github.com/benhamner/MachineLearning.jl
> - packages at https://github.com/JuliaParallel
>
> We can discuss approaches and coordinate efforts towards whichever looks
> promising.
>
