I've been contemplating writing a high-level wrapper for Spark myself, since 
I'm interested in both Julia & Spark, but I was waiting for Julia 0.4 to be 
finalized before even starting.
One can do the integration on several levels:
1) Simply wrap the Spark Java API via JavaCall. This is the low-level 
approach. BTW I've experimented with JavaCall and found it unstable & 
also lacking functionality (e.g. there's no way to shut down the JVM or 
create a pool of JVMs analogous to DB connections), so it might need some 
work before trying the Spark integration.
2) Spark 1.3 now has new high-level interfaces: the dataframe API for 
accessing data in the form of distributed dataframes & the pipeline API for 
composing algorithms via a pipeline framework. By wrapping the Spark 
dataframe with the Julia dataframe you would quickly have a high-level 
(data scientist level) interface to Spark. BTW Spark dataframes are 
actually also FASTER than the lower-level approaches like Java/Scala method 
calls or Spark SQL (intermediate level) because Spark itself can do more 
optimizations (this is similar to how PyData Blaze works). By wrapping the 
pipeline API one could quickly compose existing Spark algorithms into new 
ones.
3) For an intermediate approach: wrap the Spark SQL API and use SQL to 
query the system.
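
To illustrate the point in 2) about why a dataframe-style API leaves room for optimization, here is a minimal sketch in plain Python. It is not Spark code and does not use any Spark API: the class name LazyFrame and its methods are hypothetical, made up for this example. The idea shown is only that the operations are recorded as a plan first, so the engine sees the whole query before running it and can, for example, run all steps in a single pass without building intermediate tables.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


@dataclass(frozen=True)
class LazyFrame:
    """Hypothetical toy frame: records operations instead of running them."""

    rows: Tuple[Row, ...]
    plan: Tuple[Tuple[str, object], ...] = ()

    def filter(self, predicate: Callable[[Row], bool]) -> "LazyFrame":
        # Nothing is evaluated here; the step is only appended to the plan.
        return LazyFrame(self.rows, self.plan + (("filter", predicate),))

    def select(self, *columns: str) -> "LazyFrame":
        # Same as filter: only record which columns to keep.
        return LazyFrame(self.rows, self.plan + (("select", columns),))

    def collect(self) -> List[Row]:
        # The whole plan is known at this point, so every row is pushed
        # through all steps in one pass; no intermediate list is built
        # between the individual steps.
        result: List[Row] = []
        for row in self.rows:
            current = row
            keep = True
            for kind, arg in self.plan:
                if kind == "filter":
                    if not arg(current):
                        keep = False
                        break
                else:  # kind == "select"
                    current = {name: current[name] for name in arg}
            if keep:
                result.append(current)
        return result


people = LazyFrame(rows=(
    {"name": "Ann", "age": 34},
    {"name": "Bob", "age": 25},
    {"name": "Cid", "age": 41},
))

# Building the query only records two steps; collect() runs them.
query = people.filter(lambda r: r["age"] > 30).select("name")
names = query.collect()
```

A real engine would do much more with the recorded plan (reordering steps, pushing work to the data source), but the sketch shows why a wrapper that hands Spark a whole query has an advantage over one that issues individual method calls.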

Personally I would start with the dataframe & pipeline APIs. Maybe later 
on, if needed, add the Spark SQL API, and only do the low-level stuff last. 
But before interfacing Spark dataframes with Julia ones, the Julia 
dataframe should become more powerful: at least && and || should be allowed 
in indexing for richer "querying" like in R dataframes.
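
To make concrete what kind of "querying" is meant, here is a small sketch in plain Python of combining per-row conditions with and/or before indexing. It does not use the Julia DataFrames or R API; the helper names mask_and, mask_or and take are hypothetical and the table is made-up sample data.

```python
from typing import Dict, List

Table = Dict[str, List]


def mask_and(a: List[bool], b: List[bool]) -> List[bool]:
    # Element-wise "and" of two row masks of equal length.
    return [x and y for x, y in zip(a, b)]


def mask_or(a: List[bool], b: List[bool]) -> List[bool]:
    # Element-wise "or" of two row masks of equal length.
    return [x or y for x, y in zip(a, b)]


def take(table: Table, mask: List[bool]) -> Table:
    # Keep only the rows whose mask entry is True, column by column.
    return {
        name: [value for value, keep in zip(column, mask) if keep]
        for name, column in table.items()
    }


table = {
    "name": ["Ann", "Bob", "Cid"],
    "age": [34, 25, 41],
    "city": ["Oslo", "Rome", "Oslo"],
}

older = [age > 30 for age in table["age"]]
in_rome = [city == "Rome" for city in table["city"]]
over_40 = [age > 40 for age in table["age"]]

# "older AND not in Rome" and "in Rome OR over 40" as row selections.
older_not_rome = take(table, mask_and(older, [not r for r in in_rome]))
rome_or_over_40 = take(table, mask_or(in_rome, over_40))
```

The request above amounts to being able to write such combined conditions directly inside the dataframe's indexing expression.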

On Wednesday, April 15, 2015 at 11:37:50 AM UTC+2, Tanmay K. Mohapatra 
wrote:
>
> This thread is to discuss Julia - Spark integration further.
>
> This is a continuation of discussions from 
> https://groups.google.com/forum/#!topic/julia-users/LeCnTmOvUbw (the 
> thread topic was misleading and we could not change it).
>
> To summarize briefly, here are a few interesting packages:
> - https://github.com/d9w/Spark.jl
> - https://github.com/jey/Spock.jl
> - https://github.com/benhamner/MachineLearning.jl 
> - packages at https://github.com/JuliaParallel
>
> We can discuss approaches and coordinate efforts towards whichever looks 
> promising.
>
