Spark integration is a tricky thing. The Python and R bindings go to great 
lengths to map language-specific functions into Spark JVM library calls. I 
guess the same could be done with the JavaCall.jl package, in a manner 
similar to SparkR. Look at slide 20 here: 
http://spark-summit.org/wp-content/uploads/2014/07/SparkR-SparkSummit.pdf.
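To make the idea concrete, here is a minimal sketch of that mechanism using 
JavaCall.jl's `@jimport`/`jcall` API. The classpath entry is hypothetical 
(you'd point it at the real Spark assembly jar); I just call `java.lang.Math` 
to show the call path a SparkR-style binding would use:

```julia
using JavaCall  # assumes the JavaCall.jl package is installed

# Start an embedded JVM. The jar path is hypothetical -- for a real
# binding you would put the Spark assembly jar on the classpath.
JavaCall.init(["-Djava.class.path=/path/to/spark-assembly.jar"])

# The same mechanism SparkR uses: look up a JVM class and invoke a
# method on it. Here we call java.lang.Math.abs just to prove the path.
jmath = @jimport java.lang.Math
x = jcall(jmath, "abs", jint, (jint,), -5)
```

A real binding would wrap `org.apache.spark.api.java.JavaSparkContext` and 
friends the same way, hiding the `jcall` plumbing behind idiomatic Julia 
functions.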

Spark is a clever distributed data access paradigm which grew out of 
Hadoop's slowness and limitations. I believe that Julia could provide a 
competitive model for distributed data storage, given Julia's parallel 
computing approach. Right now, I am writing Julia bindings for Mesos. The 
idea is to provide, through a ClusterManager, access to any Mesos-supervised 
distributed system and run Julia code in that environment. In conjunction 
with DistributedArrays and DataFrames, it would create a powerful toolbox 
for building distributed systems.
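Roughly, the usage I have in mind looks like this (a sketch only: 
`MesosManager` is the binding under development and is hypothetical, while 
`addprocs` with a custom ClusterManager and DistributedArrays are standard 
Julia machinery):

```julia
using Distributed, DistributedArrays

# Hypothetical cluster manager: the launch/manage methods of the
# ClusterManager interface would translate Julia worker requests
# into Mesos task offers (implementation omitted).
struct MesosManager <: ClusterManager
    master::String
    ntasks::Int
end

# Spin up workers on the Mesos cluster (hypothetical call).
addprocs(MesosManager("mesos://master:5050", 8))

# From here on it is plain Julia: a distributed array spread across
# the Mesos-launched workers, with a parallel reduction over it.
A = drand(10_000, 10_000)
s = sum(A)
```

The point is that once the ClusterManager exists, everything above 
`addprocs` is just the existing Julia parallel stack.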
       
After all, machine learning on the JVM, really?!

On Saturday, April 4, 2015 at 11:21:35 AM UTC-4, Jeff Waller wrote:
>
>
>
> On Saturday, April 4, 2015 at 2:22:38 AM UTC-4, Viral Shah wrote:
>>
>> I am changing the subject of this thread from GSOC to Spark. I was just 
>> looking around and found this:
>>
>> https://github.com/d9w/Spark.jl 
>>
>
> Hey, wow, that's interesting, is this an attempt to reimplement Spark or 
> create a binding? 
>
>> The real question is with all the various systems out there, what is the 
>> level of abstraction that julia should work with. Julia's DataFrames is one 
>> level of abstraction, which could also transparently map to csv files 
>> (rather than doing readtable), or a database table, or an HBase table. Why 
>> would Spark users want Julia, and why would Julia users want Spark? I guess 
>> if we can nail this down - the rest of the integration is probably easy to 
>> figure out.
>>
>  
> As a potential user, I will try to answer in a few parts
>
> There are currently 3 official language bindings (Java, Scala, Python), 
> some unofficial ones as well, and R in the works; one thing users would 
> want is whatever the other bindings get, but in the language they prefer, 
> with an abstraction similar to the other language bindings', so that 
> examples in other languages can be readily translated to theirs.
>
> Whatever the abstraction turns out to be, there are at least 3 big things 
> that Spark offers: simplification, speed, and lazy evaluation.  The 
> abstraction should not make those cumbersome.
>
> For me, the advantage of Julia is the syntax, the speed, and the 
> connection to all of the Julia packages, and because of that the community 
> of Julia package authors.  The advantage of Spark is the machinery of 
> Spark, access to MLlib, and likewise the community of Spark users.
>
> How about an example?  This is straight from the Spark examples -- good 
> old K-means.  I'm assuming the Python binding because Julia and Python are 
> probably the most alike; how would we expect this to look in Julia?
>
> from pyspark.mllib.clustering import KMeans
> from numpy import array
> from math import sqrt
>
> # Load and parse the data
> data = sc.textFile("data/mllib/kmeans_data.txt")
> parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
>
> # Build the model (cluster the data)
> clusters = KMeans.train(parsedData, 2, maxIterations=10,
>         runs=10, initializationMode="random")
>
> # Evaluate clustering by computing Within Set Sum of Squared Errors
> def error(point):
>     center = clusters.centers[clusters.predict(point)]
>     return sqrt(sum([x**2 for x in (point - center)]))
>
> WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
> print("Within Set Sum of Squared Error = " + str(WSSSE))
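
For comparison, here is a rough sketch of how that same example might read 
through a hypothetical Julia binding that mirrors the PySpark API. None of 
`SparkContext`, `text_file`, `kmeans_train`, or `predict` below exist in any 
released Julia package; this is only what a translation could look like:

```julia
# Hypothetical Julia binding mirroring the PySpark RDD API.

sc = SparkContext(master="local")

# Load and parse the data
data = text_file(sc, "data/mllib/kmeans_data.txt")
parsed = map(line -> [parse(Float64, x) for x in split(line)], data)

# Build the model (cluster the data)
clusters = kmeans_train(parsed, 2, max_iterations=10,
                        runs=10, initialization_mode="random")

# Evaluate clustering by computing Within Set Sum of Squared Errors
err(p) = sqrt(sum((p .- clusters.centers[predict(clusters, p)]).^2))
wssse = reduce(+, map(err, parsed))
println("Within Set Sum of Squared Error = $wssse")
```

With anonymous functions and broadcasting, the Julia version could be at 
least as compact as the Python one, which suggests the abstraction gap is 
small.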