It would be nice to co-ordinate these efforts under the JuliaParallel organization.
-viral

On Sunday, April 5, 2015 at 9:39:51 AM UTC+5:30, [email protected] wrote:
>
> Spark integration is a tricky thing. The Python and R bindings go to great
> lengths to map language-specific functions into Spark JVM library calls. I
> guess the same could be done with the JavaCall.jl package, in a manner
> similar to SparkR. Look at slide 20 from here:
> http://spark-summit.org/wp-content/uploads/2014/07/SparkR-SparkSummit.pdf
>
> Spark is a clever distributed data access paradigm which grew out of
> Hadoop's slowness and limitations. I believe that Julia could provide a
> competitive model for distributed data storage, given Julia's parallel
> computing approach. Right now, I am writing Julia bindings for Mesos. The
> idea is to provide, through a ClusterManager, access to any
> Mesos-supervised distributed system and run Julia code in that
> environment. In conjunction with DistributedArrays and DataFrames, this
> will create a powerful toolbox for building distributed systems.
>
> After all, machine learning on the JVM, really?!
>
> On Saturday, April 4, 2015 at 11:21:35 AM UTC-4, Jeff Waller wrote:
>>
>> On Saturday, April 4, 2015 at 2:22:38 AM UTC-4, Viral Shah wrote:
>>>
>>> I am changing the subject of this thread from GSOC to Spark. I was just
>>> looking around and found this:
>>>
>>> https://github.com/d9w/Spark.jl
>>>
>>
>> Hey, wow, that's interesting. Is this an attempt to reimplement Spark or
>> create a binding?
>>
>>> The real question is, with all the various systems out there, what is
>>> the level of abstraction that Julia should work with? Julia's DataFrames
>>> is one level of abstraction, which could also transparently map to CSV
>>> files (rather than doing readtable), or a database table, or an HBase
>>> table. Why would Spark users want Julia, and why would Julia users want
>>> Spark?
>>> I guess if we can nail this down, the rest of the integration is
>>> probably easy to figure out.
>>>
>>
>> As a potential user, I will try to answer in a few parts.
>>
>> There are currently 3 official language bindings (Java, Scala, Python),
>> some unofficial ones as well, and R in the works. One thing that users
>> would want is whatever the others get, but in the language they desire,
>> with an abstraction similar to the other language bindings, so that
>> examples in other languages can be readily translated to theirs.
>>
>> Whatever the abstraction turns out to be, there are at least 3 big things
>> that Spark offers: simplification, speed, and lazy evaluation. The
>> abstraction should not make those cumbersome.
>>
>> For me, the advantage of Julia is the syntax, the speed, and the
>> connection to all of the Julia packages, and because of that, the
>> community of Julia package authors. The advantage of Spark is the
>> machinery of Spark, access to MLlib, and likewise the community of Spark
>> users.
>>
>> How about an example? This is simply from the Spark examples: good old
>> k-means. This assumes the Python binding, because Julia and Python are
>> probably the most alike. How would we expect this to look using Julia?
>>
>> from pyspark.mllib.clustering import KMeans
>> from numpy import array
>> from math import sqrt
>>
>> # Load and parse the data
>> data = sc.textFile("data/mllib/kmeans_data.txt")
>> parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
>>
>> # Build the model (cluster the data)
>> clusters = KMeans.train(parsedData, 2, maxIterations=10,
>>                         runs=10, initializationMode="random")
>>
>> # Evaluate clustering by computing Within Set Sum of Squared Errors
>> def error(point):
>>     center = clusters.centers[clusters.predict(point)]
>>     return sqrt(sum([x**2 for x in (point - center)]))
>>
>> WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
>> print("Within Set Sum of Squared Error = " + str(WSSSE))
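For readers without a Spark cluster at hand, the WSSSE computation in the quoted snippet can be mirrored in plain Python: the built-in `map` and `functools.reduce` stand in for the RDD operations, and fixed cluster centers stand in for the trained model. The `centers`, `points`, and `predict` names below are illustrative stand-ins, not part of PySpark's API.

```python
from functools import reduce
from math import sqrt

# Stand-ins for a trained model: two fixed cluster centers.
centers = [(0.0, 0.0), (9.0, 9.0)]

# Parsed data points (what parsedData holds after the map step).
points = [(0.0, 0.1), (0.2, 0.0), (9.0, 8.8), (8.9, 9.1)]

def predict(point):
    # Index of the nearest center, mimicking clusters.predict.
    dists = [sum((x - c) ** 2 for x, c in zip(point, center))
             for center in centers]
    return dists.index(min(dists))

def error(point):
    # Euclidean distance from a point to its assigned center.
    center = centers[predict(point)]
    return sqrt(sum((x - c) ** 2 for x, c in zip(point, center)))

# Same map/reduce shape as the Spark code, over an ordinary list.
WSSSE = reduce(lambda x, y: x + y, map(error, points))
print("Within Set Sum of Squared Error = " + str(WSSSE))
```

Whatever Julia abstraction emerges would presumably keep this same shape: a parse step, a lazy map, and a reducing action.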

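Of the three Spark advantages named in the thread (simplification, speed, lazy evaluation), lazy evaluation is the easiest to demonstrate outside Spark. In Python, generator expressions give the same behavior as RDD transformations: work is only recorded, and nothing executes until an action consumes the pipeline. This is a toy sketch of that evaluation model, not Spark's API; the `parse` and `calls` names are illustrative.

```python
from functools import reduce

calls = []  # records when parsing actually happens

def parse(line):
    calls.append(line)
    return [float(x) for x in line.split(' ')]

lines = ["1.0 2.0", "3.0 4.0", "5.0 6.0"]

# "Transformation": builds a generator; no work is done yet.
parsed = (parse(line) for line in lines)
assert calls == []  # nothing has been parsed so far

# "Action": forces evaluation of the whole pipeline at once.
total = reduce(lambda acc, row: acc + sum(row), parsed, 0.0)
print(total)            # 21.0
assert len(calls) == 3  # all three lines parsed only now
```

Any Julia binding would face the same design question: whether `map` on a distributed collection should build a deferred pipeline like this, or evaluate eagerly as Base `map` does.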