It would be nice to co-ordinate these efforts under the JuliaParallel organization.
-viral

On Sunday, April 5, 2015 at 9:39:51 AM UTC+5:30, [email protected] wrote:
>
> Spark integration is a tricky thing. The Python and R bindings go to great
> lengths to map language-specific functions into Spark JVM library calls. I
> guess the same could be done with the JavaCall.jl package, in a manner
> similar to SparkR. Look at slide 20 from here:
> http://spark-summit.org/wp-content/uploads/2014/07/SparkR-SparkSummit.pdf
>
> Spark is a clever distributed data access paradigm which grew out of
> Hadoop's slowness and limitations. I believe that Julia could provide a
> competitive model for distributed data storage, given Julia's parallel
> computing approach. Right now, I am writing Julia bindings for Mesos. The
> idea is to provide, through a ClusterManager, access to any
> Mesos-supervised distributed system and run Julia code in that
> environment. In conjunction with DistributedArrays and DataFrames, this
> will create a powerful toolbox for building distributed systems.
>
> After all, machine learning on the JVM, really?!
>
> On Saturday, April 4, 2015 at 11:21:35 AM UTC-4, Jeff Waller wrote:
>>
>> On Saturday, April 4, 2015 at 2:22:38 AM UTC-4, Viral Shah wrote:
>>>
>>> I am changing the subject of this thread from GSOC to Spark. I was just
>>> looking around and found this:
>>>
>>> https://github.com/d9w/Spark.jl
>>>
>>
>> Hey, wow, that's interesting. Is this an attempt to reimplement Spark or
>> create a binding?
>>
>>> The real question is, with all the various systems out there, what is
>>> the level of abstraction that Julia should work with? Julia's DataFrames
>>> is one level of abstraction, which could also transparently map to CSV
>>> files (rather than doing readtable), or a database table, or an HBase
>>> table. Why would Spark users want Julia, and why would Julia users want
>>> Spark?
>>> I guess if we can nail this down, the rest of the integration is
>>> probably easy to figure out.
>>>
>>
>> As a potential user, I will try to answer in a few parts.
>>
>> There are currently 3 official language bindings (Java, Scala, Python),
>> some unofficial ones as well, and R in the works. One thing that users
>> would want is whatever the others get, but in the language they desire,
>> with an abstraction similar to the other language bindings, so that
>> examples in other languages can be readily translated to theirs.
>>
>> Whatever the abstraction turns out to be, there are at least 3 big things
>> that Spark offers: simplification, speed, and lazy evaluation. The
>> abstraction should not make those cumbersome.
>>
>> For me, the advantage of Julia is the syntax, the speed, and the
>> connection to all of the Julia packages, and because of that, the
>> community of Julia package authors. The advantage of Spark is the
>> machinery of Spark, access to MLlib, and likewise the community of Spark
>> users.
>>
>> How about an example? This is simply from the Spark examples: good old
>> k-means. This assumes the Python binding, because Julia and Python are
>> probably the most alike. How would we expect this to look using Julia?
>>
>> from pyspark.mllib.clustering import KMeans
>> from numpy import array
>> from math import sqrt
>>
>> # Load and parse the data
>> data = sc.textFile("data/mllib/kmeans_data.txt")
>> parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
>>
>> # Build the model (cluster the data)
>> clusters = KMeans.train(parsedData, 2, maxIterations=10,
>>                         runs=10, initializationMode="random")
>>
>> # Evaluate clustering by computing Within Set Sum of Squared Errors
>> def error(point):
>>     center = clusters.centers[clusters.predict(point)]
>>     return sqrt(sum([x**2 for x in (point - center)]))
>>
>> WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
>> print("Within Set Sum of Squared Error = " + str(WSSSE))
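For readers without a Spark cluster at hand, the WSSSE computation in the quoted snippet can be mirrored in plain Python: the built-in `map` and `functools.reduce` stand in for the RDD operations, and fixed cluster centers stand in for the trained model. The `centers`, `points`, and `predict` names below are illustrative stand-ins, not part of PySpark's API.

```python
from functools import reduce
from math import sqrt

# Stand-ins for a trained model: two fixed cluster centers.
centers = [(0.0, 0.0), (9.0, 9.0)]

# Parsed data points (what parsedData holds after the map step).
points = [(0.0, 0.1), (0.2, 0.0), (9.0, 8.8), (8.9, 9.1)]

def predict(point):
    # Index of the nearest center, mimicking clusters.predict.
    dists = [sum((x - c) ** 2 for x, c in zip(point, center))
             for center in centers]
    return dists.index(min(dists))

def error(point):
    # Euclidean distance from a point to its assigned center.
    center = centers[predict(point)]
    return sqrt(sum((x - c) ** 2 for x, c in zip(point, center)))

# Same map/reduce shape as the Spark code, over an ordinary list.
WSSSE = reduce(lambda x, y: x + y, map(error, points))
print("Within Set Sum of Squared Error = " + str(WSSSE))
```

Whatever Julia abstraction emerges would presumably keep this same shape: a parse step, a lazy map, and a reducing action.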

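Of the three Spark advantages named in the thread (simplification, speed, lazy evaluation), lazy evaluation is the easiest to demonstrate outside Spark. In Python, generator expressions give the same behavior as RDD transformations: work is only recorded, and nothing executes until an action consumes the pipeline. This is a toy sketch of that evaluation model, not Spark's API; the `parse` and `calls` names are illustrative.

```python
from functools import reduce

calls = []  # records when parsing actually happens

def parse(line):
    calls.append(line)
    return [float(x) for x in line.split(' ')]

lines = ["1.0 2.0", "3.0 4.0", "5.0 6.0"]

# "Transformation": builds a generator; no work is done yet.
parsed = (parse(line) for line in lines)
assert calls == []  # nothing has been parsed so far

# "Action": forces evaluation of the whole pipeline at once.
total = reduce(lambda acc, row: acc + sum(row), parsed, 0.0)
print(total)            # 21.0
assert len(calls) == 3  # all three lines parsed only now
```

Any Julia binding would face the same design question: whether `map` on a distributed collection should build a deferred pipeline like this, or evaluate eagerly as Base `map` does.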