Spark is not storage; it is a distributed computation framework meant to run on big-data architectures (a very high-level intro/definition). It provides in-memory, batch-oriented map/reduce-style jobs. It is not a pure per-tuple streaming system like Storm; instead it groups incoming tuples into micro-batches, and that batch model is why you can run complex ML algorithms relatively fast on it.
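A minimal sketch of what that micro-batching looks like in practice, using Spark Streaming in Scala. The socket source on localhost:9999 and the word-count job are made-up placeholders for illustration, not part of any particular setup:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("micro-batch-sketch").setMaster("local[2]")
    // Tuples are grouped into 5-second micro-batches instead of being
    // processed one at a time as in Storm.
    val ssc = new StreamingContext(conf, Seconds(5))
    // Hypothetical source: a text socket on localhost:9999.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)   // in-memory map/reduce within each micro-batch
    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Each micro-batch is processed as an ordinary in-memory Spark job, which is what makes iterative workloads like ML training comparatively fast.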
I think we just discussed this a short while ago, when a similar question (Storm vs. Spark, I think) was raised by you earlier. Here is the link for that discussion: http://markmail.org/message/lc4icuw4hobul6oh

Regards,
Shahab

On Sat, Aug 30, 2014 at 12:16 PM, Adaryl "Bob" Wakefield, MBA <adaryl.wakefi...@hotmail.com> wrote:

> Isn’t it a bit overkill to use both Storm and Spark in the architecture? You say load it “into” Spark. Is Spark separate storage?
>
> B.
>
> *From:* Alex Kamil <alex.ka...@gmail.com>
> *Sent:* Friday, August 29, 2014 10:46 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Machine Learning With Cassandra
>
> Adaryl,
>
> Most ML algorithms are based on some form of numerical optimization, using something like online gradient descent <http://en.wikipedia.org/wiki/Stochastic_gradient_descent> or conjugate gradient <http://www.math.buffalo.edu/~pitman/courses/cor502/odes/node4.html> (e.g. in SVM classifiers). In its simplest form this is a nested FOR loop in which, on each iteration, you update the weights or parameters of the model until reaching some convergence threshold that minimizes the prediction error (usually the goal is to minimize a loss function <http://en.wikipedia.org/wiki/Loss_function>, as in the popular least squares <http://en.wikipedia.org/wiki/Least_squares> technique). You could parallelize this loop with a brute-force divide-and-conquer approach: map a chunk of data to each node, compute a partial sum there, aggregate the results from each node into a global sum in a 'reduce' stage, and repeat this map-reduce cycle until convergence. You can look up distributed gradient descent <http://scholar.google.com/scholar?hl=en&q=gradient+descent+with+map-reduc> or check out Mahout <https://mahout.apache.org/users/recommender/matrix-factorization.html> or Spark MLlib <https://spark.apache.org/docs/latest/mllib-guide.html> for examples. Alternatively you could use something like GraphLab <http://graphlab.com/products/create/docs/graphlab.toolkits.recommender.html>.
>
> Cassandra can serve as the data store from which you load the training data, e.g. into Spark using this connector <https://github.com/datastax/spark-cassandra-connector>, and then train the model using MLlib or Mahout (which has Spark bindings, I believe). Once you have trained the model, you can save its parameters back to Cassandra. The next stage is using the model to classify new data, e.g. to recommend similar items based on a log of new purchases; there you could once again use Spark, or Storm with something like this <https://github.com/pmerienne/trident-ml>.
>
> Alex
>
> On Fri, Aug 29, 2014 at 10:24 PM, Adaryl "Bob" Wakefield, MBA <adaryl.wakefi...@hotmail.com> wrote:
>
>> I’m planning to speak at a local meet-up, and I need to know if what I have in my head is even possible.
>>
>> I want to give an example of working with data in Cassandra. I have data coming in through Kafka and Storm, and I’m saving it off to Cassandra (this is only on paper at this point). I then want to run an ML algorithm over the data. My problem is that while my data is distributed, I don’t know how to do the analysis in a distributed manner. I could certainly use R, but processing the data on a single machine would seem to defeat the purpose of all this scalability.
>>
>> What is my solution?
>> B.
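To make the map-reduce gradient descent idea from Alex's note concrete, here is a rough Spark sketch in Scala. It is only an illustration of the divide-and-conquer pattern under assumed inputs: the Point class, the least-squares gradient, the fixed learning rate, and the iteration budget are all made up, and a real implementation would test a convergence threshold rather than looping a fixed number of times.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical training example: a feature vector plus a numeric label.
case class Point(features: Array[Double], label: Double)

def train(data: RDD[Point], dims: Int, lr: Double, iters: Int): Array[Double] = {
  val n = data.count().toDouble
  var w = Array.fill(dims)(0.0)                     // model parameters
  for (_ <- 1 to iters) {
    val grad = data
      .map { p =>                                   // "map": partial gradient per chunk
        val pred = w.zip(p.features).map { case (wi, xi) => wi * xi }.sum
        p.features.map(_ * (pred - p.label))
      }
      .reduce { (g1, g2) =>                         // "reduce": global sum across nodes
        g1.zip(g2).map { case (a, b) => a + b }
      }
    // The driver updates the weights, then the map-reduce cycle repeats.
    w = w.zip(grad).map { case (wi, gi) => wi - lr * gi / n }
  }
  w
}
```

And a hedged sketch of the load-train-save loop Alex outlines, using the DataStax spark-cassandra-connector with MLlib's linear regression. It assumes `sc` is an existing SparkContext configured with `spark.cassandra.connection.host`, and the keyspace, table, and column names ("ks", "purchases", "label", "f1", "f2", "models") are hypothetical placeholders for whatever schema your Kafka/Storm pipeline writes:

```scala
import com.datastax.spark.connector._               // adds cassandraTable / saveToCassandra
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Load training data from Cassandra (keyspace "ks", table "purchases",
// and the column names are hypothetical).
val training = sc.cassandraTable("ks", "purchases").map { row =>
  LabeledPoint(
    row.getDouble("label"),
    Vectors.dense(row.getDouble("f1"), row.getDouble("f2"))
  )
}.cache()

// Train a simple MLlib model with distributed SGD (100 iterations).
val model = LinearRegressionWithSGD.train(training, 100)

// Save the learned parameters back to Cassandra (table "models" is hypothetical).
sc.parallelize(model.weights.toArray.zipWithIndex)
  .map { case (weight, idx) => (idx, weight) }
  .saveToCassandra("ks", "models", SomeColumns("idx", "weight"))
```

From there, scoring new purchases is just another Spark (or Storm/trident-ml) job that reads the saved weights back out of Cassandra.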