Spark is not storage; rather, it is a processing framework meant to run on
big data over a distributed architecture (a very high-level
intro/definition). It provides a batched version of in-memory
map/reduce-like jobs. It is not purely streaming like Storm; instead it
processes collections of tuples in micro-batches, and thus you can run
complex ML algorithms relatively quickly.
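
For a flavor of what an in-memory map/reduce-style Spark job looks like,
here is a minimal Scala sketch (the app name and the local master setting
are just placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    object SparkIntroSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("intro-sketch").setMaster("local[*]"))
        // Distribute a collection, square each element in memory (map),
        // then sum the squares across the cluster (reduce).
        val total = sc.parallelize(1 to 1000000)
          .map(x => x.toLong * x)
          .reduce(_ + _)
        println(total)
        sc.stop()
      }
    }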

I think we discussed this a short while ago, when you raised a similar
question (Storm vs. Spark, I think). Here is the link to that discussion:
http://markmail.org/message/lc4icuw4hobul6oh


Regards,
Shahab


On Sat, Aug 30, 2014 at 12:16 PM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefi...@hotmail.com> wrote:

>   Isn’t it a bit overkill to use both Storm and Spark in the architecture?
> You say load it “into” Spark. Is Spark separate storage?
>
> B.
>
>  *From:* Alex Kamil <alex.ka...@gmail.com>
> *Sent:* Friday, August 29, 2014 10:46 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Machine Learning With Cassandra
>
>  Adaryl,
>
> Most ML algorithms are based on some form of numerical optimization,
> using something like online gradient descent
> <http://en.wikipedia.org/wiki/Stochastic_gradient_descent> or conjugate
> gradient
> <http://www.math.buffalo.edu/~pitman/courses/cor502/odes/node4.html> (e.g.
> in SVM classifiers). In its simplest form this is a nested FOR loop where
> on each iteration you update the weights or parameters of the model until
> you reach some convergence threshold that minimizes the prediction error
> (usually the goal is to minimize a loss function
> <http://en.wikipedia.org/wiki/Loss_function>, as in the popular least
> squares <http://en.wikipedia.org/wiki/Least_squares> technique). You
> could parallelize this loop using a brute-force divide-and-conquer
> approach: map a chunk of data to each node and compute a partial sum
> there, then aggregate the results from each node into a global sum in a
> 'reduce' stage, and repeat this map-reduce cycle until convergence. You
> can look up distributed gradient descent
> <http://scholar.google.com/scholar?hl=en&q=gradient+descent+with+map-reduc>
> or check out Mahout
> <https://mahout.apache.org/users/recommender/matrix-factorization.html>
> or Spark MLlib <https://spark.apache.org/docs/latest/mllib-guide.html>
> for examples. Alternatively, you can use something like GraphLab
> <http://graphlab.com/products/create/docs/graphlab.toolkits.recommender.html>.
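>
> To make that map/partial-sum/reduce cycle concrete, here is a minimal
> hand-rolled least-squares sketch in Spark/Scala (the toy data, learning
> rate, and fixed iteration count are made up for illustration; in practice
> MLlib does this for you):
>
>     import org.apache.spark.{SparkConf, SparkContext}
>
>     object GradientDescentSketch {
>       def main(args: Array[String]): Unit = {
>         val sc = new SparkContext(
>           new SparkConf().setAppName("gd-sketch").setMaster("local[*]"))
>         // Toy training set: y = 2*x0 + 3*x1, as (features, label) pairs.
>         val data = sc.parallelize(Seq(
>           (Array(1.0, 1.0), 5.0),
>           (Array(2.0, 0.0), 4.0),
>           (Array(0.0, 2.0), 6.0))).cache()
>         val n = data.count().toDouble
>         var w = Array(0.0, 0.0)  // model weights
>         val alpha = 0.1          // learning rate
>         for (_ <- 1 to 100) {
>           // 'map' stage: each node computes a partial gradient of the
>           // squared loss over its own chunk of the data.
>           val grad = data.map { case (x, y) =>
>             val err = x.zip(w).map { case (xi, wi) => xi * wi }.sum - y
>             x.map(_ * err)
>           // 'reduce' stage: aggregate partial sums into a global gradient.
>           }.reduce((g1, g2) => g1.zip(g2).map { case (a, b) => a + b })
>           // Update the weights on the driver; the new w ships out with
>           // the closure on the next iteration.
>           w = w.zip(grad).map { case (wi, gi) => wi - alpha * gi / n }
>         }
>         println(w.mkString(", "))  // approaches 2.0, 3.0
>         sc.stop()
>       }
>     }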
>
> Cassandra can serve as the data store from which you load the training
> data into Spark, e.g. using this connector
> <https://github.com/datastax/spark-cassandra-connector>, and then train
> the model using MLlib or Mahout (which has Spark bindings, I believe).
> Once you have trained the model, you can save its parameters back to
> Cassandra. The next stage is using the model to classify new data, e.g.
> to recommend similar items based on a log of new purchases; there you
> could once again use Spark, or Storm with something like this
> <https://github.com/pmerienne/trident-ml>.
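>
> As a rough sketch of the load/train/save steps above (the keyspace,
> table, and column names are hypothetical, and LinearRegressionWithSGD
> stands in for whatever model fits your problem):
>
>     import com.datastax.spark.connector._
>     import org.apache.spark.{SparkConf, SparkContext}
>     import org.apache.spark.mllib.linalg.Vectors
>     import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
>
>     object TrainFromCassandra {
>       def main(args: Array[String]): Unit = {
>         val conf = new SparkConf().setAppName("train-from-cassandra")
>           .set("spark.cassandra.connection.host", "127.0.0.1")
>         val sc = new SparkContext(conf)
>         // Load training rows from a hypothetical ml.training_data table
>         // with columns (label double, f1 double, f2 double).
>         val training = sc.cassandraTable("ml", "training_data")
>           .map(row => LabeledPoint(
>             row.getDouble("label"),
>             Vectors.dense(row.getDouble("f1"), row.getDouble("f2"))))
>           .cache()
>         // Train a linear model with stochastic gradient descent.
>         val model = LinearRegressionWithSGD.train(training, 100)
>         // Save the learned weights back to a hypothetical ml.models
>         // table with columns (name text, weights list<double>).
>         sc.parallelize(Seq(("regression", model.weights.toArray.toSeq)))
>           .saveToCassandra("ml", "models", SomeColumns("name", "weights"))
>         sc.stop()
>       }
>     }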
>
> Alex
>
>
>
>
> On Fri, Aug 29, 2014 at 10:24 PM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefi...@hotmail.com> wrote:
>
>>   I’m planning to speak at a local meet-up and I need to know if what I
>> have in my head is even possible.
>>
>>  I want to give an example of working with data in Cassandra. I have
>> data coming in through Kafka and Storm and I’m saving it off to Cassandra
>> (this is only on paper at this point). I then want to run an ML algorithm
>> over the data. My problem here is, while my data is distributed, I don’t
>> know how to do the analysis in a distributed manner. I could certainly
>> use R, but processing the data on a single machine would seem to defeat
>> the purpose of all this scalability.
>>
>>  What is my solution?
>>  B.
>>
>
>
