There are other machine learning frameworks that scale better than Hadoop + Mahout. If the kind of machine learning you're doing is really large and speed matters, take a look at Vowpal Wabbit: http://hunch.net/~vw/

On Sat, Aug 30, 2014 at 4:58 PM, Adaryl "Bob" Wakefield, MBA <adaryl.wakefi...@hotmail.com> wrote:

> Ahh, thanks. Yeah, my searches for "machine learning with Cassandra"
> were not turning up much useful stuff.
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>
> *From:* James Horey <j...@opencore.io>
> *Sent:* Saturday, August 30, 2014 3:34 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Machine Learning With Cassandra
>
> If you want distributed machine learning, you can use either Mahout
> (runs on Hadoop) or Spark (MLlib). If you choose the Hadoop route,
> Datastax provides a filesystem (CFS) to interact with data stored in
> Cassandra. Otherwise you can try to use the Cassandra InputFormat (not
> as simple, but plenty of people use it).
>
> A quick search for "map reduce cassandra" on this list brings up a
> recent conversation:
> http://mail-archives.apache.org/mod_mbox/cassandra-user/201407.mbox/%3CCAAX2xq6UhsGfq_gtfjogOV7%3DMi8q%3D5SmRfNM1%2BKFEXXVk%2Bp8iw%40mail.gmail.com%3E
>
> If you prefer to use Spark, you can try the Datastax Spark Cassandra
> connector: https://github.com/datastax/spark-cassandra-connector. This
> should let you run Spark jobs that read data from and write data back
> to Cassandra.
>
> Cheers,
> James
>
> Web: http://ferry.opencore.io
> Twitter: @open_core_io
>
> On Aug 30, 2014, at 4:02 PM, Adaryl Bob Wakefield, MBA <adaryl.wakefi...@hotmail.com> wrote:
>
> Yes, I remember this conversation. That was when I was first
> stepping into this stuff.
> My current understanding is:
>
> Storm = stream and micro-batch
> Spark = batch and micro-batch
>
> Micro-batching is what gets you to exactly-once processing semantics.
> I'm clear on that. What I'm not clear on is how and where processing
> takes place.
>
> I also get the fact that Spark is a faster execution engine than
> MapReduce. But we have Tez now; except, as far as I know, that's not
> useful here because my data isn't in HDFS. People seem to be talking
> quite a bit about Mahout and the Spark shell, but I'd really like to
> get this done with a minimum amount of software: either Storm or
> Spark, but not both.
>
> Trident-ML isn't distributed, which is fine because I'm not trying to
> do learning on the stream. For now, I'm just trying to do learning in
> batch and then update parameters as suggested earlier.
>
> Let me simplify the question: how do I do distributed machine learning
> when my data is in Cassandra and not HDFS? I haven't totally explored
> Mahout yet, but a lot of its algorithms run on MapReduce, which is fine
> for now. As I understand it, though, MapReduce works on data in HDFS,
> correct?
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>
> *From:* Shahab Yunus <shahab.yu...@gmail.com>
> *Sent:* Saturday, August 30, 2014 11:23 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Machine Learning With Cassandra
>
> Spark is not storage; rather, it is a distributed processing framework
> meant to run on big data (a very high-level intro/definition). It
> provides batched, in-memory map/reduce-style jobs. It is not purely
> streaming like Storm; instead it batches collections of tuples, and
> thus you can run complex ML algorithms relatively fast.
>
> I think we discussed this a short while ago, when a similar question
> (Storm vs. Spark, I think) was raised by you.
> Here is the link for that discussion:
> http://markmail.org/message/lc4icuw4hobul6oh
>
> Regards,
> Shahab
>
> On Sat, Aug 30, 2014 at 12:16 PM, Adaryl "Bob" Wakefield, MBA <adaryl.wakefi...@hotmail.com> wrote:
>
>> Isn't it a bit overkill to use Storm and Spark in the architecture?
>> You say load it "into" Spark. Is Spark separate storage?
>>
>> B.
>>
>> *From:* Alex Kamil <alex.ka...@gmail.com>
>> *Sent:* Friday, August 29, 2014 10:46 PM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: Machine Learning With Cassandra
>>
>> Adaryl,
>>
>> Most ML algorithms are based on some form of numerical optimization,
>> using something like online gradient descent
>> <http://en.wikipedia.org/wiki/Stochastic_gradient_descent> or
>> conjugate gradient
>> <http://www.math.buffalo.edu/~pitman/courses/cor502/odes/node4.html>
>> (e.g. in SVM classifiers). In its simplest form it is a nested FOR
>> loop where on each iteration you update the weights or parameters of
>> the model until reaching some convergence threshold that minimizes
>> the prediction error (usually the goal is to minimize a loss function
>> <http://en.wikipedia.org/wiki/Loss_function>, as in the popular least
>> squares <http://en.wikipedia.org/wiki/Least_squares> technique). You
>> could parallelize this loop using a brute-force divide-and-conquer
>> approach: map a chunk of data to each node and compute a partial sum
>> there, then aggregate the results from each node into a global sum in
>> a 'reduce' stage, and repeat this map-reduce cycle until convergence.
>> You can look up distributed gradient descent
>> <http://scholar.google.com/scholar?hl=en&q=gradient+descent+with+map-reduc>
>> or check out Mahout
>> <https://mahout.apache.org/users/recommender/matrix-factorization.html>
>> or Spark MLlib <https://spark.apache.org/docs/latest/mllib-guide.html>
>> for examples.
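The map-reduce gradient descent loop Alex describes can be sketched in plain Python. This is a single-process simulation for illustration only: each entry in `partitions` stands in for the data chunk held by one node, `partial_gradient` is the 'map' stage, and the summation is the 'reduce' stage.

```python
# Sketch of distributed batch gradient descent for least-squares
# linear regression (y ~ w*x + b). Each "partition" simulates the data
# chunk on one node; in a real cluster the partial gradients would be
# computed remotely and only the sums shipped back.

def partial_gradient(w, b, chunk):
    """'Map' stage: gradient of the squared error over one node's chunk."""
    gw = gb = 0.0
    for x, y in chunk:
        err = (w * x + b) - y
        gw += err * x
        gb += err
    return gw, gb

def distributed_gd(partitions, lr=0.05, iterations=2000):
    n = sum(len(p) for p in partitions)
    w = b = 0.0
    for _ in range(iterations):
        # 'Reduce' stage: aggregate per-node partial gradients into a
        # global gradient, then take one descent step.
        grads = [partial_gradient(w, b, p) for p in partitions]
        gw = sum(g[0] for g in grads) / n
        gb = sum(g[1] for g in grads) / n
        w -= lr * gw
        b -= lr * gb
    return w, b

if __name__ == "__main__":
    # Two "nodes", each holding half of a noiseless y = 2x + 1 dataset.
    partitions = [
        [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)],
        [(3.0, 7.0), (4.0, 9.0), (5.0, 11.0)],
    ]
    w, b = distributed_gd(partitions)
    print(round(w, 2), round(b, 2))  # converges toward w=2, b=1
```

The same shape maps directly onto a Spark job: the partitions become an RDD, `partial_gradient` runs inside `mapPartitions`, and the sum is a `reduce`.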
>> Alternatively, you can use something like GraphLab
>> <http://graphlab.com/products/create/docs/graphlab.toolkits.recommender.html>.
>>
>> Cassandra can serve as the data store from which you load the
>> training data, e.g. into Spark using this connector
>> <https://github.com/datastax/spark-cassandra-connector>, and then
>> train the model using MLlib or Mahout (it has Spark bindings, I
>> believe). Once you have trained the model, you can save the
>> parameters back in Cassandra. The next stage is using the model to
>> classify new data, e.g. recommending similar items based on a log of
>> new purchases; there you could once again use Spark, or Storm with
>> something like Trident-ML <https://github.com/pmerienne/trident-ml>.
>>
>> Alex
>>
>> On Fri, Aug 29, 2014 at 10:24 PM, Adaryl "Bob" Wakefield, MBA <adaryl.wakefi...@hotmail.com> wrote:
>>
>>> I'm planning to speak at a local meetup and I need to know if what
>>> I have in my head is even possible.
>>>
>>> I want to give an example of working with data in Cassandra. I have
>>> data coming in through Kafka and Storm, and I'm saving it off to
>>> Cassandra (this is only on paper at this point). I then want to run
>>> an ML algorithm over the data. My problem here is that while my data
>>> is distributed, I don't know how to do the analysis in a distributed
>>> manner. I could certainly use R, but processing the data on a single
>>> machine would seem to defeat the purpose of all this scalability.
>>>
>>> What is my solution?
>>> B.
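The train-in-batch / save-parameters / score-new-data cycle suggested in the thread can be sketched without any cluster at all. A plain dict stands in for the Cassandra table that would hold the model parameters; the table name ("models") and model name ("price_model") are made up for illustration, and in production the save/load calls would be CQL INSERT/SELECT statements.

```python
# Batch-train a least-squares model, persist its parameters, then score
# new records against the stored model.

def train_least_squares(data):
    """Closed-form simple linear regression: fit y ~ w*x + b."""
    n = len(data)
    sx = sum(x for x, _ in data)
    sy = sum(y for _, y in data)
    sxx = sum(x * x for x, _ in data)
    sxy = sum(x * y for x, y in data)
    w = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - w * sx) / n
    return {"w": w, "b": b}

# Stand-in for a Cassandra "models" table keyed by model name.
model_store = {}

def save_model(name, params):
    model_store[name] = params      # would be: INSERT INTO models ...

def score(name, x):
    p = model_store[name]           # would be: SELECT ... FROM models ...
    return p["w"] * x + p["b"]

if __name__ == "__main__":
    # Batch stage: train on historical data (here exactly y = 2x + 1)
    # and persist the parameters.
    batch = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
    save_model("price_model", train_least_squares(batch))
    # Serving stage: score incoming records with the stored model.
    print(score("price_model", 10.0))  # prints 21.0
```

Re-running the batch stage periodically and overwriting the stored parameters gives the "train in batch, then update parameters" workflow discussed above; the serving stage could just as well live inside a Storm bolt reading the parameters from Cassandra.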