There are other machine learning frameworks that scale better than Hadoop +
Mahout. If the kind of machine learning you're doing is really large and
speed matters, take a look at Vowpal Wabbit:

http://hunch.net/~vw/
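As a rough illustration (the file names and features below are made up, not
from any real dataset), VW's input format and a basic train/predict run look
like this:

# train.vw -- one example per line: label | feature[:value] ...
1 | word_count:3 has_link:1
-1 | word_count:12 has_link:0

# learn a logistic model and save it to model.vw
vw train.vw --loss_function logistic -f model.vw

# score held-out data with the saved model
vw test.vw -t -i model.vw -p predictions.txt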




On Sat, Aug 30, 2014 at 4:58 PM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefi...@hotmail.com> wrote:

>   Ahh, thanks. Yeah, my searches for “machine learning with Cassandra”
> were not turning up much useful stuff.
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>
>  *From:* James Horey <j...@opencore.io>
> *Sent:* Saturday, August 30, 2014 3:34 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Machine Learning With Cassandra
>
>  If you want distributed machine learning, you can use either Mahout
> (runs on Hadoop) or Spark (MLlib). If you choose the Hadoop route, Datastax
> provides CFS, an HDFS-compatible file system backed by Cassandra, so Hadoop
> jobs can interact with data stored in Cassandra. Otherwise you can try the
> Cassandra InputFormat (not as simple, but plenty of people use it).
>
> A quick search for “map reduce cassandra” on this list brings up a recent
> conversation:
> http://mail-archives.apache.org/mod_mbox/cassandra-user/201407.mbox/%3CCAAX2xq6UhsGfq_gtfjogOV7%3DMi8q%3D5SmRfNM1%2BKFEXXVk%2Bp8iw%40mail.gmail.com%3E
>
>
> If you prefer to use Spark, you can try the Datastax Cassandra connector:
> https://github.com/datastax/spark-cassandra-connector. This should let
> you run Spark jobs that read data from and write results back to Cassandra.
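>
> As a minimal sketch of what that looks like in Scala (the keyspace, table,
> and column names here are hypothetical, not from a real schema):
>
> import org.apache.spark.{SparkConf, SparkContext}
> import com.datastax.spark.connector._  // adds cassandraTable / saveToCassandra
>
> val conf = new SparkConf()
>   .setAppName("cassandra-ml-demo")
>   .set("spark.cassandra.connection.host", "127.0.0.1")  // a Cassandra node
> val sc = new SparkContext(conf)
>
> // read rows from a hypothetical demo_ks.sensor_data table into an RDD
> val rows = sc.cassandraTable("demo_ks", "sensor_data")
> println(rows.count())
>
> // write derived values back to another hypothetical table
> rows.map(r => (r.getString("sensor_id"), r.getDouble("reading") * 2))
>   .saveToCassandra("demo_ks", "scaled_readings", SomeColumns("sensor_id", "reading"))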
>
> Cheers,
> James
>
> Web: http://ferry.opencore.io
> Twitter: @open_core_io
>
>  On Aug 30, 2014, at 4:02 PM, Adaryl Bob Wakefield, MBA <
> adaryl.wakefi...@hotmail.com> wrote:
>
>   Yes I remember this conversation. That was when I was just first
> stepping into this stuff. My current understanding is:
> Storm = stream and micro-batch
> Spark = batch and micro-batch
>
> Micro-batching is what gets you exactly-once processing semantics. I’m
> clear on that. What I’m not clear on is how and where the processing takes
> place.
>
> I also get the fact that Spark is a faster execution engine than
> MapReduce. But we have Tez now... except, as far as I know, that’s not
> useful here because my data isn’t in HDFS. People seem to be talking quite
> a bit about Mahout and the Spark shell, but I’d really like to get this
> done with a minimum amount of software; either Storm or Spark, but not both.
>
> Trident ML isn’t distributed, which is fine because I’m not trying to do
> learning on the stream. For now, I’m just trying to do learning in batch
> and then update parameters as suggested earlier.
>
> Let me simplify the question: how do I do distributed machine learning
> when my data is in Cassandra and not HDFS? I haven’t totally explored
> Mahout yet, but a lot of the algorithms run on MapReduce, which is fine
> for now. As I understand it though, MapReduce works on data in HDFS,
> correct?
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>
>  *From:* Shahab Yunus <shahab.yu...@gmail.com>
> *Sent:* Saturday, August 30, 2014 11:23 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Machine Learning With Cassandra
>
>  Spark is not storage; rather, it is a distributed processing framework
> meant to run over big data (a very high-level intro/definition). It
> provides batched versions of in-memory map/reduce-like jobs. It is not
> fully streaming like Storm; instead it collects tuples into small batches,
> which lets you run complex ML algorithms relatively fast.
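>
> A tiny sketch of that micro-batch model in Spark Streaming, in Scala (the
> socket source and one-second interval are placeholders, not a
> recommendation):
>
> import org.apache.spark.SparkConf
> import org.apache.spark.streaming.{Seconds, StreamingContext}
>
> val conf = new SparkConf().setAppName("microbatch-demo")
> // the stream is processed as a sequence of small RDDs, one per second here
> val ssc = new StreamingContext(conf, Seconds(1))
>
> val lines = ssc.socketTextStream("localhost", 9999)  // placeholder source
> lines.map(_.length).print()  // runs once per one-second micro-batch
>
> ssc.start()
> ssc.awaitTermination()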
>
> I think we just discussed this a short while ago, when a similar question
> (Storm vs. Spark, I think) was raised by you. Here is the link to that
> discussion:
> http://markmail.org/message/lc4icuw4hobul6oh
>
>
> Regards,
> Shahab
>
>
> On Sat, Aug 30, 2014 at 12:16 PM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefi...@hotmail.com> wrote:
>
>>   Isn’t it a bit overkill to use both Storm and Spark in the
>> architecture? You say load it “into” Spark. Is Spark separate storage?
>>
>> B.
>>
>>  *From:* Alex Kamil <alex.ka...@gmail.com>
>> *Sent:* Friday, August 29, 2014 10:46 PM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: Machine Learning With Cassandra
>>
>>  Adaryl,
>>
>> most ML algorithms are based on some form of numerical optimization,
>> using something like online gradient descent
>> <http://en.wikipedia.org/wiki/Stochastic_gradient_descent> or conjugate
>> gradient
>> <http://www.math.buffalo.edu/~pitman/courses/cor502/odes/node4.html>
>> (e.g. in SVM classifiers). In its simplest form it is a nested FOR loop
>> where, on each iteration, you update the weights or parameters of the
>> model until reaching some convergence threshold that minimizes the
>> prediction error (usually the goal is to minimize a loss function
>> <http://en.wikipedia.org/wiki/Loss_function>, as in the popular least
>> squares <http://en.wikipedia.org/wiki/Least_squares> technique). You
>> could parallelize this loop with a brute-force divide-and-conquer
>> approach: map a chunk of data to each node, compute a partial sum there,
>> aggregate the results from each node into a global sum in a 'reduce'
>> stage, and repeat this map-reduce cycle until convergence. You can look
>> up distributed gradient descent
>> <http://scholar.google.com/scholar?hl=en&q=gradient+descent+with+map-reduc>
>> or check out Mahout
>> <https://mahout.apache.org/users/recommender/matrix-factorization.html>
>> or Spark MLlib <https://spark.apache.org/docs/latest/mllib-guide.html>
>> for examples. Alternatively you can use something like GraphLab
>> <http://graphlab.com/products/create/docs/graphlab.toolkits.recommender.html>.
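>>
>> To make that map-reduce loop concrete, here is a rough Scala/Spark sketch
>> of one distributed gradient step for least squares (the Point layout and
>> learning rate eta are assumptions for illustration):
>>
>> import org.apache.spark.rdd.RDD
>>
>> case class Point(features: Array[Double], label: Double)
>>
>> def dot(w: Array[Double], x: Array[Double]): Double =
>>   (w, x).zipped.map(_ * _).sum
>>
>> // One pass of batch gradient descent for least squares. Each point
>> // contributes the gradient (w.x - y) * x; the 'map' stage computes these
>> // per-point terms, 'reduce' sums them into a global gradient, and we
>> // finish with the update w := w - eta * grad.
>> def step(data: RDD[Point], w: Array[Double], eta: Double): Array[Double] = {
>>   val grad = data
>>     .map(p => p.features.map(_ * (dot(w, p.features) - p.label)))
>>     .reduce((a, b) => (a, b).zipped.map(_ + _))
>>   (w, grad).zipped.map((wi, gi) => wi - eta * gi)
>> }
>>
>> // loop step() until the change in w (or the loss) falls below a threshold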
>>
>> Cassandra can serve as the data store from which you load the training
>> data, e.g. into Spark using this connector
>> <https://github.com/datastax/spark-cassandra-connector>, and then train
>> the model using MLlib or Mahout (it has Spark bindings, I believe). Once
>> you’ve trained the model, you can save its parameters back to Cassandra.
>> The next stage is using the model to classify new data, e.g. recommending
>> similar items based on a log of new purchases; there you could once again
>> use Spark, or Storm with something like this
>> <https://github.com/pmerienne/trident-ml>.
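>>
>> A hedged end-to-end sketch of that pipeline in Scala (the keyspace, table,
>> and column names are invented, and sc is a SparkContext configured with
>> the connector as in the earlier sketch):
>>
>> import com.datastax.spark.connector._
>> import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
>> import org.apache.spark.mllib.linalg.Vectors
>> import org.apache.spark.mllib.regression.LabeledPoint
>>
>> // 1. load training rows from a hypothetical ml_ks.purchases table
>> val training = sc.cassandraTable("ml_ks", "purchases").map { r =>
>>   LabeledPoint(r.getDouble("label"),
>>     Vectors.dense(r.getDouble("f1"), r.getDouble("f2")))
>> }
>>
>> // 2. train with MLlib (100 iterations is an arbitrary choice)
>> val model = LogisticRegressionWithSGD.train(training, 100)
>>
>> // 3. save the learned weights back to a hypothetical parameters table
>> sc.parallelize(model.weights.toArray.zipWithIndex)
>>   .map { case (value, idx) => ("model_v1", idx, value) }
>>   .saveToCassandra("ml_ks", "model_params",
>>     SomeColumns("model_id", "param_idx", "value"))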
>>
>> Alex
>>
>>
>>
>>
>> On Fri, Aug 29, 2014 at 10:24 PM, Adaryl "Bob" Wakefield, MBA <
>> adaryl.wakefi...@hotmail.com> wrote:
>>
>>>   I’m planning to speak at a local meet-up and I need to know if what I
>>> have in my head is even possible.
>>>
>>>  I want to give an example of working with data in Cassandra. I have
>>> data coming in through Kafka and Storm, and I’m saving it off to
>>> Cassandra (this is only on paper at this point). I then want to run an ML
>>> algorithm over the data. My problem is that, while my data is
>>> distributed, I don’t know how to do the analysis in a distributed manner.
>>> I could certainly use R, but processing the data on a single machine
>>> would seem to defeat the purpose of all this scalability.
>>>
>>>  What is my solution?
>>>  B.
>>>
>>
>>
>
>
>
>
