sharding vs what cassandra does

2015-01-19 Thread Adaryl Bob Wakefield, MBA
It’s my understanding that the way Cassandra replicates data across nodes is 
NOT sharding. Can someone provide a better explanation or correct my 
understanding?
B.

installing cassandra

2014-12-20 Thread Adaryl Bob Wakefield, MBA
I have a three-node cluster that I’m using to learn how to work with distributed 
software. There is this thing called Puppet that helps you with deploying 
software. Can/should I use Puppet to install Cassandra on my cluster, or is 
there some sort of built-in, network-wide deployment in the install process 
already?

B.

how wide can wide rows get?

2014-11-13 Thread Adaryl Bob Wakefield, MBA
I’m struggling with this wide row business. Is there an upper limit on the 
number of columns you can have?

Adaryl Bob Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData 

Re: Machine Learning With Cassandra

2014-08-30 Thread Adaryl Bob Wakefield, MBA
Isn’t it a bit overkill to use both Storm and Spark in the architecture? You say 
load it “into” Spark. Is Spark separate storage?

B.

From: Alex Kamil 
Sent: Friday, August 29, 2014 10:46 PM
To: user@cassandra.apache.org 
Subject: Re: Machine Learning With Cassandra

Adaryl, 

Most ML algorithms are based on some form of numerical optimization, using 
something like online gradient descent or conjugate gradient (e.g. in SVM 
classifiers). In its simplest form it is a nested FOR loop where, on each 
iteration, you update the weights or parameters of the model until reaching some 
convergence threshold that minimizes the prediction error (usually the goal is 
to minimize a loss function, as in the popular least-squares technique). You 
could parallelize this loop using a brute-force divide-and-conquer approach: 
map a chunk of data to each node and compute a partial sum there, then 
aggregate the results from each node into a global sum in a 'reduce' stage, 
and repeat this map-reduce cycle until convergence. You can look up 
distributed gradient descent or check out Mahout or Spark MLlib for examples. 
Alternatively you can use something like GraphLab.
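
To make that loop concrete, here is a minimal sketch of the per-iteration 
map/reduce cycle on Spark's RDD API. It is illustrative only, assuming a 
least-squares loss and an existing SparkContext:

import org.apache.spark.rdd.RDD

// Batch gradient descent for least squares, one map/reduce cycle per iteration.
// data: (features, label) pairs already distributed across the cluster.
def train(data: RDD[(Array[Double], Double)],
          dims: Int, lr: Double, iters: Int): Array[Double] = {
  var w = Array.fill(dims)(0.0)
  val n = data.count().toDouble
  for (_ <- 1 to iters) {
    val wLocal = w  // snapshot of the current weights, shipped to each node
    // Map stage: each node computes per-example gradients on its chunk.
    // Reduce stage: partial sums are aggregated into one global gradient.
    val grad = data.map { case (x, y) =>
      val pred = wLocal.zip(x).map { case (wi, xi) => wi * xi }.sum
      x.map(xi => (pred - y) * xi)
    }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
    w = w.zip(grad).map { case (wi, gi) => wi - lr * gi / n }  // update step
  }
  w
}

In practice you would stop on a convergence threshold rather than a fixed 
iteration count; that is exactly the stopping condition described above.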

Cassandra can serve as the data store from which you load the training data, 
e.g. into Spark using this connector, and then train the model using MLlib or 
Mahout (it has Spark bindings, I believe). Once you've trained the model, you 
could save the parameters back to Cassandra. The next stage is using the model 
to classify new data, e.g. recommending similar items based on a log of new 
purchases; there you could once again use Spark or Storm with something like 
this.
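
As a rough sketch of that pipeline (the keyspace, table, and column names 
below are invented, and it assumes a SparkContext sc set up with the 
spark-cassandra-connector plus the Spark 1.x MLlib API):

import com.datastax.spark.connector._
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Load training rows from Cassandra into an RDD of labeled points.
val training = sc.cassandraTable("ml", "training_data").map { row =>
  LabeledPoint(row.getDouble("label"),
               Vectors.dense(row.getDouble("f1"), row.getDouble("f2")))
}.cache()

// Train with MLlib's distributed stochastic gradient descent.
val model = LinearRegressionWithSGD.train(training, 100)  // 100 iterations

// Save the learned parameters back to Cassandra for the scoring stage.
sc.parallelize(model.weights.toArray.zipWithIndex.map(_.swap))
  .saveToCassandra("ml", "model_params", SomeColumns("param_index", "value"))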

Alex

Re: Machine Learning With Cassandra

2014-08-30 Thread Adaryl Bob Wakefield, MBA
Yes, I remember this conversation. That was when I was first stepping into 
this stuff. My current understanding is:
Storm = stream and micro-batch
Spark = batch and micro-batch

Micro-batching is what gets you to exactly-once processing semantics. I’m clear 
on that. What I’m not clear on is how and where the processing takes place.

I also get that Spark is a faster execution engine than MapReduce. But we have 
Tez now... except, as far as I know, that’s not useful here because my data 
isn’t in HDFS. People seem to be talking quite a bit about Mahout and the Spark 
Shell, but I’d really like to get this done with a minimum amount of software: 
either Storm or Spark, but not both.

Trident ML isn’t distributed, which is fine because I’m not trying to do 
learning on the stream. For now, I’m just trying to do learning in batch and 
then update the parameters as suggested earlier.

Let me simplify the question: how do I do distributed machine learning when my 
data is in Cassandra and not HDFS? I haven’t totally explored Mahout yet, but a 
lot of its algorithms run on MapReduce, which is fine for now. As I understand 
it, though, MapReduce works on data in HDFS, correct?

Adaryl Bob Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Shahab Yunus 
Sent: Saturday, August 30, 2014 11:23 AM
To: user@cassandra.apache.org 
Subject: Re: Machine Learning With Cassandra

Spark is not storage; rather, it is a streaming framework meant to run on big 
data in a distributed architecture (a very high-level intro/definition). It 
provides a batched version of in-memory map/reduce-like jobs. It is not 
completely streaming like Storm; instead it batches collections of tuples, and 
thus you can run complex ML algorithms relatively fast.
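
As a toy illustration of that micro-batch model (a hypothetical word count; 
the socket source, host, and port are made up), Spark Streaming chops the 
incoming stream into short batches and runs map/reduce-style operations on 
each one:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("microbatch-demo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(2))  // 2-second micro-batches

ssc.socketTextStream("localhost", 9999)  // assumed text source
   .flatMap(_.split(" "))
   .map(word => (word, 1))
   .reduceByKey(_ + _)                   // aggregation within each micro-batch
   .print()

ssc.start()
ssc.awaitTermination()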

I think we discussed this a short while ago, when a similar question (Storm 
vs. Spark, I think) was raised by you. Here is the link to that discussion:
http://markmail.org/message/lc4icuw4hobul6oh



Regards,
Shahab




Re: Machine Learning With Cassandra

2014-08-30 Thread Adaryl Bob Wakefield, MBA
Ahh, thanks. Yeah, my searches for “machine learning with Cassandra” were not 
turning up much useful stuff.

Adaryl Bob Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: James Horey 
Sent: Saturday, August 30, 2014 3:34 PM
To: user@cassandra.apache.org 
Subject: Re: Machine Learning With Cassandra

If you want distributed machine learning, you can use either Mahout (which runs 
on Hadoop) or Spark (MLlib). If you choose the Hadoop route, DataStax provides a 
connector (CFS) to interact with data stored in Cassandra. Otherwise you can 
try to use the Cassandra InputFormat (not as simple, but plenty of people use 
it). 

A quick search for “map reduce cassandra” on this list brings up a recent 
conversation: 
http://mail-archives.apache.org/mod_mbox/cassandra-user/201407.mbox/%3CCAAX2xq6UhsGfq_gtfjogOV7%3DMi8q%3D5SmRfNM1%2BKFEXXVk%2Bp8iw%40mail.gmail.com%3E
 

If you prefer to use Spark, you can try the DataStax Cassandra connector: 
https://github.com/datastax/spark-cassandra-connector. This should let you run 
Spark jobs that read data from and write data to Cassandra. 
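
For example, a minimal setup might look like this (the host and table names 
are placeholders; spark.cassandra.connection.host is the connector setting 
that points Spark at your cluster):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

val conf = new SparkConf()
  .setAppName("cassandra-spark-demo")
  .set("spark.cassandra.connection.host", "10.0.0.1")  // any Cassandra node
val sc = new SparkContext(conf)

// Read a table as an RDD; the connector handles data locality and paging.
val rows = sc.cassandraTable("ml", "training_data")
println(s"training rows: ${rows.count()}")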

Cheers, 
James

Web: http://ferry.opencore.io 
Twitter: @open_core_io

Machine Learning With Cassandra

2014-08-29 Thread Adaryl Bob Wakefield, MBA
I’m planning to speak at a local meet-up and I need to know if what I have in 
my head is even possible.
I want to give an example of working with data in Cassandra. I have data coming 
in through Kafka and Storm and I’m saving it off to Cassandra (this is only on 
paper at this point). I then want to run an ML algorithm over the data. My 
problem here is that, while my data is distributed, I don’t know how to do the 
analysis in a distributed manner. I could certainly use R, but processing the 
data on a single machine would seem to defeat the purpose of all this 
scalability.
What is my solution?
B.