Re: Machine Learning With Cassandra

2014-08-30 Thread Adaryl Bob Wakefield, MBA
Isn’t it a bit overkill to use both Storm and Spark in the architecture? You say 
load it “into” Spark. Is Spark separate storage?

B.

From: Alex Kamil 
Sent: Friday, August 29, 2014 10:46 PM
To: user@cassandra.apache.org 
Subject: Re: Machine Learning With Cassandra

Adaryl, 

most ML algorithms are based on some form of numerical optimization, using 
something like online gradient descent 
(http://en.wikipedia.org/wiki/Stochastic_gradient_descent) or conjugate gradient 
(http://www.math.buffalo.edu/~pitman/courses/cor502/odes/node4.html), e.g. in 
SVM classifiers. In its simplest form it is a nested FOR loop where on each 
iteration you update the weights or parameters of the model until reaching some 
convergence threshold that minimizes the prediction error (usually the goal is 
to minimize a loss function (http://en.wikipedia.org/wiki/Loss_function), as in 
the popular least squares (http://en.wikipedia.org/wiki/Least_squares) 
technique). You could parallelize this loop using a brute-force 
divide-and-conquer approach: map a chunk of data to each node and compute a 
partial sum there, then aggregate the results from each node into a global sum 
in a 'reduce' stage, and repeat this map-reduce cycle until convergence. You 
can look up distributed gradient descent, or check out Mahout 
(https://mahout.apache.org/users/recommender/matrix-factorization.html) or 
Spark MLlib (https://spark.apache.org/docs/latest/mllib-guide.html) for 
examples. Alternatively you can use something like GraphLab 
(http://graphlab.com/products/create/docs/graphlab.toolkits.recommender.html).
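
A minimal sketch of that map-reduce cycle, written against the Spark RDD API 
with a toy in-memory dataset (the names, data, learning rate, and iteration 
count are all illustrative assumptions, not from the thread): each map step 
computes partial gradients over a chunk of the data, the reduce step aggregates 
them into a global gradient, and the driver updates the weights.

import org.apache.spark.{SparkConf, SparkContext}

object DistributedGradientDescent {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dgd").setMaster("local[*]"))

    // Toy training set of (features, label) pairs, cached since we loop over it.
    val data = sc.parallelize(Seq(
      (Array(1.0, 2.0), 5.0),
      (Array(2.0, 1.0), 4.0),
      (Array(3.0, 3.0), 9.0))).cache()
    val n = data.count().toDouble

    var w = Array(0.0, 0.0)  // model weights for a linear model
    val lr = 0.05            // learning rate
    for (iter <- 1 to 100) { // or loop until the loss change falls below a threshold
      // Map: each chunk computes partial gradients of the squared loss;
      // Reduce: the partial sums are aggregated into a global gradient.
      val grad = data.map { case (x, y) =>
        val err = x.zip(w).map { case (xi, wi) => xi * wi }.sum - y
        x.map(_ * err)
      }.reduce((a, b) => a.zip(b).map { case (p, q) => p + q })

      // Update the weights with the averaged global gradient.
      w = w.zip(grad).map { case (wi, gi) => wi - lr * gi / n }
    }
    println(w.mkString(", "))
    sc.stop()
  }
}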

Cassandra can serve as the data store from which you load the training data, 
e.g. into Spark using this connector 
(https://github.com/datastax/spark-cassandra-connector), and then train the 
model using MLlib or Mahout (which has Spark bindings, I believe). Once you 
have trained the model, you can save the parameters back into Cassandra. The 
next stage is using the model to classify new data, e.g. recommending similar 
items based on a log of new purchases; there you could once again use Spark, or 
Storm with something like this (https://github.com/pmerienne/trident-ml).
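
Put together, that pipeline might look something like the sketch below. It 
assumes the spark-cassandra-connector is on the classpath, plus hypothetical 
tables ml.training (label double, f1 double, f2 double) and ml.model_params 
(model_id text, weights list<double>); all names are invented for the example.

import com.datastax.spark.connector._
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.{SparkConf, SparkContext}

object TrainFromCassandra {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("train-from-cassandra")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Load training rows from Cassandra and map them to MLlib's LabeledPoint.
    val training = sc.cassandraTable("ml", "training").map { row =>
      LabeledPoint(row.getDouble("label"),
        Vectors.dense(row.getDouble("f1"), row.getDouble("f2")))
    }.cache()

    // Train with MLlib; the gradient descent runs distributed across the cluster.
    val model = LinearRegressionWithSGD.train(training, 100)

    // Persist the learned parameters back into Cassandra for the serving stage.
    sc.parallelize(Seq(("linreg-v1", model.weights.toArray.toList)))
      .saveToCassandra("ml", "model_params", SomeColumns("model_id", "weights"))

    sc.stop()
  }
}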

Alex





On Fri, Aug 29, 2014 at 10:24 PM, Adaryl Bob Wakefield, MBA 
adaryl.wakefi...@hotmail.com wrote:

  I’m planning to speak at a local meet-up and I need to know if what I have in 
my head is even possible.
  I want to give an example of working with data in Cassandra. I have data 
coming in through Kafka and Storm and I’m saving it off to Cassandra (this is 
only on paper at this point). I then want to run an ML algorithm over the data. 
My problem here is, while my data is distributed, I don’t know how to do the 
analysis in a distributed manner. I could certainly use R but processing the 
data on a single machine would seem to defeat the purpose of all this 
scalability.
  What is my solution?
  B.


Re: Machine Learning With Cassandra

2014-08-30 Thread Shahab Yunus
Spark is not storage; rather, it is a distributed processing framework meant to
run on big data (a very high-level intro/definition). It provides a batched
version of in-memory map/reduce-like jobs. It is not completely streaming like
Storm; rather it batches up collections of tuples, and thus you can run complex
ML algorithms relatively quickly.
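
For a feel of that micro-batch model, here is a minimal Spark Streaming sketch 
(the socket source, port, and 5-second interval are illustrative assumptions): 
each interval's tuples arrive together as one RDD, on which ordinary 
batch-style map/reduce runs, rather than tuple-at-a-time as in Storm.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("micro-batch").setMaster("local[2]")
    // Each 5-second interval becomes one RDD (a micro-batch).
    val ssc = new StreamingContext(conf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source
    // A plain map/reduce word count runs on every micro-batch.
    lines.flatMap(_.split("\\s+"))
      .map(w => (w, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}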

I think we discussed this a short while ago, when a similar question (Storm
vs. Spark, I think) was raised by you. Here is the link to that
discussion:
http://markmail.org/message/lc4icuw4hobul6oh


Regards,
Shahab



Re: Machine Learning With Cassandra

2014-08-30 Thread Adaryl Bob Wakefield, MBA
Yes, I remember this conversation. That was when I was first stepping into 
this stuff. My current understanding is:
Storm = stream and micro-batch
Spark = batch and micro-batch

Micro-batching is what gets you exactly-once processing semantics. I’m clear 
on that. What I’m not clear on is how and where the processing takes place.

I also get the fact that Spark is a faster execution engine than MapReduce. But 
we have Tez now... except, as far as I know, that’s not useful here because my 
data isn’t in HDFS. People seem to be talking quite a bit about Mahout and the 
Spark shell, but I’d really like to get this done with a minimum amount of 
software; either Storm or Spark, but not both.

Trident-ML isn’t distributed, which is fine because I’m not trying to do 
learning on the stream. For now, I’m just trying to do learning in batch and 
then update the parameters as suggested earlier.

Let me simplify the question: how do I do distributed machine learning when my 
data is in Cassandra and not HDFS? I haven’t totally explored Mahout yet, but a 
lot of its algorithms run on MapReduce, which is fine for now. As I understand 
it, though, MapReduce works on data in HDFS, correct?

Adaryl Bob Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData




Re: Machine Learning With Cassandra

2014-08-30 Thread James Horey
If you want distributed machine learning, you can use either Mahout (runs on 
Hadoop) or Spark (MLlib). If you choose the Hadoop route, DataStax provides a 
connector (CFS) to interact with data stored in Cassandra. Otherwise you can 
try to use the Cassandra InputFormat (not as simple, but plenty of people use 
it). 

A quick search for “map reduce cassandra” on this list brings up a recent 
conversation: 
http://mail-archives.apache.org/mod_mbox/cassandra-user/201407.mbox/%3CCAAX2xq6UhsGfq_gtfjogOV7%3DMi8q%3D5SmRfNM1%2BKFEXXVk%2Bp8iw%40mail.gmail.com%3E
 

If you prefer to use Spark, you can try the DataStax Cassandra connector: 
https://github.com/datastax/spark-cassandra-connector. This should let you run 
Spark jobs that read data from and write data back to Cassandra. 
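
As a minimal illustration of that read/write path (the keyspace, tables, and 
columns below are invented for the example), a job that counts purchases per 
item and writes the totals back might look like:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object ToFromCassandra {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf()
      .setAppName("to-from-cassandra")
      .set("spark.cassandra.connection.host", "127.0.0.1"))

    // Read rows from a hypothetical shop.purchases table, aggregate,
    // and write the per-item totals back into shop.item_counts.
    sc.cassandraTable("shop", "purchases")
      .map(row => (row.getString("item_id"), 1L))
      .reduceByKey(_ + _)
      .saveToCassandra("shop", "item_counts", SomeColumns("item_id", "purchases"))

    sc.stop()
  }
}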

Cheers, 
James

Web: http://ferry.opencore.io
Twitter: @open_core_io


Re: Machine Learning With Cassandra

2014-08-30 Thread Adaryl Bob Wakefield, MBA
Ahh, thanks. Yeah, my searches for “machine learning with Cassandra” were not 
turning up much useful stuff.

Adaryl Bob Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData


Re: Machine Learning With Cassandra

2014-08-30 Thread Peter Lin
There are other machine learning frameworks that scale better than Hadoop +
Mahout:

http://hunch.net/~vw/

If the kind of machine learning you're doing is really large-scale and speed
matters, take a look at Vowpal Wabbit.





Help with migration from Thrift to CQL3 on Cassandra 2.0.10

2014-08-30 Thread Todd Nine
Hi all,
  I'm working on transferring our Thrift DAOs over to CQL.  It's going
well, except for 2 cases that both use multiget.  The use case is very
simple.  It is a narrow row, by design, with only a few columns.  When I
perform a multiget, I need to read up to 1k rows at a time.  I do not want
to turn these into a wide row using scopeId and scopeType as the row key.


On the physical level, my Column Family needs something similar to the
following format.


scopeId, scopeType, nodeId, nodeType :{ timestamp: 0x00 }


I've defined my table with the following CQL.


CREATE TABLE IF NOT EXISTS Graph_Marked_Nodes (
    scopeId uuid,
    scopeType varchar,
    nodeId uuid,
    nodeType varchar,
    timestamp bigint,
    PRIMARY KEY ((scopeId, scopeType, nodeId, nodeType))
) WITH caching = 'all';


This works well for inserts, deletes, and single reads.  I always know the
scopeId, scopeType, nodeId, and nodeType, and I want to return the timestamp
columns.  I thought I could use the IN operator and specify the pairs of
nodeId and nodeType values I have as input; however, this doesn't work.

Can anyone give me a suggestion on how to perform a multiget when I have
several values for the nodeId and the nodeType?  This read occurs on every
read of edges, so making 1k round trips is not going to work from a
performance perspective.

Below is the query I've tried.

SELECT timestamp FROM  Graph_Marked_Nodes WHERE scopeId = ? AND scopeType =
? AND nodeId IN (uuid1, uuid2, uuid3) AND nodeType IN ('foo','bar')

I've found this issue, which looks like it's a solution to my problem.

https://issues.apache.org/jira/browse/CASSANDRA-6875

However, I'm not able to get the syntax in the issue description to work
either.  Any input would be appreciated!

Cassandra: 2.0.10
Datastax Driver: 2.1.0

Thanks,
Todd
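
Until something like CASSANDRA-6875 is available, one common workaround is to 
fan the point reads out as asynchronous queries against a prepared statement 
and gather the futures, so the 1k reads run in parallel rather than as 
sequential round trips. A sketch in Scala against the DataStax Java driver 2.1 
(the contact point, keyspace, and input values are placeholders):

import com.datastax.driver.core.{Cluster, ResultSetFuture}
import com.google.common.util.concurrent.Futures
import java.util.UUID
import scala.collection.JavaConverters._

object MultigetSketch {
  def main(args: Array[String]): Unit = {
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect("my_keyspace") // hypothetical keyspace

    // One prepared point read; the full partition key is bound per query.
    val stmt = session.prepare(
      "SELECT timestamp FROM Graph_Marked_Nodes " +
      "WHERE scopeId = ? AND scopeType = ? AND nodeId = ? AND nodeType = ?")

    val scopeId = UUID.randomUUID() // stand-ins for real input
    val scopeType = "source"
    val nodes: Seq[(UUID, String)] =
      Seq((UUID.randomUUID(), "foo"), (UUID.randomUUID(), "bar"))

    // Fan the reads out asynchronously instead of issuing them one by one.
    val futures: Seq[ResultSetFuture] = nodes.map { case (nodeId, nodeType) =>
      session.executeAsync(stmt.bind(scopeId, scopeType, nodeId, nodeType))
    }

    // Block once for all results (Guava ships with the driver).
    val results = Futures.allAsList(futures.asJava).get()
    results.asScala.foreach { rs =>
      val row = rs.one()
      if (row != null) println(row.getLong("timestamp"))
    }

    cluster.close()
  }
}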


Re: Heterogenous cluster and vnodes

2014-08-30 Thread Ben Bromhead

 Hey,
 
 I have a few VM host (bare metal) machines with varying amounts of free 
 hard drive space on them. For simplicity let’s say I have three machines like 
 so:
  * Machine 1
   - Harddrive 1: 150 GB available.
  * Machine 2:
   - Harddrive 1: 150 GB available.
   - Harddrive 2: 150 GB available.
  * Machine 3.
   - Harddrive 1: 150 GB available.
 
 I am setting up a Cassandra cluster between them and as I see it I have two 
 options:
 
 1. I set up one Cassandra node/VM per bare metal machine. I assign all free 
 hard drive space to each Cassandra node and balance the cluster using vnodes, 
 proportionally to the amount of free hard drive space (CPU/RAM is not 
 going to be a bottleneck here).
 
 2. I set up four VMs, each running a Cassandra node with an equal amount of 
 hard drive space and an equal number of vnodes. Machine 2 runs two VMs.

This setup will potentially create a situation where, if Machine 2 goes down, 
you may lose two replicas, as the two VMs on Machine 2 might hold replicas of 
the same key.

 
 General question: Is any of these preferable to the other? I understand 1) 
 yields lower high-availability (since nodes are on the same hardware).

Other way around (2 would potentially be lower availability)… Cassandra thinks 
two of the VMs are separate when they in fact rely on the same underlying 
machine.

 
 Question about alternative 1: With varying vnodes, can I always be sure that 
 replicas are never put on the same virtual machine?

Yes… mostly https://issues.apache.org/jira/browse/CASSANDRA-4123

 Or is varying vnodes really only useful/recommended when migrating from 
 machines with varying hardware (like mentioned in [1])?

Changing the number of vnodes changes the portion of the ring a node is 
responsible for. You can use it to account for different types of hardware; you 
can also use it to create awesome situations like hotspots if you aren't 
careful… ymmv.
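
Concretely, relative capacity is expressed through num_tokens in each node's 
cassandra.yaml. An illustrative sketch for option 1, assuming the default 256 
tokens as the baseline for a 150 GB node:

# cassandra.yaml on Machine 1 and Machine 3 (150 GB each)
num_tokens: 256

# cassandra.yaml on Machine 2 (300 GB across two drives)
num_tokens: 512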

At the end of the day I would throw out the extra hard drive / not use it / put 
more hard drives in the other machines. Why? Hard drives are cheap and your 
time as an admin for the cluster isn't. If you do add more hard drives, you can 
also split the commit log etc. onto different disks.

I would take fewer problems over trying to wring every last scrap of 
performance out of the available hardware any day of the year. 


Ben Bromhead
Instaclustr | www.instaclustr.com | @instaclustr | +61 415 936 359



Scala driver

2014-08-30 Thread Gary Zhao
Hi

Could you recommend a Scala driver and share your experiences using it?
I’m thinking of using the Java driver in Scala directly.

Thanks
Thanks