Re: batch_size_warn_threshold_in_kb

2014-12-16 Thread Eric Stevens
 the driver determine what degree of batching and 
 asynchronous
 operation is appropriate.

 It might also be nice to have an inquiry for the cluster as to
 what batch size is most optimal for the cluster, like number of 
 mutations
 in a batch and number of simultaneous connections, and to have that 
 be
 dynamic based on overall cluster load.

 I would also note that the example in the spec has multiple
 inserts with different partition key values, which flies in the face 
 of the
 admonition to refrain from using server-side distribution of 
 requests.

 At a minimum the CQL spec should make a more clear statement of
 intent and non-intent for BATCH.

 -- Jack Krupansky

  *From:* Jonathan Haddad j...@jonhaddad.com
 *Sent:* Friday, December 12, 2014 12:58 PM
 *To:* user@cassandra.apache.org ; Ryan Svihla
 rsvi...@datastax.com
 *Subject:* Re: batch_size_warn_threshold_in_kb

 The really important thing to really take away from Ryan's
 original post is that batches are not there for performance.  The 
 only case
 I consider batches to be useful for is when you absolutely need to 
 know
 that several tables all get a mutation (via logged batches).  The 
 use case
 for this is when you've got multiple tables that are serving as 
 different
 views for data.  It is absolutely not going to help you if you're 
 trying to
 lump queries together to reduce network & server overhead - in fact 
 it'll
 do the opposite.  If you're trying to do that, instead perform many 
 async
 queries.  The overhead of batches in cassandra is significant and 
 you're
 going to hit a lot of problems if you use them excessively (timeouts 
 /
 failures).

 tl;dr: you probably don't want batch, you most likely want many
 async calls

 On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller 
 moham...@glassbeam.com wrote:

  Ryan,

 Thanks for the quick response.



 I did see that jira before posting my question on this list.
 However, I didn’t see any information about why 5kb+ data will cause
 instability. 5kb or even 50kb seems too small. For example, if each
 mutation is 1000+ bytes, then with just 5 mutations, you will hit 
 that
 threshold.



 In addition, Patrick is saying that he does not recommend more
 than 100 mutations per batch. So why not warn users just on the # of
 mutations in a batch?



 Mohammed



 *From:* Ryan Svihla [mailto:rsvi...@datastax.com]
 *Sent:* Thursday, December 11, 2014 12:56 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: batch_size_warn_threshold_in_kb



 Nothing magic, just put in there based on experience. You can
 find the story behind the original recommendation here



 https://issues.apache.org/jira/browse/CASSANDRA-6487



 Key reasoning for the desire comes from Patrick McFadden:


 Yes that was in bytes. Just in my own experience, I don't
 recommend more than ~100 mutations per batch. Doing some quick math 
 I came
 up with 5k as 100 x 50 byte mutations.

 Totally up for debate.
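[Editor's note: Patrick's back-of-envelope math is easy to check. A minimal sketch, assuming his rough figures of ~100 mutations per batch at ~50 bytes each (estimates from the discussion, not measured values):

```python
# Back-of-envelope math behind the 5k default:
# ~100 mutations per batch at ~50 bytes per mutation (both numbers are
# rough estimates from the thread, not measured values).
mutations_per_batch = 100
bytes_per_mutation = 50

batch_size_bytes = mutations_per_batch * bytes_per_mutation
batch_size_kb = batch_size_bytes / 1024  # the yaml threshold is in kilobytes

print(batch_size_bytes)          # 5000 bytes, i.e. roughly the 5 KB default
print(round(batch_size_kb, 2))   # 4.88
```

Note that "5k" here is 5000 bytes, slightly under 5 KB (5120 bytes); the threshold is an order-of-magnitude guardrail, not a precise limit.]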



 It's totally changeable, however, it's there in no small part
 because so many people confuse the BATCH keyword as a performance
 optimization, this helps flag those cases of misuse.



 On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller 
 moham...@glassbeam.com wrote:

 Hi –

 The cassandra.yaml file has a property called
 *batch_size_warn_threshold_in_kb*.

 The default size is 5kb and according to the comments in the
 yaml file, it is used to log WARN on any batch size exceeding this 
 value in
 kilobytes. It says caution should be taken on increasing the size 
 of this
 threshold as it can lead to node instability.



 Does anybody know the significance of this magic number 5kb? Why
 would a higher number (say 10kb) lead to node instability?



 Mohammed




 --

 Ryan Svihla

 Solution Architect

 http://www.datastax.com/
 https://twitter.com/foundev
 http://www.linkedin.com/pub/ryan-svihla/12/621/727/

 DataStax is the fastest, most scalable distributed database
 technology, delivering Apache Cassandra to the world's most innovative
 enterprises. DataStax is built to be agile, always-on, and predictably
 scalable to any size. With more than 500 customers in 45 countries, DataStax
 is the database technology and transactional backbone of choice for the
 world's most innovative companies such as Netflix, Adobe, Intuit, and eBay.






Re: batch_size_warn_threshold_in_kb

2014-12-15 Thread Eric Stevens
...@datastax.com
 *Subject:* Re: batch_size_warn_threshold_in_kb

 The really important thing to really take away from Ryan's original
 post is that batches are not there for performance.  The only case I
 consider batches to be useful for is when you absolutely need to know 
 that
 several tables all get a mutation (via logged batches).  The use case 
 for
 this is when you've got multiple tables that are serving as different 
 views
 for data.  It is absolutely not going to help you if you're trying to 
 lump
 queries together to reduce network & server overhead - in fact it'll 
 do the
 opposite.  If you're trying to do that, instead perform many async
 queries.  The overhead of batches in cassandra is significant and 
 you're
 going to hit a lot of problems if you use them excessively (timeouts /
 failures).

 tl;dr: you probably don't want batch, you most likely want many
 async calls

 On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller 
 moham...@glassbeam.com wrote:

  Ryan,

 Thanks for the quick response.



 I did see that jira before posting my question on this list.
 However, I didn’t see any information about why 5kb+ data will cause
 instability. 5kb or even 50kb seems too small. For example, if each
 mutation is 1000+ bytes, then with just 5 mutations, you will hit that
 threshold.



 In addition, Patrick is saying that he does not recommend more
 than 100 mutations per batch. So why not warn users just on the # of
 mutations in a batch?



 Mohammed



 *From:* Ryan Svihla [mailto:rsvi...@datastax.com]
 *Sent:* Thursday, December 11, 2014 12:56 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: batch_size_warn_threshold_in_kb



 Nothing magic, just put in there based on experience. You can find
 the story behind the original recommendation here



 https://issues.apache.org/jira/browse/CASSANDRA-6487



 Key reasoning for the desire comes from Patrick McFadden:


 Yes that was in bytes. Just in my own experience, I don't
 recommend more than ~100 mutations per batch. Doing some quick math I 
 came
 up with 5k as 100 x 50 byte mutations.

 Totally up for debate.



 It's totally changeable, however, it's there in no small part
 because so many people confuse the BATCH keyword as a performance
 optimization, this helps flag those cases of misuse.



 On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller 
 moham...@glassbeam.com wrote:

 Hi –

 The cassandra.yaml file has a property called
 *batch_size_warn_threshold_in_kb*.

 The default size is 5kb and according to the comments in the yaml
 file, it is used to log WARN on any batch size exceeding this value in
 kilobytes. It says caution should be taken on increasing the size of 
 this
 threshold as it can lead to node instability.



 Does anybody know the significance of this magic number 5kb? Why
 would a higher number (say 10kb) lead to node instability?



 Mohammed










Re: batch_size_warn_threshold_in_kb

2014-12-15 Thread Jonathan Haddad
 with 
 your
 statement.

 See:
 https://cassandra.apache.org/doc/cql3/CQL.html

 I see the spec as gospel – if it’s not accurate, let’s propose a
 change to make it accurate.

 The DataStax CQL doc is more nuanced: “Batching multiple
 statements can save network exchanges between the client/server and 
 server
 coordinator/replicas. However, because of the distributed nature of
 Cassandra, spread requests across nearby nodes as much as possible to
 optimize performance. Using batches to optimize performance is 
 usually not
 successful, as described in Using and misusing batches section. For
 information about the fastest way to load data, see Cassandra: Batch
 loading without the Batch keyword.”

 Maybe what we really need is a “client/driver-side batch”, which
 is simply a way to collect “batches” of operations in the 
 client/driver and
 then let the driver determine what degree of batching and asynchronous
 operation is appropriate.

 It might also be nice to have an inquiry for the cluster as to
 what batch size is most optimal for the cluster, like number of 
 mutations
 in a batch and number of simultaneous connections, and to have that be
 dynamic based on overall cluster load.

 I would also note that the example in the spec has multiple
 inserts with different partition key values, which flies in the face 
 of the
 admonition to refrain from using server-side distribution of 
 requests.

 At a minimum the CQL spec should make a more clear statement of
 intent and non-intent for BATCH.

 -- Jack Krupansky

  *From:* Jonathan Haddad j...@jonhaddad.com
 *Sent:* Friday, December 12, 2014 12:58 PM
 *To:* user@cassandra.apache.org ; Ryan Svihla
 rsvi...@datastax.com
 *Subject:* Re: batch_size_warn_threshold_in_kb

 The really important thing to really take away from Ryan's
 original post is that batches are not there for performance.  The 
 only case
 I consider batches to be useful for is when you absolutely need to 
 know
 that several tables all get a mutation (via logged batches).  The use 
 case
 for this is when you've got multiple tables that are serving as 
 different
 views for data.  It is absolutely not going to help you if you're 
 trying to
 lump queries together to reduce network & server overhead - in fact 
 it'll
 do the opposite.  If you're trying to do that, instead perform many 
 async
 queries.  The overhead of batches in cassandra is significant and 
 you're
 going to hit a lot of problems if you use them excessively (timeouts /
 failures).

 tl;dr: you probably don't want batch, you most likely want many
 async calls

 On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller 
 moham...@glassbeam.com wrote:

  Ryan,

 Thanks for the quick response.



 I did see that jira before posting my question on this list.
 However, I didn’t see any information about why 5kb+ data will cause
 instability. 5kb or even 50kb seems too small. For example, if each
 mutation is 1000+ bytes, then with just 5 mutations, you will hit 
 that
 threshold.



 In addition, Patrick is saying that he does not recommend more
 than 100 mutations per batch. So why not warn users just on the # of
 mutations in a batch?



 Mohammed



 *From:* Ryan Svihla [mailto:rsvi...@datastax.com]
 *Sent:* Thursday, December 11, 2014 12:56 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: batch_size_warn_threshold_in_kb



 Nothing magic, just put in there based on experience. You can
 find the story behind the original recommendation here



 https://issues.apache.org/jira/browse/CASSANDRA-6487



 Key reasoning for the desire comes from Patrick McFadden:


 Yes that was in bytes. Just in my own experience, I don't
 recommend more than ~100 mutations per batch. Doing some quick math 
 I came
 up with 5k as 100 x 50 byte mutations.

 Totally up for debate.



 It's totally changeable, however, it's there in no small part
 because so many people confuse the BATCH keyword as a performance
 optimization, this helps flag those cases of misuse.



 On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller 
 moham...@glassbeam.com wrote:

 Hi –

 The cassandra.yaml file has a property called
 *batch_size_warn_threshold_in_kb*.

 The default size is 5kb and according to the comments in the yaml
 file, it is used to log WARN on any batch size exceeding this value 
 in
 kilobytes. It says caution should be taken on increasing the size of 
 this
 threshold as it can lead to node instability.



 Does anybody know the significance of this magic number 5kb? Why
 would a higher number (say 10kb) lead to node instability?



 Mohammed





Re: batch_size_warn_threshold_in_kb

2014-12-13 Thread Jack Krupansky
Jonathan and Ryan,

Jonathan says “It is absolutely not going to help you if you're trying to lump 
queries together to reduce network & server overhead - in fact it'll do the 
opposite”, but I would note that the CQL3 spec says “The BATCH statement ... 
serves several purposes: 1. It saves network round-trips between the client and 
the server (and sometimes between the server coordinator and the replicas) when 
batching multiple updates.” Is the spec inaccurate? I mean, it seems in 
conflict with your statement.

See:
https://cassandra.apache.org/doc/cql3/CQL.html

I see the spec as gospel – if it’s not accurate, let’s propose a change to make 
it accurate.

The DataStax CQL doc is more nuanced: “Batching multiple statements can save 
network exchanges between the client/server and server coordinator/replicas. 
However, because of the distributed nature of Cassandra, spread requests across 
nearby nodes as much as possible to optimize performance. Using batches to 
optimize performance is usually not successful, as described in Using and 
misusing batches section. For information about the fastest way to load data, 
see Cassandra: Batch loading without the Batch keyword.”

Maybe what we really need is a “client/driver-side batch”, which is simply a 
way to collect “batches” of operations in the client/driver and then let the 
driver determine what degree of batching and asynchronous operation is 
appropriate.

It might also be nice to have an inquiry for the cluster as to what batch size 
is most optimal for the cluster, like number of mutations in a batch and number 
of simultaneous connections, and to have that be dynamic based on overall 
cluster load.
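[Editor's note: a minimal sketch of what Jack's "client/driver-side batch" idea might look like: buffer mutations in the client, group them by partition key, and let the caller flush each group (e.g. one unlogged batch per partition, or many async statements). All names here are hypothetical illustrations; no real driver exposes this exact API.

```python
from collections import defaultdict

class ClientSideBatcher:
    """Hypothetical client-side batcher: collects mutations, groups them
    by partition key, and flushes groups once a threshold is reached.
    Illustrative only; real drivers expose nothing with this API."""

    def __init__(self, flush_threshold=100):
        self.flush_threshold = flush_threshold  # max mutations buffered
        self._pending = defaultdict(list)       # partition key -> mutations
        self._count = 0

    def add(self, partition_key, mutation):
        self._pending[partition_key].append(mutation)
        self._count += 1
        if self._count >= self.flush_threshold:
            return self.flush()
        return []

    def flush(self):
        """Return groups of mutations, one group per partition key."""
        groups = list(self._pending.items())
        self._pending.clear()
        self._count = 0
        return groups

batcher = ClientSideBatcher(flush_threshold=4)
batcher.add("a", "INSERT 1")
batcher.add("b", "INSERT 2")
batcher.add("a", "INSERT 3")
groups = batcher.add("b", "INSERT 4")  # hits the threshold, flushes
print(len(groups))  # 2 groups, one per partition key
```

Grouping by partition key is the point: each flushed group touches a single replica set, which avoids the coordinator fan-out problem discussed elsewhere in this thread.]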

I would also note that the example in the spec has multiple inserts with 
different partition key values, which flies in the face of the admonition to 
refrain from using server-side distribution of requests.

At a minimum the CQL spec should make a more clear statement of intent and 
non-intent for BATCH.

-- Jack Krupansky

From: Jonathan Haddad 
Sent: Friday, December 12, 2014 12:58 PM
To: user@cassandra.apache.org ; Ryan Svihla 
Subject: Re: batch_size_warn_threshold_in_kb

The really important thing to really take away from Ryan's original post is 
that batches are not there for performance.  The only case I consider batches 
to be useful for is when you absolutely need to know that several tables all 
get a mutation (via logged batches).  The use case for this is when you've got 
multiple tables that are serving as different views for data.  It is absolutely 
not going to help you if you're trying to lump queries together to reduce 
network & server overhead - in fact it'll do the opposite.  If you're trying to 
do that, instead perform many async queries.  The overhead of batches in 
cassandra is significant and you're going to hit a lot of problems if you use 
them excessively (timeouts / failures). 

tl;dr: you probably don't want batch, you most likely want many async calls
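[Editor's note: the "many async queries" pattern can be sketched with asyncio. The `execute` coroutine below is a stand-in for a driver's asynchronous execute call (the DataStax Python driver's equivalent is `session.execute_async`); it is stubbed out here so the sketch stays self-contained.

```python
import asyncio

async def execute(statement):
    """Stand-in for a driver's async execute; simulates a round trip."""
    await asyncio.sleep(0.001)
    return f"ok: {statement}"

async def write_many(statements):
    # Issue every statement concurrently instead of lumping them into a
    # BATCH: each single-partition write goes straight to its replicas,
    # and no single coordinator has to fan out to the whole cluster.
    return await asyncio.gather(*(execute(s) for s in statements))

statements = [f"INSERT ... partition {i}" for i in range(10)]
results = asyncio.run(write_many(statements))
print(len(results))  # 10
```

The concurrency level would normally be bounded (e.g. with a semaphore) so the client does not overwhelm the cluster; that detail is omitted here.]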


On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller moham...@glassbeam.com 
wrote:

  Ryan,

  Thanks for the quick response.



  I did see that jira before posting my question on this list. However, I 
didn’t see any information about why 5kb+ data will cause instability. 5kb or 
even 50kb seems too small. For example, if each mutation is 1000+ bytes, then 
with just 5 mutations, you will hit that threshold. 



  In addition, Patrick is saying that he does not recommend more than 100 
mutations per batch. So why not warn users just on the # of mutations in a 
batch?



  Mohammed



  From: Ryan Svihla [mailto:rsvi...@datastax.com] 
  Sent: Thursday, December 11, 2014 12:56 PM
  To: user@cassandra.apache.org
  Subject: Re: batch_size_warn_threshold_in_kb



  Nothing magic, just put in there based on experience. You can find the story 
behind the original recommendation here



  https://issues.apache.org/jira/browse/CASSANDRA-6487



  Key reasoning for the desire comes from Patrick McFadden:


  Yes that was in bytes. Just in my own experience, I don't recommend more 
than ~100 mutations per batch. Doing some quick math I came up with 5k as 100 x 
50 byte mutations.

  Totally up for debate.



  It's totally changeable, however, it's there in no small part because so many 
people confuse the BATCH keyword as a performance optimization, this helps flag 
those cases of misuse.



  On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller moham...@glassbeam.com 
wrote:

  Hi – 

  The cassandra.yaml file has a property called batch_size_warn_threshold_in_kb.

  The default size is 5kb and according to the comments in the yaml file, it is 
used to log WARN on any batch size exceeding this value in kilobytes. It says 
caution should be taken on increasing the size of this threshold as it can lead 
to node instability.
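[Editor's note: for reference, the setting under discussion looks like this in cassandra.yaml. The default value shown matches the thread; the exact comment wording varies by Cassandra version.

```yaml
# Log a WARN message on any batch exceeding this size in kilobytes.
# Caution should be taken on increasing the size of this threshold,
# as it can lead to node instability.
batch_size_warn_threshold_in_kb: 5
```
]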



  Does anybody know the significance of this magic number 5kb? Why would a 
higher number (say 10kb) lead to node instability?



  Mohammed

Re: batch_size_warn_threshold_in_kb

2014-12-13 Thread Eric Stevens
test5 ((aid, bckt, end)) =
380,737,675,612

 Execution Results for 25 runs of 5 records =
25 runs of 50,000 records (3 protos, 5 agents, ~15 per bucket) in batches
of 100 using strategy parallel
Total Run Time
test3 ((aid, bckt), end, proto) reverse order=
20,971,045,814
test1 ((aid, bckt), proto, end) reverse order=
21,379,583,690
test4 ((aid, bckt), proto, end) no explicit ordering =
21,505,965,087
test2 ((aid, bckt), end) =
24,433,580,144
test5 ((aid, bckt, end)) =
37,346,062,553



On Fri Dec 12 2014 at 11:00:12 AM Jonathan Haddad j...@jonhaddad.com wrote:

 The really important thing to really take away from Ryan's original post
 is that batches are not there for performance.  The only case I consider
 batches to be useful for is when you absolutely need to know that several
 tables all get a mutation (via logged batches).  The use case for this is
 when you've got multiple tables that are serving as different views for
 data.  It is absolutely not going to help you if you're trying to lump
 queries together to reduce network & server overhead - in fact it'll do the
 opposite.  If you're trying to do that, instead perform many async
 queries.  The overhead of batches in cassandra is significant and you're
 going to hit a lot of problems if you use them excessively (timeouts /
 failures).

 tl;dr: you probably don't want batch, you most likely want many async calls


 On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller moham...@glassbeam.com
 wrote:

  Ryan,

 Thanks for the quick response.



 I did see that jira before posting my question on this list. However, I
 didn’t see any information about why 5kb+ data will cause instability. 5kb
 or even 50kb seems too small. For example, if each mutation is 1000+ bytes,
 then with just 5 mutations, you will hit that threshold.



 In addition, Patrick is saying that he does not recommend more than 100
 mutations per batch. So why not warn users just on the # of mutations in a
 batch?



 Mohammed



 *From:* Ryan Svihla [mailto:rsvi...@datastax.com]
 *Sent:* Thursday, December 11, 2014 12:56 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: batch_size_warn_threshold_in_kb



 Nothing magic, just put in there based on experience. You can find the
 story behind the original recommendation here



 https://issues.apache.org/jira/browse/CASSANDRA-6487



 Key reasoning for the desire comes from Patrick McFadden:


 Yes that was in bytes. Just in my own experience, I don't recommend more
 than ~100 mutations per batch. Doing some quick math I came up with 5k as
 100 x 50 byte mutations.

 Totally up for debate.



 It's totally changeable, however, it's there in no small part because so
 many people confuse the BATCH keyword as a performance optimization, this
 helps flag those cases of misuse.



 On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller moham...@glassbeam.com
 wrote:

 Hi –

 The cassandra.yaml file has a property called *batch_size_warn_threshold_in_kb*.

 The default size is 5kb and according to the comments in the yaml file,
 it is used to log WARN on any batch size exceeding this value in kilobytes.
 It says caution should be taken on increasing the size of this threshold as
 it can lead to node instability.



 Does anybody know the significance of this magic number 5kb? Why would a
 higher number (say 10kb) lead to node instability?



 Mohammed










Re: batch_size_warn_threshold_in_kb

2014-12-13 Thread Jonathan Haddad
There are cases where it can.  For instance, if you batch multiple
mutations to the same partition (and talk to a replica for that partition)
they can reduce network overhead because they're effectively a single
mutation in the eye of the cluster.  However, if you're not doing that (and
most people aren't!) you end up putting additional pressure on the
coordinator because now it has to talk to several other servers.  If you
have 100 servers, and perform a mutation on 100 partitions, you could have
a coordinator that's

1) talking to every machine in the cluster and
2) waiting on a response from a significant portion of them

before it can respond success or fail.  Any delay, from GC to a bad disk,
can affect the performance of the entire batch.
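[Editor's note: Jonathan's 100-servers/100-partitions scenario can be illustrated with a rough simulation. This is not real token-aware replica placement; it just picks a random replica set per partition and counts how many distinct nodes one coordinator would have to wait on.

```python
import random

def nodes_contacted(num_nodes, replication_factor, partitions, seed=42):
    """Rough illustration (not real token placement): pick a random
    replica set per partition and count the distinct nodes a coordinator
    must talk to for one batch touching all those partitions."""
    rng = random.Random(seed)
    contacted = set()
    for _ in range(partitions):
        contacted.update(rng.sample(range(num_nodes), replication_factor))
    return len(contacted)

# One batch of 100 partitions on a 100-node cluster (RF=3): the
# coordinator ends up waiting on most of the cluster.
print(nodes_contacted(100, 3, 100))   # close to 100

# A single-partition write touches only its RF replicas.
print(nodes_contacted(100, 3, 1))     # 3
```

Any one slow node in that large set (GC pause, bad disk) delays the whole batch, which is the instability the warning threshold tries to flag.]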

On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky j...@basetechnology.com
wrote:

   Jonathan and Ryan,

 Jonathan says “It is absolutely not going to help you if you're trying to
 lump queries together to reduce network & server overhead - in fact it'll
 do the opposite”, but I would note that the CQL3 spec says “The BATCH 
 statement
 ... serves several purposes: 1. It saves network round-trips between the
 client and the server (and sometimes between the server coordinator and the
 replicas) when batching multiple updates.” Is the spec inaccurate? I mean,
 it seems in conflict with your statement.

 See:
 https://cassandra.apache.org/doc/cql3/CQL.html

 I see the spec as gospel – if it’s not accurate, let’s propose a change to
 make it accurate.

 The DataStax CQL doc is more nuanced: “Batching multiple statements can
 save network exchanges between the client/server and server
 coordinator/replicas. However, because of the distributed nature of
 Cassandra, spread requests across nearby nodes as much as possible to
 optimize performance. Using batches to optimize performance is usually not
 successful, as described in Using and misusing batches section. For
 information about the fastest way to load data, see Cassandra: Batch
 loading without the Batch keyword.”

 Maybe what we really need is a “client/driver-side batch”, which is simply
 a way to collect “batches” of operations in the client/driver and then let
 the driver determine what degree of batching and asynchronous operation is
 appropriate.

 It might also be nice to have an inquiry for the cluster as to what batch
 size is most optimal for the cluster, like number of mutations in a batch
 and number of simultaneous connections, and to have that be dynamic based
 on overall cluster load.

 I would also note that the example in the spec has multiple inserts with
 different partition key values, which flies in the face of the admonition
 to refrain from using server-side distribution of requests.

 At a minimum the CQL spec should make a more clear statement of intent and
 non-intent for BATCH.

 -- Jack Krupansky

  *From:* Jonathan Haddad j...@jonhaddad.com
 *Sent:* Friday, December 12, 2014 12:58 PM
 *To:* user@cassandra.apache.org ; Ryan Svihla rsvi...@datastax.com
 *Subject:* Re: batch_size_warn_threshold_in_kb

 The really important thing to really take away from Ryan's original post
 is that batches are not there for performance.  The only case I consider
 batches to be useful for is when you absolutely need to know that several
 tables all get a mutation (via logged batches).  The use case for this is
 when you've got multiple tables that are serving as different views for
 data.  It is absolutely not going to help you if you're trying to lump
 queries together to reduce network & server overhead - in fact it'll do the
 opposite.  If you're trying to do that, instead perform many async
 queries.  The overhead of batches in cassandra is significant and you're
 going to hit a lot of problems if you use them excessively (timeouts /
 failures).

 tl;dr: you probably don't want batch, you most likely want many async calls

 On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller moham...@glassbeam.com
 wrote:

  Ryan,

 Thanks for the quick response.



 I did see that jira before posting my question on this list. However, I
 didn’t see any information about why 5kb+ data will cause instability. 5kb
 or even 50kb seems too small. For example, if each mutation is 1000+ bytes,
 then with just 5 mutations, you will hit that threshold.



 In addition, Patrick is saying that he does not recommend more than 100
 mutations per batch. So why not warn users just on the # of mutations in a
 batch?



 Mohammed



 *From:* Ryan Svihla [mailto:rsvi...@datastax.com]
 *Sent:* Thursday, December 11, 2014 12:56 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: batch_size_warn_threshold_in_kb



 Nothing magic, just put in there based on experience. You can find the
 story behind the original recommendation here



 https://issues.apache.org/jira/browse/CASSANDRA-6487



 Key reasoning for the desire comes from Patrick McFadden:


 Yes that was in bytes. Just in my own experience, I don't recommend more
 than ~100

Re: batch_size_warn_threshold_in_kb

2014-12-13 Thread Ryan Svihla
 strategy traverse
 Total Run Time
 test3 ((aid, bckt), end, proto) reverse order=
 9,633,141,094
 test4 ((aid, bckt), proto, end) no explicit ordering =
 12,519,381,544
 test2 ((aid, bckt), end) =
 12,653,843,637
 test1 ((aid, bckt), proto, end) reverse order=
 17,644,182,274
 test5 ((aid, bckt, end)) =
 27,902,501,534

  Execution Results for 25 runs of 5 records =
 25 runs of 50,000 records (3 protos, 5 agents, ~15 per bucket) as single
 statements using strategy parallel
 Total Run Time
 test1 ((aid, bckt), proto, end) reverse order=
 360,523,086,443
 test3 ((aid, bckt), end, proto) reverse order=
 364,375,212,413
 test4 ((aid, bckt), proto, end) no explicit ordering =
 370,989,615,452
 test2 ((aid, bckt), end) =
 378,368,728,469
 test5 ((aid, bckt, end)) =
 380,737,675,612

  Execution Results for 25 runs of 5 records =
 25 runs of 50,000 records (3 protos, 5 agents, ~15 per bucket) in batches
 of 100 using strategy parallel
 Total Run Time
 test3 ((aid, bckt), end, proto) reverse order=
 20,971,045,814
 test1 ((aid, bckt), proto, end) reverse order=
 21,379,583,690
 test4 ((aid, bckt), proto, end) no explicit ordering =
 21,505,965,087
 test2 ((aid, bckt), end) =
 24,433,580,144
 test5 ((aid, bckt, end)) =
 37,346,062,553



 On Fri Dec 12 2014 at 11:00:12 AM Jonathan Haddad j...@jonhaddad.com
 wrote:

 The really important thing to really take away from Ryan's original post
 is that batches are not there for performance.  The only case I consider
 batches to be useful for is when you absolutely need to know that several
 tables all get a mutation (via logged batches).  The use case for this is
 when you've got multiple tables that are serving as different views for
 data.  It is absolutely not going to help you if you're trying to lump
 queries together to reduce network & server overhead - in fact it'll do the
 opposite.  If you're trying to do that, instead perform many async
 queries.  The overhead of batches in cassandra is significant and you're
 going to hit a lot of problems if you use them excessively (timeouts /
 failures).

 tl;dr: you probably don't want batch, you most likely want many async
 calls


 On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller moham...@glassbeam.com
 wrote:

  Ryan,

 Thanks for the quick response.



 I did see that jira before posting my question on this list. However, I
 didn’t see any information about why 5kb+ data will cause instability. 5kb
 or even 50kb seems too small. For example, if each mutation is 1000+ bytes,
 then with just 5 mutations, you will hit that threshold.



 In addition, Patrick is saying that he does not recommend more than 100
 mutations per batch. So why not warn users just on the # of mutations in a
 batch?



 Mohammed



 *From:* Ryan Svihla [mailto:rsvi...@datastax.com]
 *Sent:* Thursday, December 11, 2014 12:56 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: batch_size_warn_threshold_in_kb



 Nothing magic, just put in there based on experience. You can find the
 story behind the original recommendation here



 https://issues.apache.org/jira/browse/CASSANDRA-6487



 Key reasoning for the desire comes from Patrick McFadden:


 Yes that was in bytes. Just in my own experience, I don't recommend
 more than ~100 mutations per batch. Doing some quick math I came up with 5k
 as 100 x 50 byte mutations.

 Totally up for debate.
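Patrick's back-of-envelope arithmetic, and the objection that a handful of large mutations crosses the same line, can both be checked directly (treating 1 kB as 1000 bytes, as the 100 x 50 estimate does):

```python
def batch_bytes(n_mutations, bytes_each):
    """Rough serialized size of a batch: mutation count times average size."""
    return n_mutations * bytes_each

# batch_size_warn_threshold_in_kb = 5, with 1 kB ~ 1000 bytes for this estimate.
WARN_THRESHOLD_BYTES = 5 * 1000

# Patrick's sizing: ~100 mutations of ~50 bytes sit right at the threshold.
assert batch_bytes(100, 50) == WARN_THRESHOLD_BYTES

# The flip side: five 1 kB mutations reach the same threshold just as fast.
assert batch_bytes(5, 1000) >= WARN_THRESHOLD_BYTES
```

This is why the warning is byte-based rather than mutation-count-based: it tracks total payload, whichever way you get there.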



 It's totally changeable; however, it's there in no small part because so
 many people mistake the BATCH keyword for a performance optimization, and
 the warning helps flag those cases of misuse.



 On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller moham...@glassbeam.com
 wrote:

 Hi –

 The cassandra.yaml file has a property called
 *batch_size_warn_threshold_in_kb*.

 The default size is 5 KB and, according to the comments in the yaml file,
 it is used to log a WARN on any batch whose size exceeds this value in
 kilobytes. It says caution should be taken when increasing this threshold,
 as it can lead to node instability.
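For reference, the setting as it appears in cassandra.yaml (the default value is from the thread; the comment wording here is paraphrased, not the file's exact text):

```yaml
# Log a WARN on any batch whose total serialized size exceeds this
# value. Raising it is discouraged, since large batches can lead to
# node instability.
batch_size_warn_threshold_in_kb: 5
```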



 Does anybody know the significance of this magic number 5kb? Why would a
 higher number (say 10kb) lead to node instability?



 Mohammed




 --

 [image: datastax_logo.png] http://www.datastax.com/

 Ryan Svihla

 Solution Architect


 [image: twitter.png] https://twitter.com/foundev[image: linkedin.png]
 http://www.linkedin.com/pub/ryan-svihla/12/621/727/



 DataStax is the fastest, most scalable distributed database technology,
 delivering Apache Cassandra to the world’s most innovative enterprises.
 DataStax is built to be agile, always-on, and predictably scalable to any
 size. With more than 500 customers in 45 countries

Re: batch_size_warn_threshold_in_kb

2014-12-13 Thread Ryan Svihla
) = 52,676,830,110
 test4 ((aid, bckt), proto, end) no explicit ordering = 54,096,838,258
 test5 ((aid, bckt, end)) = 54,657,464,976
 test3 ((aid, bckt), end, proto) reverse order = 55,668,202,827

  Execution Results for 25 runs of 5 records =
 25 runs of 50,000 records (3 protos, 5 agents, ~15 per bucket) in batches
 of 100 using strategy traverse
 Total Run Time
 test3 ((aid, bckt), end, proto) reverse order = 9,633,141,094
 test4 ((aid, bckt), proto, end) no explicit ordering = 12,519,381,544
 test2 ((aid, bckt), end) = 12,653,843,637
 test1 ((aid, bckt), proto, end) reverse order = 17,644,182,274
 test5 ((aid, bckt, end)) = 27,902,501,534

  Execution Results for 25 runs of 5 records =
 25 runs of 50,000 records (3 protos, 5 agents, ~15 per bucket) as single
 statements using strategy parallel
 Total Run Time
 test1 ((aid, bckt), proto, end) reverse order = 360,523,086,443
 test3 ((aid, bckt), end, proto) reverse order = 364,375,212,413
 test4 ((aid, bckt), proto, end) no explicit ordering = 370,989,615,452
 test2 ((aid, bckt), end) = 378,368,728,469
 test5 ((aid, bckt, end)) = 380,737,675,612

  Execution Results for 25 runs of 5 records =
 25 runs of 50,000 records (3 protos, 5 agents, ~15 per bucket) in batches
 of 100 using strategy parallel
 Total Run Time
 test3 ((aid, bckt), end, proto) reverse order = 20,971,045,814
 test1 ((aid, bckt), proto, end) reverse order = 21,379,583,690
 test4 ((aid, bckt), proto, end) no explicit ordering = 21,505,965,087
 test2 ((aid, bckt), end) = 24,433,580,144
 test5 ((aid, bckt, end)) = 37,346,062,553



 On Fri Dec 12 2014 at 11:00:12 AM Jonathan Haddad j...@jonhaddad.com
 wrote:

 The really important thing to take away from Ryan's original post is
 that batches are not there for performance.  The only case I consider
 batches to be useful for is when you absolutely need to know that several
 tables all get a mutation (via logged batches).  The use case for this is
 when you've got multiple tables that are serving as different views of the
 same data.  It is absolutely not going to help you if you're trying to lump
 queries together to reduce network & server overhead - in fact it'll do the
 opposite.  If you're trying to do that, instead perform many async
 queries.  The overhead of batches in Cassandra is significant and you're
 going to hit a lot of problems if you use them excessively (timeouts /
 failures).

 tl;dr: you probably don't want batch, you most likely want many async
 calls


 On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller 
 moham...@glassbeam.com wrote:

  Ryan,

 Thanks for the quick response.



 I did see that jira before posting my question on this list. However, I
 didn’t see any information about why 5kb+ data will cause instability. 5kb
 or even 50kb seems too small. For example, if each mutation is 1000+ bytes,
 then with just 5 mutations, you will hit that threshold.



 In addition, Patrick is saying that he does not recommend more than 100
 mutations per batch. So why not warn users just on the # of mutations in a
 batch?



 Mohammed



 *From:* Ryan Svihla [mailto:rsvi...@datastax.com]
 *Sent:* Thursday, December 11, 2014 12:56 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: batch_size_warn_threshold_in_kb



 Nothing magic, just put in there based on experience. You can find the
 story behind the original recommendation here



 https://issues.apache.org/jira/browse/CASSANDRA-6487



 Key reasoning for the desire comes from Patrick McFadden:


 Yes that was in bytes. Just in my own experience, I don't recommend
 more than ~100 mutations per batch. Doing some quick math I came up with 5k
 as 100 x 50 byte mutations.

 Totally up for debate.



 It's totally changeable; however, it's there in no small part because so
 many people mistake the BATCH keyword for a performance optimization, and
 the warning helps flag those cases of misuse.



 On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller 
 moham...@glassbeam.com wrote:

 Hi –

 The cassandra.yaml file has a property called
 *batch_size_warn_threshold_in_kb*.

 The default size is 5 KB and, according to the comments in the yaml file,
 it is used to log a WARN on any batch whose size exceeds this value in
 kilobytes. It says caution should be taken when increasing this threshold,
 as it can lead to node instability.



 Does anybody know the significance of this magic number 5kb? Why would
 a higher number (say 10kb) lead to node instability?



 Mohammed





Re: batch_size_warn_threshold_in_kb

2014-12-13 Thread Jonathan Haddad
To add to Ryan's (extremely valid!) point, your test works because the
coordinator is always a replica.  Try again using 20 (or 50) nodes.
Batching works great at RF=N=3 because it always gets to write to local and
talk to exactly 2 other servers on every request.  Consider what happens
when the coordinator needs to talk to 100 servers.  It's unnecessary
overhead on the server side.

To save network overhead, Cassandra 2.1 added support for response grouping
(see http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster)
which massively helps performance.  It provides the benefit of batches but
without the coordinator overhead.

Can you post your benchmark code?

On Sat Dec 13 2014 at 6:10:36 AM Jonathan Haddad j...@jonhaddad.com wrote:

 There are cases where it can.  For instance, if you batch multiple
 mutations to the same partition (and talk to a replica for that partition)
 they can reduce network overhead because they're effectively a single
 mutation in the eye of the cluster.  However, if you're not doing that (and
 most people aren't!) you end up putting additional pressure on the
 coordinator because now it has to talk to several other servers.  If you
 have 100 servers, and perform a mutation on 100 partitions, you could have
 a coordinator that's

 1) talking to every machine in the cluster and
 2) waiting on a response from a significant portion of them

 before it can respond success or fail.  Any delay, from GC to a bad disk,
 can affect the performance of the entire batch.
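Jonathan's fan-out argument can be illustrated with a toy ring (simple hash placement, not Cassandra's actual Murmur3 token map): count how many distinct nodes one coordinator must wait on when a single batch spans many partitions.

```python
import hashlib

def replicas(partition_key, num_nodes, rf):
    """Toy placement: hash the key onto a ring of num_nodes positions
    and take the next rf nodes clockwise as replicas."""
    start = int(hashlib.md5(partition_key.encode()).hexdigest(), 16) % num_nodes
    return {(start + i) % num_nodes for i in range(rf)}

def nodes_contacted(partition_keys, num_nodes, rf):
    """Distinct nodes a single coordinator has to wait on when one
    batch touches all of these partitions."""
    contacted = set()
    for key in partition_keys:
        contacted |= replicas(key, num_nodes, rf)
    return len(contacted)

keys = ["pk-%d" % i for i in range(100)]

# RF=N=3: every node replicates everything, so fan-out is capped at 3.
print(nodes_contacted(keys, num_nodes=3, rf=3))

# 100-node cluster, RF=3: one 100-partition batch drags in most of the ring.
print(nodes_contacted(keys, num_nodes=100, rf=3))
```

On the small cluster the batch looks cheap; on the large one a single coordinator ends up waiting on most of the ring, which is the hidden cost the thread is describing.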


 On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky j...@basetechnology.com
 wrote:

   Jonathan and Ryan,

 Jonathan says “It is absolutely not going to help you if you're trying to
 lump queries together to reduce network & server overhead - in fact it'll
 do the opposite”, but I would note that the CQL3 spec says “The BATCH
 statement ... serves several purposes: 1. It saves network round-trips
 between the client and the server (and sometimes between the server
 coordinator and the replicas) when batching multiple updates.” Is the spec
 inaccurate? I mean, it seems in conflict with your statement.

 See:
 https://cassandra.apache.org/doc/cql3/CQL.html

 I see the spec as gospel – if it’s not accurate, let’s propose a change
 to make it accurate.

 The DataStax CQL doc is more nuanced: “Batching multiple statements can
 save network exchanges between the client/server and server
 coordinator/replicas. However, because of the distributed nature of
 Cassandra, spread requests across nearby nodes as much as possible to
 optimize performance. Using batches to optimize performance is usually not
 successful, as described in Using and misusing batches section. For
 information about the fastest way to load data, see Cassandra: Batch
 loading without the Batch keyword.”

 Maybe what we really need is a “client/driver-side batch”, which is
 simply a way to collect “batches” of operations in the client/driver and
 then let the driver determine what degree of batching and asynchronous
 operation is appropriate.
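One possible shape for the "client/driver-side batch" Jack describes: collect statements client-side, then flush them as bounded-concurrency async calls. This is a hypothetical sketch with a stand-in execute function, not any real driver's API:

```python
from concurrent.futures import ThreadPoolExecutor

class ClientSideBatch:
    """Collect statements client-side, then let the client choose the
    degree of parallelism instead of sending one server-side BATCH.
    execute_fn is a stand-in for a real driver call."""

    def __init__(self, execute_fn, max_in_flight=32):
        self._execute = execute_fn
        self._pool = ThreadPoolExecutor(max_workers=max_in_flight)
        self._pending = []

    def add(self, statement):
        self._pending.append(statement)

    def flush(self):
        # Submit every pending statement as its own async request and
        # wait for all of them, bounding concurrency via the pool size.
        futures = [self._pool.submit(self._execute, s) for s in self._pending]
        self._pending = []
        return [f.result() for f in futures]

applied = []
batch = ClientSideBatch(lambda stmt: applied.append(stmt) or stmt)
for i in range(10):
    batch.add("INSERT INTO t (id) VALUES (%d)" % i)
results = batch.flush()
print(len(results))
```

A real driver could additionally group pending statements by partition (token awareness) and tune max_in_flight from observed cluster load, which is the dynamic behavior Jack is asking for.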

 It might also be nice to have an inquiry for the cluster as to what batch
 size is most optimal for the cluster, like number of mutations in a batch
 and number of simultaneous connections, and to have that be dynamic based
 on overall cluster load.

 I would also note that the example in the spec has multiple inserts with
 different partition key values, which flies in the face of the admonition
 to refrain from using server-side distribution of requests.

 At a minimum the CQL spec should make a more clear statement of intent
 and non-intent for BATCH.

 -- Jack Krupansky

  *From:* Jonathan Haddad j...@jonhaddad.com
 *Sent:* Friday, December 12, 2014 12:58 PM
 *To:* user@cassandra.apache.org ; Ryan Svihla rsvi...@datastax.com
 *Subject:* Re: batch_size_warn_threshold_in_kb

 The really important thing to take away from Ryan's original post is
 that batches are not there for performance.  The only case I consider
 batches to be useful for is when you absolutely need to know that several
 tables all get a mutation (via logged batches).  The use case for this is
 when you've got multiple tables that are serving as different views of the
 same data.  It is absolutely not going to help you if you're trying to lump
 queries together to reduce network & server overhead - in fact it'll do the
 opposite.  If you're trying to do that, instead perform many async
 queries.  The overhead of batches in Cassandra is significant and you're
 going to hit a lot of problems if you use them excessively (timeouts /
 failures).

 tl;dr: you probably don't want batch, you most likely want many async
 calls

 On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller moham...@glassbeam.com
 wrote:

  Ryan,

 Thanks for the quick response.



 I did see that jira before posting my question on this list. However, I
 didn’t see any information about why 5kb+ data will cause instability. 5kb
 or even 50kb seems too small. For example, if each

Re: batch_size_warn_threshold_in_kb

2014-12-13 Thread Ryan Svihla
Also, what happens when you turn on shuffle with token-aware routing?
http://www.datastax.com/drivers/java/2.1/com/datastax/driver/core/policies/TokenAwarePolicy.html
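In toy form, what shuffle changes: token-aware routing always picks a replica as coordinator, and shuffling picks among all replicas instead of always the first. This uses a simple hash placement for illustration, not the driver's real TokenAwarePolicy internals:

```python
import hashlib
import random

def replica_nodes(partition_key, num_nodes, rf=3):
    """Toy placement: first replica by hash, then rf-1 clockwise neighbors."""
    start = int(hashlib.md5(partition_key.encode()).hexdigest(), 16) % num_nodes
    return [(start + i) % num_nodes for i in range(rf)]

def pick_coordinator(partition_key, num_nodes, shuffle=False, rng=random):
    """Token-aware routing: the coordinator is always one of the replicas.
    Without shuffle the first replica takes every request for a hot key;
    with shuffle the load spreads across all replicas."""
    reps = replica_nodes(partition_key, num_nodes)
    return rng.choice(reps) if shuffle else reps[0]

rng = random.Random(42)
hot_key = "pk-hot"
no_shuffle = {pick_coordinator(hot_key, 100) for _ in range(1000)}
shuffled = {pick_coordinator(hot_key, 100, shuffle=True, rng=rng) for _ in range(1000)}

print(len(no_shuffle))  # one node takes every request for the hot key
print(len(shuffled))    # the replica set shares the coordination load
```

As Eric notes later in the thread, shuffle changes nothing when every node is already a replica (RF=N), since every choice lands on a replica either way.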

On Sat, Dec 13, 2014 at 8:21 AM, Jonathan Haddad j...@jonhaddad.com wrote:

 To add to Ryan's (extremely valid!) point, your test works because the
 coordinator is always a replica.  Try again using 20 (or 50) nodes.
 Batching works great at RF=N=3 because it always gets to write to local and
 talk to exactly 2 other servers on every request.  Consider what happens
 when the coordinator needs to talk to 100 servers.  It's unnecessary
 overhead on the server side.

 To save network overhead, Cassandra 2.1 added support for response
 grouping (see
 http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster) which
 massively helps performance.  It provides the benefit of batches but
 without the coordinator overhead.

 Can you post your benchmark code?

 On Sat Dec 13 2014 at 6:10:36 AM Jonathan Haddad j...@jonhaddad.com
 wrote:

 There are cases where it can.  For instance, if you batch multiple
 mutations to the same partition (and talk to a replica for that partition)
 they can reduce network overhead because they're effectively a single
 mutation in the eye of the cluster.  However, if you're not doing that (and
 most people aren't!) you end up putting additional pressure on the
 coordinator because now it has to talk to several other servers.  If you
 have 100 servers, and perform a mutation on 100 partitions, you could have
 a coordinator that's

 1) talking to every machine in the cluster and
 2) waiting on a response from a significant portion of them

 before it can respond success or fail.  Any delay, from GC to a bad disk,
 can affect the performance of the entire batch.


 On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky j...@basetechnology.com
 wrote:

   Jonathan and Ryan,

 Jonathan says “It is absolutely not going to help you if you're trying
 to lump queries together to reduce network & server overhead - in fact
 it'll do the opposite”, but I would note that the CQL3 spec says “The
 BATCH statement ... serves several purposes: 1. It saves network
 round-trips between the client and the server (and sometimes between the
 server coordinator and the replicas) when batching multiple updates.” Is
 the spec inaccurate? I mean, it seems in conflict with your statement.

 See:
 https://cassandra.apache.org/doc/cql3/CQL.html

 I see the spec as gospel – if it’s not accurate, let’s propose a change
 to make it accurate.

 The DataStax CQL doc is more nuanced: “Batching multiple statements can
 save network exchanges between the client/server and server
 coordinator/replicas. However, because of the distributed nature of
 Cassandra, spread requests across nearby nodes as much as possible to
 optimize performance. Using batches to optimize performance is usually not
 successful, as described in Using and misusing batches section. For
 information about the fastest way to load data, see Cassandra: Batch
 loading without the Batch keyword.”

 Maybe what we really need is a “client/driver-side batch”, which is
 simply a way to collect “batches” of operations in the client/driver and
 then let the driver determine what degree of batching and asynchronous
 operation is appropriate.

 It might also be nice to have an inquiry for the cluster as to what
 batch size is most optimal for the cluster, like number of mutations in a
 batch and number of simultaneous connections, and to have that be dynamic
 based on overall cluster load.

 I would also note that the example in the spec has multiple inserts with
 different partition key values, which flies in the face of the admonition
 to refrain from using server-side distribution of requests.

 At a minimum the CQL spec should make a more clear statement of intent
 and non-intent for BATCH.

 -- Jack Krupansky

  *From:* Jonathan Haddad j...@jonhaddad.com
 *Sent:* Friday, December 12, 2014 12:58 PM
 *To:* user@cassandra.apache.org ; Ryan Svihla rsvi...@datastax.com
 *Subject:* Re: batch_size_warn_threshold_in_kb

 The really important thing to take away from Ryan's original post is
 that batches are not there for performance.  The only case I consider
 batches to be useful for is when you absolutely need to know that several
 tables all get a mutation (via logged batches).  The use case for this is
 when you've got multiple tables that are serving as different views of the
 same data.  It is absolutely not going to help you if you're trying to lump
 queries together to reduce network & server overhead - in fact it'll do the
 opposite.  If you're trying to do that, instead perform many async
 queries.  The overhead of batches in Cassandra is significant and you're
 going to hit a lot of problems if you use them excessively (timeouts /
 failures).

 tl;dr: you probably don't want batch, you most likely want many async
 calls

 On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller 
 moham...@glassbeam.com

Re: batch_size_warn_threshold_in_kb

2014-12-13 Thread Eric Stevens
:33 AM Jack Krupansky j...@basetechnology.com
 wrote:

   Jonathan and Ryan,

 Jonathan says “It is absolutely not going to help you if you're trying
 to lump queries together to reduce network & server overhead - in fact
 it'll do the opposite”, but I would note that the CQL3 spec says “The
 BATCH statement ... serves several purposes: 1. It saves network
 round-trips between the client and the server (and sometimes between the
 server coordinator and the replicas) when batching multiple updates.” Is
 the spec inaccurate? I mean, it seems in conflict with your statement.

 See:
 https://cassandra.apache.org/doc/cql3/CQL.html

 I see the spec as gospel – if it’s not accurate, let’s propose a change
 to make it accurate.

 The DataStax CQL doc is more nuanced: “Batching multiple statements can
 save network exchanges between the client/server and server
 coordinator/replicas. However, because of the distributed nature of
 Cassandra, spread requests across nearby nodes as much as possible to
 optimize performance. Using batches to optimize performance is usually not
 successful, as described in Using and misusing batches section. For
 information about the fastest way to load data, see Cassandra: Batch
 loading without the Batch keyword.”

 Maybe what we really need is a “client/driver-side batch”, which is
 simply a way to collect “batches” of operations in the client/driver and
 then let the driver determine what degree of batching and asynchronous
 operation is appropriate.

 It might also be nice to have an inquiry for the cluster as to what
 batch size is most optimal for the cluster, like number of mutations in a
 batch and number of simultaneous connections, and to have that be dynamic
 based on overall cluster load.

 I would also note that the example in the spec has multiple inserts
 with different partition key values, which flies in the face of the
 admonition to refrain from using server-side distribution of requests.

 At a minimum the CQL spec should make a more clear statement of intent
 and non-intent for BATCH.

 -- Jack Krupansky

  *From:* Jonathan Haddad j...@jonhaddad.com
 *Sent:* Friday, December 12, 2014 12:58 PM
 *To:* user@cassandra.apache.org ; Ryan Svihla rsvi...@datastax.com
 *Subject:* Re: batch_size_warn_threshold_in_kb

 The really important thing to take away from Ryan's original post is
 that batches are not there for performance.  The only case I consider
 batches to be useful for is when you absolutely need to know that several
 tables all get a mutation (via logged batches).  The use case for this is
 when you've got multiple tables that are serving as different views of the
 same data.  It is absolutely not going to help you if you're trying to lump
 queries together to reduce network & server overhead - in fact it'll do the
 opposite.  If you're trying to do that, instead perform many async
 queries.  The overhead of batches in Cassandra is significant and you're
 going to hit a lot of problems if you use them excessively (timeouts /
 failures).

 tl;dr: you probably don't want batch, you most likely want many async
 calls

 On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller 
 moham...@glassbeam.com wrote:

  Ryan,

 Thanks for the quick response.



 I did see that jira before posting my question on this list. However,
 I didn’t see any information about why 5kb+ data will cause instability.
 5kb or even 50kb seems too small. For example, if each mutation is 1000+
 bytes, then with just 5 mutations, you will hit that threshold.



 In addition, Patrick is saying that he does not recommend more than
 100 mutations per batch. So why not warn users just on the # of mutations
 in a batch?



 Mohammed



 *From:* Ryan Svihla [mailto:rsvi...@datastax.com]
 *Sent:* Thursday, December 11, 2014 12:56 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: batch_size_warn_threshold_in_kb



 Nothing magic, just put in there based on experience. You can find the
 story behind the original recommendation here



 https://issues.apache.org/jira/browse/CASSANDRA-6487



 Key reasoning for the desire comes from Patrick McFadden:


 Yes that was in bytes. Just in my own experience, I don't recommend
 more than ~100 mutations per batch. Doing some quick math I came up with 
 5k
 as 100 x 50 byte mutations.

 Totally up for debate.



 It's totally changeable; however, it's there in no small part because so
 many people mistake the BATCH keyword for a performance optimization, and
 the warning helps flag those cases of misuse.



 On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller 
 moham...@glassbeam.com wrote:

 Hi –

 The cassandra.yaml file has a property called
 *batch_size_warn_threshold_in_kb*.

 The default size is 5 KB and, according to the comments in the yaml file,
 it is used to log a WARN on any batch whose size exceeds this value in
 kilobytes. It says caution should be taken when increasing this threshold,
 as it can lead to node instability.



 Does anybody know the significance

Re: batch_size_warn_threshold_in_kb

2014-12-13 Thread Jonathan Haddad
-over-50-faster)
 which massively helps performance.  It provides the benefit of batches but
 without the coordinator overhead.

 Can you post your benchmark code?

 On Sat Dec 13 2014 at 6:10:36 AM Jonathan Haddad j...@jonhaddad.com
 wrote:

 There are cases where it can.  For instance, if you batch multiple
 mutations to the same partition (and talk to a replica for that partition)
 they can reduce network overhead because they're effectively a single
 mutation in the eye of the cluster.  However, if you're not doing that (and
 most people aren't!) you end up putting additional pressure on the
 coordinator because now it has to talk to several other servers.  If you
 have 100 servers, and perform a mutation on 100 partitions, you could have
 a coordinator that's

 1) talking to every machine in the cluster and
 2) waiting on a response from a significant portion of them

 before it can respond success or fail.  Any delay, from GC to a bad
 disk, can affect the performance of the entire batch.


 On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky 
 j...@basetechnology.com wrote:

   Jonathan and Ryan,

 Jonathan says “It is absolutely not going to help you if you're trying
 to lump queries together to reduce network & server overhead - in fact
 it'll do the opposite”, but I would note that the CQL3 spec says “The
 BATCH statement ... serves several purposes: 1. It saves network
 round-trips between the client and the server (and sometimes between the
 server coordinator and the replicas) when batching multiple updates.” Is
 the spec inaccurate? I mean, it seems in conflict with your statement.

 See:
 https://cassandra.apache.org/doc/cql3/CQL.html

 I see the spec as gospel – if it’s not accurate, let’s propose a
 change to make it accurate.

 The DataStax CQL doc is more nuanced: “Batching multiple statements
 can save network exchanges between the client/server and server
 coordinator/replicas. However, because of the distributed nature of
 Cassandra, spread requests across nearby nodes as much as possible to
 optimize performance. Using batches to optimize performance is usually not
 successful, as described in Using and misusing batches section. For
 information about the fastest way to load data, see Cassandra: Batch
 loading without the Batch keyword.”

 Maybe what we really need is a “client/driver-side batch”, which is
 simply a way to collect “batches” of operations in the client/driver and
 then let the driver determine what degree of batching and asynchronous
 operation is appropriate.

 It might also be nice to have an inquiry for the cluster as to what
 batch size is most optimal for the cluster, like number of mutations in a
 batch and number of simultaneous connections, and to have that be dynamic
 based on overall cluster load.

 I would also note that the example in the spec has multiple inserts
 with different partition key values, which flies in the face of the
 admonition to refrain from using server-side distribution of requests.

 At a minimum the CQL spec should make a more clear statement of intent
 and non-intent for BATCH.

 -- Jack Krupansky

  *From:* Jonathan Haddad j...@jonhaddad.com
 *Sent:* Friday, December 12, 2014 12:58 PM
 *To:* user@cassandra.apache.org ; Ryan Svihla rsvi...@datastax.com
 *Subject:* Re: batch_size_warn_threshold_in_kb

 The really important thing to take away from Ryan's original post is
 that batches are not there for performance.  The only case I consider
 batches to be useful for is when you absolutely need to know that several
 tables all get a mutation (via logged batches).  The use case for this is
 when you've got multiple tables that are serving as different views of the
 same data.  It is absolutely not going to help you if you're trying to lump
 queries together to reduce network & server overhead - in fact it'll do the
 opposite.  If you're trying to do that, instead perform many async
 queries.  The overhead of batches in Cassandra is significant and you're
 going to hit a lot of problems if you use them excessively (timeouts /
 failures).

 tl;dr: you probably don't want batch, you most likely want many async
 calls

 On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller 
 moham...@glassbeam.com wrote:

  Ryan,

 Thanks for the quick response.



 I did see that jira before posting my question on this list. However,
 I didn’t see any information about why 5kb+ data will cause instability.
 5kb or even 50kb seems too small. For example, if each mutation is 1000+
 bytes, then with just 5 mutations, you will hit that threshold.



 In addition, Patrick is saying that he does not recommend more than
 100 mutations per batch. So why not warn users just on the # of mutations
 in a batch?



 Mohammed



 *From:* Ryan Svihla [mailto:rsvi...@datastax.com]
 *Sent:* Thursday, December 11, 2014 12:56 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: batch_size_warn_threshold_in_kb



 Nothing magic, just put in there based on experience. You can find

Re: batch_size_warn_threshold_in_kb

2014-12-13 Thread Eric Stevens
: Batch
 loading without the Batch keyword.”

 Maybe what we really need is a “client/driver-side batch”, which is
 simply a way to collect “batches” of operations in the client/driver and
 then let the driver determine what degree of batching and asynchronous
 operation is appropriate.

 It might also be nice to have an inquiry for the cluster as to what
 batch size is most optimal for the cluster, like number of mutations in a
 batch and number of simultaneous connections, and to have that be dynamic
 based on overall cluster load.

 I would also note that the example in the spec has multiple inserts
 with different partition key values, which flies in the face of the
 admonition to refrain from using server-side distribution of requests.

 At a minimum the CQL spec should make a more clear statement of intent
 and non-intent for BATCH.

 -- Jack Krupansky

  *From:* Jonathan Haddad j...@jonhaddad.com
 *Sent:* Friday, December 12, 2014 12:58 PM
 *To:* user@cassandra.apache.org ; Ryan Svihla rsvi...@datastax.com
 *Subject:* Re: batch_size_warn_threshold_in_kb

 The really important thing to take away from Ryan's original post is
 that batches are not there for performance.  The only case I consider
 batches to be useful for is when you absolutely need to know that several
 tables all get a mutation (via logged batches).  The use case for this is
 when you've got multiple tables that are serving as different views of the
 same data.  It is absolutely not going to help you if you're trying to lump
 queries together to reduce network & server overhead - in fact it'll do the
 opposite.  If you're trying to do that, instead perform many async
 queries.  The overhead of batches in Cassandra is significant and you're
 going to hit a lot of problems if you use them excessively (timeouts /
 failures).

 tl;dr: you probably don't want batch, you most likely want many async
 calls

 On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller 
 moham...@glassbeam.com wrote:

  Ryan,

 Thanks for the quick response.



 I did see that jira before posting my question on this list. However,
 I didn’t see any information about why 5kb+ data will cause instability.
 5kb or even 50kb seems too small. For example, if each mutation is 1000+
 bytes, then with just 5 mutations, you will hit that threshold.



 In addition, Patrick is saying that he does not recommend more than
 100 mutations per batch. So why not warn users just on the # of mutations
 in a batch?



 Mohammed



 *From:* Ryan Svihla [mailto:rsvi...@datastax.com]
 *Sent:* Thursday, December 11, 2014 12:56 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: batch_size_warn_threshold_in_kb



 Nothing magic, just put in there based on experience. You can find
 the story behind the original recommendation here



 https://issues.apache.org/jira/browse/CASSANDRA-6487



 Key reasoning for the desire comes from Patrick McFadden:


 Yes that was in bytes. Just in my own experience, I don't recommend
 more than ~100 mutations per batch. Doing some quick math I came up with 
 5k
 as 100 x 50 byte mutations.

 Totally up for debate.



 It's totally changeable; however, it's there in no small part because so
 many people mistake the BATCH keyword for a performance optimization, and
 the warning helps flag those cases of misuse.



 On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller 
 moham...@glassbeam.com wrote:

 Hi –

 The cassandra.yaml file has a property called
 *batch_size_warn_threshold_in_kb*.

 The default size is 5 KB and, according to the comments in the yaml file,
 it is used to log a WARN on any batch whose size exceeds this value in
 kilobytes. It says caution should be taken when increasing this threshold,
 as it can lead to node instability.



 Does anybody know the significance of this magic number 5kb? Why
 would a higher number (say 10kb) lead to node instability?



 Mohammed




 --

 [image: datastax_logo.png] http://www.datastax.com/

 Ryan Svihla

 Solution Architect


 [image: twitter.png] https://twitter.com/foundev[image:
 linkedin.png] http://www.linkedin.com/pub/ryan-svihla/12/621/727/



 DataStax is the fastest, most scalable distributed database
 technology, delivering Apache Cassandra to the world’s most innovative
 enterprises. Datastax is built to be agile, always-on, and predictably
 scalable to any size. With more than 500 customers in 45 countries, 
 DataStax
 is the database technology and transactional backbone of choice for the
 world's most innovative companies such as Netflix, Adobe, Intuit, and
 eBay.





Re: batch_size_warn_threshold_in_kb

2014-12-13 Thread Eric Stevens
 to resolve our internal
 dependencies.  This may not be today though.

 Also, @Ryan, I don't think that shuffling would make a difference for my
 tests above since, as Jon observed, all my nodes were already replicas.


 On Sat, Dec 13, 2014 at 7:37 AM, Ryan Svihla rsvi...@datastax.com
 wrote:

 Also, what happens when you turn on shuffle with token-aware routing?
 http://www.datastax.com/drivers/java/2.1/com/datastax/driver/core/policies/TokenAwarePolicy.html

 On Sat, Dec 13, 2014 at 8:21 AM, Jonathan Haddad j...@jonhaddad.com
 wrote:

 To add to Ryan's (extremely valid!) point, your test works because the
 coordinator is always a replica.  Try again using 20 (or 50) nodes.
 Batching works great at RF=N=3 because it always gets to write to local and
 talk to exactly 2 other servers on every request.  Consider what happens
 when the coordinator needs to talk to 100 servers.  It's unnecessary
 overhead on the server side.

 To save network overhead, Cassandra 2.1 added support for response
 grouping (see
 http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster)
 which massively helps performance.  It provides the benefit of batches but
 without the coordinator overhead.

 Can you post your benchmark code?

 On Sat Dec 13 2014 at 6:10:36 AM Jonathan Haddad j...@jonhaddad.com
 wrote:

 There are cases where it can.  For instance, if you batch multiple
 mutations to the same partition (and talk to a replica for that partition)
 they can reduce network overhead because they're effectively a single
 mutation in the eye of the cluster.  However, if you're not doing that 
 (and
 most people aren't!) you end up putting additional pressure on the
 coordinator because now it has to talk to several other servers.  If you
 have 100 servers, and perform a mutation on 100 partitions, you could have
 a coordinator that's

 1) talking to every machine in the cluster and
 2) waiting on a response from a significant portion of them

 before it can respond success or fail.  Any delay, from GC to a bad
 disk, can affect the performance of the entire batch.
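 Jonathan's point about tail latency can be illustrated with a toy model
 (hypothetical numbers, not driver code): a batch routed through one
 coordinator finishes only when the slowest replica answers, while
 independent async requests each finish on their own.

```python
import random

random.seed(42)  # deterministic toy data

def batch_completion_time(replica_latencies):
    # A batch through one coordinator cannot acknowledge until the
    # slowest replica it touched has responded.
    return max(replica_latencies)

def async_completion_times(replica_latencies):
    # Independent async requests complete as each replica answers;
    # a slow node delays only its own request.
    return sorted(replica_latencies)

# 100 replicas answering in ~5 ms, one stuck in a 200 ms GC pause.
latencies = [random.gauss(5.0, 1.0) for _ in range(100)]
latencies[0] = 200.0

print(batch_completion_time(latencies))       # whole batch waits ~200 ms
print(async_completion_times(latencies)[50])  # median async request unaffected
```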


 On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky 
 j...@basetechnology.com wrote:

   Jonathan and Ryan,

 Jonathan says “It is absolutely not going to help you if you're
 trying to lump queries together to reduce network & server overhead - in
 fact it'll do the opposite”, but I would note that the CQL3 spec says “
 The BATCH statement ... serves several purposes: 1. It saves network
 round-trips between the client and the server (and sometimes between the
 server coordinator and the replicas) when batching multiple updates.” Is
 the spec inaccurate? I mean, it seems in conflict with your statement.

 See:
 https://cassandra.apache.org/doc/cql3/CQL.html

 I see the spec as gospel – if it’s not accurate, let’s propose a
 change to make it accurate.

 The DataStax CQL doc is more nuanced: “Batching multiple statements
 can save network exchanges between the client/server and server
 coordinator/replicas. However, because of the distributed nature of
 Cassandra, spread requests across nearby nodes as much as possible to
 optimize performance. Using batches to optimize performance is usually 
 not
 successful, as described in Using and misusing batches section. For
 information about the fastest way to load data, see Cassandra: Batch
 loading without the Batch keyword.”

 Maybe what we really need is a “client/driver-side batch”, which is
 simply a way to collect “batches” of operations in the client/driver and
 then let the driver determine what degree of batching and asynchronous
 operation is appropriate.

 It might also be nice to have an inquiry for the cluster as to what
 batch size is most optimal for the cluster, like number of mutations in a
 batch and number of simultaneous connections, and to have that be dynamic
 based on overall cluster load.

 I would also note that the example in the spec has multiple inserts
 with different partition key values, which flies in the face of the
 admonition to refrain from using server-side distribution of requests.

 At a minimum the CQL spec should make a more clear statement of
 intent and non-intent for BATCH.

 -- Jack Krupansky

  *From:* Jonathan Haddad j...@jonhaddad.com
 *Sent:* Friday, December 12, 2014 12:58 PM
 *To:* user@cassandra.apache.org ; Ryan Svihla rsvi...@datastax.com
 *Subject:* Re: batch_size_warn_threshold_in_kb

 The really important thing to really take away from Ryan's original
 post is that batches are not there for performance.  The only case I
 consider batches to be useful for is when you absolutely need to know 
 that
 several tables all get a mutation (via logged batches).  The use case for
 this is when you've got multiple tables that are serving as different 
 views
 for data.  It is absolutely not going to help you if you're trying to 
 lump
 queries together to reduce network & server overhead - in fact it'll do 
 the
 opposite.  If you're trying to do that, instead perform many async

Re: batch_size_warn_threshold_in_kb

2014-12-13 Thread Eric Stevens
).iterator().next
 }
   }
   val result =
 Future.traverse(groupByFirstReplica(statements).values).map(st =>
 newBatch(st).executeAsync())
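 Eric's groupByFirstReplica idea can be sketched in a self-contained way
 (Python here; the toy first_replica function stands in for a driver's
 token-aware replica lookup, and all names are illustrative, not real
 driver API):

```python
from collections import defaultdict

def group_by_first_replica(statements, first_replica):
    """Bucket statements by the node that owns their partition key."""
    groups = defaultdict(list)
    for stmt in statements:
        groups[first_replica(stmt)].append(stmt)
    return dict(groups)

nodes = ["node1", "node2", "node3"]

def first_replica(stmt):
    # Deterministic stand-in for the driver's token/replica lookup.
    return nodes[sum(map(ord, stmt["pk"])) % len(nodes)]

statements = [{"pk": "user%d" % i, "value": i} for i in range(9)]
groups = group_by_first_replica(statements, first_replica)

# Each sub-batch can now go to a coordinator that is a replica for every
# statement in it, instead of one coordinator fanning out to the cluster.
print({node: len(stmts) for node, stmts in groups.items()})
```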


 Let me get together my test code, it depends on some existing utilities
 we use elsewhere, such as implicit conversions between Google and Scala
 native futures.  I'll try to put this together in a format that's runnable
 for you in a Scala REPL console without having to resolve our internal
 dependencies.  This may not be today though.

 Also, @Ryan, I don't think that shuffling would make a difference for my
 above tests since as Jon observed, all my nodes were already replicas there.


 On Sat, Dec 13, 2014 at 7:37 AM, Ryan Svihla rsvi...@datastax.com
 wrote:

 Also..what happens when you turn on shuffle with token aware?
 http://www.datastax.com/drivers/java/2.1/com/datastax/driver/core/policies/TokenAwarePolicy.html

 On Sat, Dec 13, 2014 at 8:21 AM, Jonathan Haddad j...@jonhaddad.com
 wrote:

 To add to Ryan's (extremely valid!) point, your test works because the
 coordinator is always a replica.  Try again using 20 (or 50) nodes.
 Batching works great at RF=N=3 because it always gets to write to local 
 and
 talk to exactly 2 other servers on every request.  Consider what happens
 when the coordinator needs to talk to 100 servers.  It's unnecessary
 overhead on the server side.

 To save network overhead, Cassandra 2.1 added support for response
 grouping (see
 http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster)
 which massively helps performance.  It provides the benefit of batches but
 without the coordinator overhead.

 Can you post your benchmark code?

 On Sat Dec 13 2014 at 6:10:36 AM Jonathan Haddad j...@jonhaddad.com
 wrote:

 There are cases where it can.  For instance, if you batch multiple
 mutations to the same partition (and talk to a replica for that 
 partition)
 they can reduce network overhead because they're effectively a single
 mutation in the eye of the cluster.  However, if you're not doing that 
 (and
 most people aren't!) you end up putting additional pressure on the
 coordinator because now it has to talk to several other servers.  If you
 have 100 servers, and perform a mutation on 100 partitions, you could 
 have
 a coordinator that's

 1) talking to every machine in the cluster and
 2) waiting on a response from a significant portion of them

 before it can respond success or fail.  Any delay, from GC to a bad
 disk, can affect the performance of the entire batch.


 On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky 
 j...@basetechnology.com wrote:

   Jonathan and Ryan,

 Jonathan says “It is absolutely not going to help you if you're
 trying to lump queries together to reduce network & server overhead - in
 fact it'll do the opposite”, but I would note that the CQL3 spec says “
 The BATCH statement ... serves several purposes: 1. It saves
 network round-trips between the client and the server (and sometimes
 between the server coordinator and the replicas) when batching multiple
 updates.” Is the spec inaccurate? I mean, it seems in conflict with your
 statement.

 See:
 https://cassandra.apache.org/doc/cql3/CQL.html

 I see the spec as gospel – if it’s not accurate, let’s propose a
 change to make it accurate.

 The DataStax CQL doc is more nuanced: “Batching multiple statements
 can save network exchanges between the client/server and server
 coordinator/replicas. However, because of the distributed nature of
 Cassandra, spread requests across nearby nodes as much as possible to
 optimize performance. Using batches to optimize performance is usually 
 not
 successful, as described in Using and misusing batches section. For
 information about the fastest way to load data, see Cassandra: Batch
 loading without the Batch keyword.”

 Maybe what we really need is a “client/driver-side batch”, which is
 simply a way to collect “batches” of operations in the client/driver and
 then let the driver determine what degree of batching and asynchronous
 operation is appropriate.

 It might also be nice to have an inquiry for the cluster as to what
 batch size is most optimal for the cluster, like number of mutations in 
 a
 batch and number of simultaneous connections, and to have that be 
 dynamic
 based on overall cluster load.

 I would also note that the example in the spec has multiple inserts
 with different partition key values, which flies in the face of the
 admonition to refrain from using server-side distribution of 
 requests.

 At a minimum the CQL spec should make a more clear statement of
 intent and non-intent for BATCH.

 -- Jack Krupansky

  *From:* Jonathan Haddad j...@jonhaddad.com
 *Sent:* Friday, December 12, 2014 12:58 PM
 *To:* user@cassandra.apache.org ; Ryan Svihla rsvi...@datastax.com
 *Subject:* Re: batch_size_warn_threshold_in_kb

 The really important thing to really take away from Ryan's original
 post is that batches are not there for performance.  The only case I
 consider batches to be useful

Re: batch_size_warn_threshold_in_kb

2014-12-13 Thread Jonathan Haddad
: batch_size_warn_threshold_in_kb

 The really important thing to really take away from Ryan's original
 post is that batches are not there for performance.  The only case I
 consider batches to be useful for is when you absolutely need to know 
 that
 several tables all get a mutation (via logged batches).  The use case 
 for
 this is when you've got multiple tables that are serving as different 
 views
 for data.  It is absolutely not going to help you if you're trying to 
 lump
 queries together to reduce network & server overhead - in fact it'll do 
 the
 opposite.  If you're trying to do that, instead perform many async
 queries.  The overhead of batches in cassandra is significant and you're
 going to hit a lot of problems if you use them excessively (timeouts /
 failures).

 tl;dr: you probably don't want batch, you most likely want many
 async calls
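 A driver-agnostic sketch of "many async calls": the thread pool below
 stands in for a driver's executeAsync()/execute_async() futures, and
 insert() is a placeholder rather than a real driver call.

```python
from concurrent.futures import ThreadPoolExecutor

def insert(row):
    # Placeholder for session.execute(prepared_stmt, row); a real driver
    # would return a future from executeAsync() instead.
    return ("ok", row["id"])

rows = [{"id": i} for i in range(100)]

# Fire off all writes concurrently, then wait on each result. The load
# spreads across coordinators instead of funnelling through one batch.
with ThreadPoolExecutor(max_workers=16) as pool:
    futures = [pool.submit(insert, r) for r in rows]
    results = [f.result() for f in futures]

print(len(results))  # 100
```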

 On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller 
 moham...@glassbeam.com wrote:

  Ryan,

 Thanks for the quick response.



 I did see that jira before posting my question on this list.
 However, I didn’t see any information about why 5kb+ data will cause
 instability. 5kb or even 50kb seems too small. For example, if each
 mutation is 1000+ bytes, then with just 5 mutations, you will hit that
 threshold.



 In addition, Patrick is saying that he does not recommend more than
 100 mutations per batch. So why not warn users just on the # of 
 mutations
 in a batch?



 Mohammed



 *From:* Ryan Svihla [mailto:rsvi...@datastax.com]
 *Sent:* Thursday, December 11, 2014 12:56 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: batch_size_warn_threshold_in_kb



 Nothing magic, just put in there based on experience. You can find
 the story behind the original recommendation here



 https://issues.apache.org/jira/browse/CASSANDRA-6487



 Key reasoning for the desire comes from Patrick McFadden:


 Yes that was in bytes. Just in my own experience, I don't
 recommend more than ~100 mutations per batch. Doing some quick math I 
 came
 up with 5k as 100 x 50 byte mutations.

 Totally up for debate.



 It's totally changeable, however, it's there in no small part
 because so many people confuse the BATCH keyword as a performance
 optimization, this helps flag those cases of misuse.
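 Patrick's arithmetic above can be checked directly (a back-of-envelope
 sketch, taking 5 KB as 5 * 1024 bytes):

```python
# The 5 KB default comes from ~100 mutations of ~50 bytes each; at
# 1000-byte mutations the same threshold trips after only a handful.
WARN_THRESHOLD_BYTES = 5 * 1024

def mutations_before_warning(avg_mutation_bytes):
    return WARN_THRESHOLD_BYTES // avg_mutation_bytes

print(mutations_before_warning(50))    # 102: close to the ~100-mutation guideline
print(mutations_before_warning(1000))  # 5: only a few large mutations
```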



 On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller 
 moham...@glassbeam.com wrote:

 Hi –

 The cassandra.yaml file has property called 
 *batch_size_warn_threshold_in_kb.
 *

 The default size is 5kb and according to the comments in the yaml
 file, it is used to log WARN on any batch size exceeding this value in
 kilobytes. It says caution should be taken on increasing the size of 
 this
 threshold as it can lead to node instability.



 Does anybody know the significance of this magic number 5kb? Why
 would a higher number (say 10kb) lead to node instability?



 Mohammed










Re: batch_size_warn_threshold_in_kb

2014-12-13 Thread Jonathan Haddad
 of
 intent and non-intent for BATCH.

 -- Jack Krupansky

  *From:* Jonathan Haddad j...@jonhaddad.com
 *Sent:* Friday, December 12, 2014 12:58 PM
 *To:* user@cassandra.apache.org ; Ryan Svihla
 rsvi...@datastax.com
 *Subject:* Re: batch_size_warn_threshold_in_kb

 The really important thing to really take away from Ryan's original
 post is that batches are not there for performance.  The only case I
 consider batches to be useful for is when you absolutely need to know 
 that
 several tables all get a mutation (via logged batches).  The use case 
 for
 this is when you've got multiple tables that are serving as different 
 views
 for data.  It is absolutely not going to help you if you're trying to 
 lump
 queries together to reduce network & server overhead - in fact it'll 
 do the
 opposite.  If you're trying to do that, instead perform many async
 queries.  The overhead of batches in cassandra is significant and 
 you're
 going to hit a lot of problems if you use them excessively (timeouts /
 failures).

 tl;dr: you probably don't want batch, you most likely want many
 async calls

 On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller 
 moham...@glassbeam.com wrote:

  Ryan,

 Thanks for the quick response.



 I did see that jira before posting my question on this list.
 However, I didn’t see any information about why 5kb+ data will cause
 instability. 5kb or even 50kb seems too small. For example, if each
 mutation is 1000+ bytes, then with just 5 mutations, you will hit that
 threshold.



 In addition, Patrick is saying that he does not recommend more
 than 100 mutations per batch. So why not warn users just on the # of
 mutations in a batch?



 Mohammed



 *From:* Ryan Svihla [mailto:rsvi...@datastax.com]
 *Sent:* Thursday, December 11, 2014 12:56 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: batch_size_warn_threshold_in_kb



 Nothing magic, just put in there based on experience. You can find
 the story behind the original recommendation here



 https://issues.apache.org/jira/browse/CASSANDRA-6487



 Key reasoning for the desire comes from Patrick McFadden:


 Yes that was in bytes. Just in my own experience, I don't
 recommend more than ~100 mutations per batch. Doing some quick math I 
 came
 up with 5k as 100 x 50 byte mutations.

 Totally up for debate.



 It's totally changeable, however, it's there in no small part
 because so many people confuse the BATCH keyword as a performance
 optimization, this helps flag those cases of misuse.



 On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller 
 moham...@glassbeam.com wrote:

 Hi –

 The cassandra.yaml file has property called 
 *batch_size_warn_threshold_in_kb.
 *

 The default size is 5kb and according to the comments in the yaml
 file, it is used to log WARN on any batch size exceeding this value in
 kilobytes. It says caution should be taken on increasing the size of 
 this
 threshold as it can lead to node instability.



 Does anybody know the significance of this magic number 5kb? Why
 would a higher number (say 10kb) lead to node instability?



 Mohammed










Re: batch_size_warn_threshold_in_kb

2014-12-12 Thread Ryan Svihla
It's a rough observation and estimate, nothing more. In other words, some
clusters can handle more, some can't, it depends on how many writes per
second you're doing, cluster sizing, how far over that 5kb limit you are,
heap size, disk IO, cpu speed, and many more factors. This is why it's just
a warning and not an error, and it's something that's changeable.

There is no one perfect answer here, but I can safely say in practice with
today's hardware, I've not seen many clusters work well with more than 5kb
writes.


On Fri, Dec 12, 2014 at 1:12 AM, Mohammed Guller moham...@glassbeam.com
wrote:

  Ryan,

 Thanks for the quick response.



 I did see that jira before posting my question on this list. However, I
 didn’t see any information about why 5kb+ data will cause instability. 5kb
 or even 50kb seems too small. For example, if each mutation is 1000+ bytes,
 then with just 5 mutations, you will hit that threshold.



 In addition, Patrick is saying that he does not recommend more than 100
 mutations per batch. So why not warn users just on the # of mutations in a
 batch?



 Mohammed



 *From:* Ryan Svihla [mailto:rsvi...@datastax.com]
 *Sent:* Thursday, December 11, 2014 12:56 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: batch_size_warn_threshold_in_kb



 Nothing magic, just put in there based on experience. You can find the
 story behind the original recommendation here



 https://issues.apache.org/jira/browse/CASSANDRA-6487



 Key reasoning for the desire comes from Patrick McFadden:


 Yes that was in bytes. Just in my own experience, I don't recommend more
 than ~100 mutations per batch. Doing some quick math I came up with 5k as
 100 x 50 byte mutations.

 Totally up for debate.



 It's totally changeable, however, it's there in no small part because so
 many people confuse the BATCH keyword as a performance optimization, this
 helps flag those cases of misuse.



 On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller moham...@glassbeam.com
 wrote:

 Hi –

 The cassandra.yaml file has property called *batch_size_warn_threshold_in_kb.
 *

 The default size is 5kb and according to the comments in the yaml file, it
 is used to log WARN on any batch size exceeding this value in kilobytes. It
 says caution should be taken on increasing the size of this threshold as it
 can lead to node instability.



 Does anybody know the significance of this magic number 5kb? Why would a
 higher number (say 10kb) lead to node instability?



 Mohammed






Re: batch_size_warn_threshold_in_kb

2014-12-12 Thread Ryan Svihla
Any insert, update, or delete

On Fri, Dec 12, 2014 at 1:31 AM, Jens Rantil jens.ran...@tink.se wrote:

 Maybe slightly off-topic, but what is a mutation? Is it equivalent to a
 CQL row? Or maybe a column in a row? Does it include tombstones within the
 selected range?

 Thanks,
 Jens



 On Thu, Dec 11, 2014 at 9:56 PM, Ryan Svihla rsvi...@datastax.com wrote:

 Nothing magic, just put in there based on experience. You can find the
 story behind the original recommendation here

 https://issues.apache.org/jira/browse/CASSANDRA-6487

 Key reasoning for the desire comes from Patrick McFadden:

 Yes that was in bytes. Just in my own experience, I don't recommend more
 than ~100 mutations per batch. Doing some quick math I came up with 5k as
 100 x 50 byte mutations.

 Totally up for debate.

 It's totally changeable, however, it's there in no small part because so
 many people confuse the BATCH keyword as a performance optimization, this
 helps flag those cases of misuse.

 On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller moham...@glassbeam.com
 wrote:

   Hi –

 The cassandra.yaml file has property called 
 *batch_size_warn_threshold_in_kb.
 *

 The default size is 5kb and according to the comments in the yaml file,
 it is used to log WARN on any batch size exceeding this value in kilobytes.
 It says caution should be taken on increasing the size of this threshold as
 it can lead to node instability.



 Does anybody know the significance of this magic number 5kb? Why would a
 higher number (say 10kb) lead to node instability?



 Mohammed





Re: batch_size_warn_threshold_in_kb

2014-12-12 Thread Jonathan Haddad
The really important thing to really take away from Ryan's original post is
that batches are not there for performance.  The only case I consider
batches to be useful for is when you absolutely need to know that several
tables all get a mutation (via logged batches).  The use case for this is
when you've got multiple tables that are serving as different views for
data.  It is absolutely not going to help you if you're trying to lump
queries together to reduce network & server overhead - in fact it'll do the
opposite.  If you're trying to do that, instead perform many async
queries.  The overhead of batches in cassandra is significant and you're
going to hit a lot of problems if you use them excessively (timeouts /
failures).

tl;dr: you probably don't want batch, you most likely want many async calls

On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller moham...@glassbeam.com
wrote:

  Ryan,

 Thanks for the quick response.



 I did see that jira before posting my question on this list. However, I
 didn’t see any information about why 5kb+ data will cause instability. 5kb
 or even 50kb seems too small. For example, if each mutation is 1000+ bytes,
 then with just 5 mutations, you will hit that threshold.



 In addition, Patrick is saying that he does not recommend more than 100
 mutations per batch. So why not warn users just on the # of mutations in a
 batch?



 Mohammed



 *From:* Ryan Svihla [mailto:rsvi...@datastax.com]
 *Sent:* Thursday, December 11, 2014 12:56 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: batch_size_warn_threshold_in_kb



 Nothing magic, just put in there based on experience. You can find the
 story behind the original recommendation here



 https://issues.apache.org/jira/browse/CASSANDRA-6487



 Key reasoning for the desire comes from Patrick McFadden:


 Yes that was in bytes. Just in my own experience, I don't recommend more
 than ~100 mutations per batch. Doing some quick math I came up with 5k as
 100 x 50 byte mutations.

 Totally up for debate.



 It's totally changeable, however, it's there in no small part because so
 many people confuse the BATCH keyword as a performance optimization, this
 helps flag those cases of misuse.



 On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller moham...@glassbeam.com
 wrote:

 Hi –

 The cassandra.yaml file has property called *batch_size_warn_threshold_in_kb.
 *

 The default size is 5kb and according to the comments in the yaml file, it
 is used to log WARN on any batch size exceeding this value in kilobytes. It
 says caution should be taken on increasing the size of this threshold as it
 can lead to node instability.



 Does anybody know the significance of this magic number 5kb? Why would a
 higher number (say 10kb) lead to node instability?



 Mohammed









Re: batch_size_warn_threshold_in_kb

2014-12-11 Thread Ryan Svihla
Nothing magic, just put in there based on experience. You can find the
story behind the original recommendation here

https://issues.apache.org/jira/browse/CASSANDRA-6487

Key reasoning for the desire comes from Patrick McFadden:

Yes that was in bytes. Just in my own experience, I don't recommend more
than ~100 mutations per batch. Doing some quick math I came up with 5k as
100 x 50 byte mutations.

Totally up for debate.

It's totally changeable, however, it's there in no small part because so
many people confuse the BATCH keyword as a performance optimization, this
helps flag those cases of misuse.

On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller moham...@glassbeam.com
wrote:

   Hi –

 The cassandra.yaml file has property called *batch_size_warn_threshold_in_kb.
 *

 The default size is 5kb and according to the comments in the yaml file, it
 is used to log WARN on any batch size exceeding this value in kilobytes. It
 says caution should be taken on increasing the size of this threshold as it
 can lead to node instability.



 Does anybody know the significance of this magic number 5kb? Why would a
 higher number (say 10kb) lead to node instability?



 Mohammed





RE: batch_size_warn_threshold_in_kb

2014-12-11 Thread Mohammed Guller
Ryan,
Thanks for the quick response.

I did see that jira before posting my question on this list. However, I didn’t 
see any information about why 5kb+ data will cause instability. 5kb or even 
50kb seems too small. For example, if each mutation is 1000+ bytes, then with 
just 5 mutations, you will hit that threshold.

In addition, Patrick is saying that he does not recommend more than 100 
mutations per batch. So why not warn users just on the # of mutations in a 
batch?

Mohammed

From: Ryan Svihla [mailto:rsvi...@datastax.com]
Sent: Thursday, December 11, 2014 12:56 PM
To: user@cassandra.apache.org
Subject: Re: batch_size_warn_threshold_in_kb

Nothing magic, just put in there based on experience. You can find the story 
behind the original recommendation here

https://issues.apache.org/jira/browse/CASSANDRA-6487

Key reasoning for the desire comes from Patrick McFadden:

Yes that was in bytes. Just in my own experience, I don't recommend more than 
~100 mutations per batch. Doing some quick math I came up with 5k as 100 x 50 
byte mutations.

Totally up for debate.

It's totally changeable, however, it's there in no small part because so many 
people confuse the BATCH keyword as a performance optimization, this helps flag 
those cases of misuse.

On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller 
moham...@glassbeam.com wrote:
Hi –
The cassandra.yaml file has a property called batch_size_warn_threshold_in_kb.
The default size is 5 KB and, according to the comments in the yaml file, it is 
used to log a WARN on any batch size exceeding this value in kilobytes. It says 
caution should be taken when increasing this threshold, as it can lead to node 
instability.

Does anybody know the significance of this magic number 5kb? Why would a higher 
number (say 10kb) lead to node instability?

Mohammed
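[Editor's note: for reference, the setting in question appears in cassandra.yaml along these lines. This is a sketch of the 2.x-era file; the comment wording is paraphrased, not copied from a specific release.]

```
# Log a WARN message on any batch whose serialized size exceeds this
# value in kilobytes. Caution should be taken on increasing the size
# of this threshold, as it can lead to node instability.
batch_size_warn_threshold_in_kb: 5
```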


--

Ryan Svihla
Solution Architect, DataStax (http://www.datastax.com/)
Twitter: https://twitter.com/foundev
LinkedIn: http://www.linkedin.com/pub/ryan-svihla/12/621/727/





Re: batch_size_warn_threshold_in_kb

2014-12-11 Thread Jens Rantil
Maybe slightly off-topic, but what is a mutation? Is it equivalent to a CQL 
row? Or maybe a column in a row? Does it include tombstones within the selected 
range?

Thanks,
Jens

On Thu, Dec 11, 2014 at 9:56 PM, Ryan Svihla rsvi...@datastax.com wrote:

 Nothing magic, just put in there based on experience. You can find the
 story behind the original recommendation here
 https://issues.apache.org/jira/browse/CASSANDRA-6487
 Key reasoning for the desire comes from Patrick McFadden:
 Yes that was in bytes. Just in my own experience, I don't recommend more
 than ~100 mutations per batch. Doing some quick math I came up with 5k as
 100 x 50 byte mutations.
 Totally up for debate.
 It's totally changeable; however, it's there in no small part because so
 many people mistake the BATCH keyword for a performance optimization, and
 this warning helps flag those cases of misuse.
 On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller moham...@glassbeam.com
 wrote:

   Hi –

 The cassandra.yaml file has a property called *batch_size_warn_threshold_in_kb*.

 The default size is 5kb and according to the comments in the yaml file, it
 is used to log WARN on any batch size exceeding this value in kilobytes. It
 says caution should be taken on increasing the size of this threshold as it
 can lead to node instability.



 Does anybody know the significance of this magic number 5kb? Why would a
 higher number (say 10kb) lead to node instability?



 Mohammed

 --
 Ryan Svihla
 Solution Architect, DataStax (http://www.datastax.com/)
 Twitter: https://twitter.com/foundev
 LinkedIn: http://www.linkedin.com/pub/ryan-svihla/12/621/727/