Re: batch_size_warn_threshold_in_kb
the driver determine what degree of batching and asynchronous operation is appropriate.

It might also be nice to have an inquiry for the cluster as to what batch size is optimal for the cluster, like number of mutations in a batch and number of simultaneous connections, and to have that be dynamic based on overall cluster load.

I would also note that the example in the spec has multiple inserts with different partition key values, which flies in the face of the admonition to refrain from using server-side distribution of requests.

At a minimum the CQL spec should make a clearer statement of intent and non-intent for BATCH.

-- Jack Krupansky

*From:* Jonathan Haddad j...@jonhaddad.com
*Sent:* Friday, December 12, 2014 12:58 PM
*To:* user@cassandra.apache.org ; Ryan Svihla rsvi...@datastax.com
*Subject:* Re: batch_size_warn_threshold_in_kb

The really important thing to take away from Ryan's original post is that batches are not there for performance. The only case I consider batches to be useful for is when you absolutely need to know that several tables all get a mutation (via logged batches). The use case for this is when you've got multiple tables that are serving as different views for data. It is absolutely not going to help you if you're trying to lump queries together to reduce network server overhead - in fact it'll do the opposite. If you're trying to do that, instead perform many async queries.

The overhead of batches in Cassandra is significant and you're going to hit a lot of problems if you use them excessively (timeouts / failures).

tl;dr: you probably don't want batch, you most likely want many async calls

On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller moham...@glassbeam.com wrote:

Ryan,

Thanks for the quick response.

I did see that JIRA before posting my question on this list. However, I didn’t see any information about why 5kb+ data will cause instability. 5kb or even 50kb seems too small. For example, if each mutation is 1000+ bytes, then with just 5 mutations, you will hit that threshold.

In addition, Patrick is saying that he does not recommend more than 100 mutations per batch. So why not warn users just on the # of mutations in a batch?

Mohammed

*From:* Ryan Svihla [mailto:rsvi...@datastax.com]
*Sent:* Thursday, December 11, 2014 12:56 PM
*To:* user@cassandra.apache.org
*Subject:* Re: batch_size_warn_threshold_in_kb

Nothing magic, just put in there based on experience. You can find the story behind the original recommendation here:

https://issues.apache.org/jira/browse/CASSANDRA-6487

Key reasoning for the desire comes from Patrick McFadden:

Yes that was in bytes. Just in my own experience, I don't recommend more than ~100 mutations per batch. Doing some quick math I came up with 5k as 100 x 50 byte mutations. Totally up for debate.

It's totally changeable; however, it's there in no small part because so many people mistake the BATCH keyword for a performance optimization, and this helps flag those cases of misuse.

On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller moham...@glassbeam.com wrote:

Hi –

The cassandra.yaml file has a property called *batch_size_warn_threshold_in_kb*. The default size is 5kb and, according to the comments in the yaml file, it is used to log WARN on any batch size exceeding this value in kilobytes. It says caution should be taken on increasing the size of this threshold as it can lead to node instability.

Does anybody know the significance of this magic number 5kb? Why would a higher number (say 10kb) lead to node instability?

Mohammed

--
Ryan Svihla
Solution Architect
http://www.datastax.com/
https://twitter.com/foundev
http://www.linkedin.com/pub/ryan-svihla/12/621/727/

DataStax is the fastest, most scalable distributed database technology, delivering Apache Cassandra to the world’s most innovative enterprises. DataStax is built to be agile, always-on, and predictably scalable to any size. With more than 500 customers in 45 countries, DataStax is the database technology and transactional backbone of choice for the world's most innovative companies such as Netflix, Adobe, Intuit, and eBay.
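To make the size arithmetic in this exchange concrete, here is a minimal sketch of how many equally sized mutations it takes to reach the warn threshold. The function name and the `bytes_per_kb` parameter are made up for illustration; whether the server counts a kilobyte as 1024 or 1000 bytes is an assumption here, so both are shown.

```python
def mutations_to_reach_threshold(mutation_bytes, threshold_kb=5, bytes_per_kb=1024):
    """Smallest number of equally sized mutations whose total size reaches the threshold.

    threshold_kb mirrors the default of 5 mentioned in the thread.
    bytes_per_kb is an assumption (1024 by default); pass 1000 to reproduce
    the round numbers used in the discussion.
    """
    threshold_bytes = threshold_kb * bytes_per_kb
    # Integer ceiling division: -(-a // b) rounds up without floating point.
    return -(-threshold_bytes // mutation_bytes)


if __name__ == "__main__":
    # "5k as 100 x 50 byte mutations" with a 1000-byte kilobyte
    print(mutations_to_reach_threshold(50, bytes_per_kb=1000))
    # "1000+ bytes ... just 5 mutations" with a 1000-byte kilobyte
    print(mutations_to_reach_threshold(1000, bytes_per_kb=1000))
    # The same two cases with a 1024-byte kilobyte
    print(mutations_to_reach_threshold(50))
    print(mutations_to_reach_threshold(1000))
```

The point of the comparison is the one Mohammed raises: a byte-based threshold and a mutation-count guideline only agree for one particular mutation size, so large rows trip the warning after a handful of mutations.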
Re: batch_size_warn_threshold_in_kb
Jonathan and Ryan,

Jonathan says “It is absolutely not going to help you if you're trying to lump queries together to reduce network server overhead - in fact it'll do the opposite”, but I would note that the CQL3 spec says “The BATCH statement ... serves several purposes: 1. It saves network round-trips between the client and the server (and sometimes between the server coordinator and the replicas) when batching multiple updates.” Is the spec inaccurate? I mean, it seems in conflict with your statement.

See: https://cassandra.apache.org/doc/cql3/CQL.html

I see the spec as gospel – if it’s not accurate, let’s propose a change to make it accurate.

The DataStax CQL doc is more nuanced: “Batching multiple statements can save network exchanges between the client/server and server coordinator/replicas. However, because of the distributed nature of Cassandra, spread requests across nearby nodes as much as possible to optimize performance. Using batches to optimize performance is usually not successful, as described in Using and misusing batches section. For information about the fastest way to load data, see Cassandra: Batch loading without the Batch keyword.”

Maybe what we really need is a “client/driver-side batch”, which is simply a way to collect “batches” of operations in the client/driver and then let the driver determine what degree of batching and asynchronous operation is appropriate.

It might also be nice to have an inquiry for the cluster as to what batch size is optimal for the cluster, like number of mutations in a batch and number of simultaneous connections, and to have that be dynamic based on overall cluster load.

I would also note that the example in the spec has multiple inserts with different partition key values, which flies in the face of the admonition to refrain from using server-side distribution of requests.

At a minimum the CQL spec should make a clearer statement of intent and non-intent for BATCH.

-- Jack Krupansky

From: Jonathan Haddad
Sent: Friday, December 12, 2014 12:58 PM
To: user@cassandra.apache.org ; Ryan Svihla
Subject: Re: batch_size_warn_threshold_in_kb

The really important thing to take away from Ryan's original post is that batches are not there for performance. The only case I consider batches to be useful for is when you absolutely need to know that several tables all get a mutation (via logged batches). The use case for this is when you've got multiple tables that are serving as different views for data. It is absolutely not going to help you if you're trying to lump queries together to reduce network server overhead - in fact it'll do the opposite. If you're trying to do that, instead perform many async queries.

The overhead of batches in Cassandra is significant and you're going to hit a lot of problems if you use them excessively (timeouts / failures).

tl;dr: you probably don't want batch, you most likely want many async calls
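The "many async calls" advice and the "client/driver-side batch" idea can be sketched together as a small client-side helper that takes a collection of statements and submits each one individually with a bound on how many are in flight. This is only an illustration: `run_many_async`, `execute`, and `max_in_flight` are hypothetical names, `execute` stands in for whatever call a real driver uses to run one statement, and a thread pool is used here simply because it is in the standard library. A real driver's own asynchronous API would take its place.

```python
from concurrent.futures import ThreadPoolExecutor


def run_many_async(execute, statements, max_in_flight=32):
    """Run each statement on its own, with at most max_in_flight running at once.

    execute: hypothetical callable that runs a single statement and returns its result.
    statements: iterable of statements collected on the client side.
    Returns the results in the same order as the statements.
    """
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        # Each statement is an independent request; nothing is lumped into a BATCH.
        futures = [pool.submit(execute, statement) for statement in statements]
        # result() re-raises any exception from the worker, so failures surface here.
        return [future.result() for future in futures]


if __name__ == "__main__":
    # Stand-in for a driver call: just echoes the statement in upper case.
    fake_execute = lambda statement: statement.upper()
    print(run_many_async(fake_execute, ["insert a", "insert b", "insert c"], max_in_flight=2))
```

The in-flight limit is the knob Jack's message asks about; in this sketch it is a fixed number chosen by the caller rather than something the cluster reports.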
Re: batch_size_warn_threshold_in_kb
There are cases where it can. For instance, if you batch multiple mutations to the same partition (and talk to a replica for that partition), they can reduce network overhead because they're effectively a single mutation in the eyes of the cluster. However, if you're not doing that (and most people aren't!), you end up putting additional pressure on the coordinator because now it has to talk to several other servers. If you have 100 servers and perform a mutation on 100 partitions, you could have a coordinator that's 1) talking to every machine in the cluster and 2) waiting on a response from a significant portion of them before it can respond success or fail. Any delay, from GC to a bad disk, can affect the performance of the entire batch.

On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky j...@basetechnology.com wrote:

Jonathan and Ryan,

Jonathan says “It is absolutely not going to help you if you're trying to lump queries together to reduce network server overhead - in fact it'll do the opposite”, but I would note that the CQL3 spec says “The BATCH statement ... serves several purposes: 1. It saves network round-trips between the client and the server (and sometimes between the server coordinator and the replicas) when batching multiple updates.” Is the spec inaccurate? I mean, it seems in conflict with your statement.

See: https://cassandra.apache.org/doc/cql3/CQL.html

I see the spec as gospel – if it’s not accurate, let’s propose a change to make it accurate.
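The distinction drawn in this reply, between mutations that share a partition and mutations spread over many partitions, can be illustrated with a toy model. Everything here is a simplification for illustration: `group_by_partition` and `nodes_contacted` are hypothetical helpers, and the CRC-based placement with one node per partition is a stand-in for the cluster's real partitioner and replication.

```python
import zlib
from collections import defaultdict


def group_by_partition(mutations):
    """Group (partition_key, payload) pairs so each group targets a single partition."""
    groups = defaultdict(list)
    for partition_key, payload in mutations:
        groups[partition_key].append(payload)
    return dict(groups)


def nodes_contacted(partition_keys, node_count):
    """Count distinct nodes a coordinator would reach under a toy placement.

    Assumption: each partition lives on exactly one node chosen by CRC32 of the key.
    Real clusters use a partitioner and a replication factor instead.
    """
    return len({zlib.crc32(str(key).encode()) % node_count for key in partition_keys})


if __name__ == "__main__":
    same_partition = [("sensor-1", n) for n in range(100)]
    many_partitions = [(f"sensor-{n}", n) for n in range(100)]
    print(nodes_contacted(group_by_partition(same_partition), node_count=100))
    print(nodes_contacted(group_by_partition(many_partitions), node_count=100))
```

In the first case the whole batch lands on one node; in the second the coordinator has to wait on many nodes, which is the situation described above where one slow node delays the entire batch.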
Re: batch_size_warn_threshold_in_kb
) = 52,676,830,110 test4 ((aid, bckt), proto, end) no explicit ordering = 54,096,838,258 test5 ((aid, bckt, end)) = 54,657,464,976 test3 ((aid, bckt), end, proto) reverse order= 55,668,202,827 Execution Results for 25 runs of 5 records = 25 runs of 50,000 records (3 protos, 5 agents, ~15 per bucket) in batches of 100 using strategy traverse Total Run Time test3 ((aid, bckt), end, proto) reverse order= 9,633,141,094 test4 ((aid, bckt), proto, end) no explicit ordering = 12,519,381,544 test2 ((aid, bckt), end) = 12,653,843,637 test1 ((aid, bckt), proto, end) reverse order= 17,644,182,274 test5 ((aid, bckt, end)) = 27,902,501,534 Execution Results for 25 runs of 5 records = 25 runs of 50,000 records (3 protos, 5 agents, ~15 per bucket) as single statements using strategy parallel Total Run Time test1 ((aid, bckt), proto, end) reverse order= 360,523,086,443 test3 ((aid, bckt), end, proto) reverse order= 364,375,212,413 test4 ((aid, bckt), proto, end) no explicit ordering = 370,989,615,452 test2 ((aid, bckt), end) = 378,368,728,469 test5 ((aid, bckt, end)) = 380,737,675,612 Execution Results for 25 runs of 5 records = 25 runs of 50,000 records (3 protos, 5 agents, ~15 per bucket) in batches of 100 using strategy parallel Total Run Time test3 ((aid, bckt), end, proto) reverse order= 20,971,045,814 test1 ((aid, bckt), proto, end) reverse order= 21,379,583,690 test4 ((aid, bckt), proto, end) no explicit ordering = 21,505,965,087 test2 ((aid, bckt), end) = 24,433,580,144 test5 ((aid, bckt, end)) = 37,346,062,553 On Fri Dec 12 2014 at 11:00:12 AM Jonathan Haddad j...@jonhaddad.com wrote: The really important thing to really take away from Ryan's original post is that batches are not there for performance. The only case I consider batches to be useful for is when you absolutely need to know that several tables all get a mutation (via logged batches). The use case for this is when you've got multiple tables that are serving as different views for data. 
It is absolutely not going to help you if you're trying to lump queries together to reduce network server overhead - in fact it'll do the opposite. If you're trying to do that, instead perform many async queries. The overhead of batches in cassandra is significant and you're going to hit a lot of problems if you use them excessively (timeouts / failures). tl;dr: you probably don't want batch, you most likely want many async calls

On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller moham...@glassbeam.com wrote:

Ryan, Thanks for the quick response. I did see that jira before posting my question on this list. However, I didn’t see any information about why 5kb+ data will cause instability. 5kb or even 50kb seems too small. For example, if each mutation is 1000+ bytes, then with just 5 mutations, you will hit that threshold. In addition, Patrick is saying that he does not recommend more than 100 mutations per batch. So why not warn users just on the # of mutations in a batch? Mohammed

*From:* Ryan Svihla [mailto:rsvi...@datastax.com] *Sent:* Thursday, December 11, 2014 12:56 PM *To:* user@cassandra.apache.org *Subject:* Re: batch_size_warn_threshold_in_kb

Nothing magic, just put in there based on experience. You can find the story behind the original recommendation here https://issues.apache.org/jira/browse/CASSANDRA-6487 Key reasoning for the desire comes from Patrick McFadden: Yes that was in bytes. Just in my own experience, I don't recommend more than ~100 mutations per batch. Doing some quick math I came up with 5k as 100 x 50 byte mutations. Totally up for debate. It's totally changeable, however, it's there in no small part because so many people confuse the BATCH keyword as a performance optimization, this helps flag those cases of misuse.

On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller moham...@glassbeam.com wrote:

Hi – The cassandra.yaml file has property called *batch_size_warn_threshold_in_kb*. The default size is 5kb and according to the comments in the yaml file, it is used to log WARN on any batch size exceeding this value in kilobytes. It says caution should be taken on increasing the size of this threshold as it can lead to node instability. Does anybody know the significance of this magic number 5kb? Why would a higher number (say 10kb) lead to node instability? Mohammed
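
For anyone wanting to check the arithmetic quoted above (100 mutations of roughly 50 bytes, or about 5 mutations of 1000+ bytes), here is a minimal sketch. The object and method names are made up for illustration, and treating 1 KB as 1024 bytes is an assumption; the thread itself only says "5kb".

```scala
object BatchThreshold {
  // Default value of batch_size_warn_threshold_in_kb as stated in the thread.
  val WarnThresholdKb = 5

  /** How many mutations of the given size fit in a batch without exceeding the threshold.
    * Assumption: 1 KB = 1024 bytes.
    */
  def mutationsBeforeWarn(mutationBytes: Int, thresholdKb: Int = WarnThresholdKb): Int =
    (thresholdKb * 1024) / mutationBytes
}

// BatchThreshold.mutationsBeforeWarn(50)   gives 102 (close to Patrick's ~100 mutations)
// BatchThreshold.mutationsBeforeWarn(1000) gives 5   (Mohammed's example)
```

So the same byte threshold corresponds to very different mutation counts depending on row size, which is the point Mohammed raises.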
Re: batch_size_warn_threshold_in_kb
To add to Ryan's (extremely valid!) point, your test works because the coordinator is always a replica. Try again using 20 (or 50) nodes. Batching works great at RF=N=3 because it always gets to write to local and talk to exactly 2 other servers on every request. Consider what happens when the coordinator needs to talk to 100 servers. It's unnecessary overhead on the server side. To save network overhead, Cassandra 2.1 added support for response grouping (see http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster) which massively helps performance. It provides the benefit of batches but without the coordinator overhead. Can you post your benchmark code? On Sat Dec 13 2014 at 6:10:36 AM Jonathan Haddad j...@jonhaddad.com wrote: There are cases where it can. For instance, if you batch multiple mutations to the same partition (and talk to a replica for that partition) they can reduce network overhead because they're effectively a single mutation in the eye of the cluster. However, if you're not doing that (and most people aren't!) you end up putting additional pressure on the coordinator because now it has to talk to several other servers. If you have 100 servers, and perform a mutation on 100 partitions, you could have a coordinator that's 1) talking to every machine in the cluster and 2) waiting on a response from a significant portion of them before it can respond success or fail. Any delay, from GC to a bad disk, can affect the performance of the entire batch. On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky j...@basetechnology.com wrote: Jonathan and Ryan, Jonathan says “It is absolutely not going to help you if you're trying to lump queries together to reduce network server overhead - in fact it'll do the opposite”, but I would note that the CQL3 spec says “The BATCH statement ... serves several purposes: 1.
It saves network round-trips between the client and the server (and sometimes between the server coordinator and the replicas) when batching multiple updates.” Is the spec inaccurate? I mean, it seems in conflict with your statement. See: https://cassandra.apache.org/doc/cql3/CQL.html I see the spec as gospel – if it’s not accurate, let’s propose a change to make it accurate. The DataStax CQL doc is more nuanced: “Batching multiple statements can save network exchanges between the client/server and server coordinator/replicas. However, because of the distributed nature of Cassandra, spread requests across nearby nodes as much as possible to optimize performance. Using batches to optimize performance is usually not successful, as described in Using and misusing batches section. For information about the fastest way to load data, see Cassandra: Batch loading without the Batch keyword.” Maybe what we really need is a “client/driver-side batch”, which is simply a way to collect “batches” of operations in the client/driver and then let the driver determine what degree of batching and asynchronous operation is appropriate. It might also be nice to have an inquiry for the cluster as to what batch size is most optimal for the cluster, like number of mutations in a batch and number of simultaneous connections, and to have that be dynamic based on overall cluster load. I would also note that the example in the spec has multiple inserts with different partition key values, which flies in the face of the admonition to refrain from using server-side distribution of requests. At a minimum the CQL spec should make a more clear statement of intent and non-intent for BATCH.
-- Jack Krupansky *From:* Jonathan Haddad j...@jonhaddad.com *Sent:* Friday, December 12, 2014 12:58 PM *To:* user@cassandra.apache.org ; Ryan Svihla rsvi...@datastax.com *Subject:* Re: batch_size_warn_threshold_in_kb The really important thing to really take away from Ryan's original post is that batches are not there for performance. The only case I consider batches to be useful for is when you absolutely need to know that several tables all get a mutation (via logged batches). The use case for this is when you've got multiple tables that are serving as different views for data. It is absolutely not going to help you if you're trying to lump queries together to reduce network server overhead - in fact it'll do the opposite. If you're trying to do that, instead perform many async queries. The overhead of batches in cassandra is significant and you're going to hit a lot of problems if you use them excessively (timeouts / failures). tl;dr: you probably don't want batch, you most likely want many async calls On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller moham...@glassbeam.com wrote: Ryan, Thanks for the quick response. I did see that jira before posting my question on this list. However, I didn’t see any information about why 5kb+ data will cause instability. 5kb or even 50kb seems too small. For example, if each
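
To make the fan-out argument above concrete, here is a toy sketch that counts how many distinct nodes a coordinator would have to reach for a batch. Everything here is an assumption for illustration: the ring is modelled as `hash mod nodes` with the next `rf - 1` nodes as replicas, which is not how Cassandra's partitioner or replication strategies actually place data, and the names are hypothetical.

```scala
object CoordinatorFanOut {
  /** Replica set for a key on a toy ring: the owning node plus the next rf - 1 nodes. */
  def replicas(key: String, nodes: Int, rf: Int): Set[Int] = {
    val owner = math.floorMod(key.hashCode, nodes)
    (0 until rf).map(i => (owner + i) % nodes).toSet
  }

  /** Number of distinct nodes that hold at least one partition touched by the batch. */
  def nodesTouched(keys: Seq[String], nodes: Int, rf: Int): Int =
    keys.flatMap(k => replicas(k, nodes, rf)).toSet.size
}

// A batch against one partition only ever touches rf nodes:
//   CoordinatorFanOut.nodesTouched(Seq("a", "a", "a"), 100, 3) == 3
// With RF = N = 3, any batch touches the same 3 nodes, which is why a 3-node test looks good.
// A batch over many partitions on a 100-node ring touches far more than 3 nodes.
```

On a small RF=N cluster the count never grows, while on a larger ring it climbs with the number of distinct partitions in the batch.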
Re: batch_size_warn_threshold_in_kb
Also, what happens when you turn on shuffle with token aware? http://www.datastax.com/drivers/java/2.1/com/datastax/driver/core/policies/TokenAwarePolicy.html On Sat, Dec 13, 2014 at 8:21 AM, Jonathan Haddad j...@jonhaddad.com wrote: To add to Ryan's (extremely valid!) point, your test works because the coordinator is always a replica. Try again using 20 (or 50) nodes. Batching works great at RF=N=3 because it always gets to write to local and talk to exactly 2 other servers on every request. Consider what happens when the coordinator needs to talk to 100 servers. It's unnecessary overhead on the server side. To save network overhead, Cassandra 2.1 added support for response grouping (see http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster) which massively helps performance. It provides the benefit of batches but without the coordinator overhead. Can you post your benchmark code? On Sat Dec 13 2014 at 6:10:36 AM Jonathan Haddad j...@jonhaddad.com wrote: There are cases where it can. For instance, if you batch multiple mutations to the same partition (and talk to a replica for that partition) they can reduce network overhead because they're effectively a single mutation in the eye of the cluster. However, if you're not doing that (and most people aren't!) you end up putting additional pressure on the coordinator because now it has to talk to several other servers. If you have 100 servers, and perform a mutation on 100 partitions, you could have a coordinator that's 1) talking to every machine in the cluster and 2) waiting on a response from a significant portion of them before it can respond success or fail. Any delay, from GC to a bad disk, can affect the performance of the entire batch.
On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky j...@basetechnology.com wrote: Jonathan and Ryan, Jonathan says “It is absolutely not going to help you if you're trying to lump queries together to reduce network server overhead - in fact it'll do the opposite”, but I would note that the CQL3 spec says “The BATCH statement ... serves several purposes: 1. It saves network round-trips between the client and the server (and sometimes between the server coordinator and the replicas) when batching multiple updates.” Is the spec inaccurate? I mean, it seems in conflict with your statement. See: https://cassandra.apache.org/doc/cql3/CQL.html I see the spec as gospel – if it’s not accurate, let’s propose a change to make it accurate. The DataStax CQL doc is more nuanced: “Batching multiple statements can save network exchanges between the client/server and server coordinator/replicas. However, because of the distributed nature of Cassandra, spread requests across nearby nodes as much as possible to optimize performance. Using batches to optimize performance is usually not successful, as described in Using and misusing batches section. For information about the fastest way to load data, see Cassandra: Batch loading without the Batch keyword.” Maybe what we really need is a “client/driver-side batch”, which is simply a way to collect “batches” of operations in the client/driver and then let the driver determine what degree of batching and asynchronous operation is appropriate. It might also be nice to have an inquiry for the cluster as to what batch size is most optimal for the cluster, like number of mutations in a batch and number of simultaneous connections, and to have that be dynamic based on overall cluster load. I would also note that the example in the spec has multiple inserts with different partition key values, which flies in the face of the admonition to refrain from using server-side distribution of requests.
At a minimum the CQL spec should make a more clear statement of intent and non-intent for BATCH. -- Jack Krupansky *From:* Jonathan Haddad j...@jonhaddad.com *Sent:* Friday, December 12, 2014 12:58 PM *To:* user@cassandra.apache.org ; Ryan Svihla rsvi...@datastax.com *Subject:* Re: batch_size_warn_threshold_in_kb The really important thing to really take away from Ryan's original post is that batches are not there for performance. The only case I consider batches to be useful for is when you absolutely need to know that several tables all get a mutation (via logged batches). The use case for this is when you've got multiple tables that are serving as different views for data. It is absolutely not going to help you if you're trying to lump queries together to reduce network server overhead - in fact it'll do the opposite. If you're trying to do that, instead perform many async queries. The overhead of batches in cassandra is significant and you're going to hit a lot of problems if you use them excessively (timeouts / failures). tl;dr: you probably don't want batch, you most likely want many async calls On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller moham...@glassbeam.com
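
As a rough illustration of the "many async calls" advice quoted above, here is a sketch using only Scala's standard futures. `writeAsync` is a hypothetical stand-in for whatever asynchronous execute call the driver offers; it is not a real driver API and simply returns a completed marker here.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object AsyncWrites {
  // Hypothetical placeholder for an asynchronous single-statement write.
  def writeAsync(row: String)(implicit ec: ExecutionContext): Future[String] =
    Future(s"ok:$row")

  /** Issue one independent async write per row and collect the results in input order. */
  def writeAll(rows: Seq[String])(implicit ec: ExecutionContext): Future[Seq[String]] =
    Future.traverse(rows)(row => writeAsync(row))
}

// Example use (blocking only to show the result):
//   implicit val ec: ExecutionContext = ExecutionContext.global
//   Await.result(AsyncWrites.writeAll(Seq("a", "b")), 5.seconds)
```

Each write is sent on its own, so a token-aware client can route it straight to a replica instead of funnelling everything through one coordinator.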
Re: batch_size_warn_threshold_in_kb
:33 AM Jack Krupansky j...@basetechnology.com wrote: Jonathan and Ryan, Jonathan says “It is absolutely not going to help you if you're trying to lump queries together to reduce network server overhead - in fact it'll do the opposite”, but I would note that the CQL3 spec says “The BATCH statement ... serves several purposes: 1. It saves network round-trips between the client and the server (and sometimes between the server coordinator and the replicas) when batching multiple updates.” Is the spec inaccurate? I mean, it seems in conflict with your statement. See: https://cassandra.apache.org/doc/cql3/CQL.html I see the spec as gospel – if it’s not accurate, let’s propose a change to make it accurate. The DataStax CQL doc is more nuanced: “Batching multiple statements can save network exchanges between the client/server and server coordinator/replicas. However, because of the distributed nature of Cassandra, spread requests across nearby nodes as much as possible to optimize performance. Using batches to optimize performance is usually not successful, as described in Using and misusing batches section. For information about the fastest way to load data, see Cassandra: Batch loading without the Batch keyword.” Maybe what we really need is a “client/driver-side batch”, which is simply a way to collect “batches” of operations in the client/driver and then let the driver determine what degree of batching and asynchronous operation is appropriate. It might also be nice to have an inquiry for the cluster as to what batch size is most optimal for the cluster, like number of mutations in a batch and number of simultaneous connections, and to have that be dynamic based on overall cluster load. I would also note that the example in the spec has multiple inserts with different partition key values, which flies in the face of the admonition to refrain from using server-side distribution of requests.
At a minimum the CQL spec should make a more clear statement of intent and non-intent for BATCH. -- Jack Krupansky *From:* Jonathan Haddad j...@jonhaddad.com *Sent:* Friday, December 12, 2014 12:58 PM *To:* user@cassandra.apache.org ; Ryan Svihla rsvi...@datastax.com *Subject:* Re: batch_size_warn_threshold_in_kb The really important thing to really take away from Ryan's original post is that batches are not there for performance. The only case I consider batches to be useful for is when you absolutely need to know that several tables all get a mutation (via logged batches). The use case for this is when you've got multiple tables that are serving as different views for data. It is absolutely not going to help you if you're trying to lump queries together to reduce network server overhead - in fact it'll do the opposite. If you're trying to do that, instead perform many async queries. The overhead of batches in cassandra is significant and you're going to hit a lot of problems if you use them excessively (timeouts / failures). tl;dr: you probably don't want batch, you most likely want many async calls On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller moham...@glassbeam.com wrote: Ryan, Thanks for the quick response. I did see that jira before posting my question on this list. However, I didn’t see any information about why 5kb+ data will cause instability. 5kb or even 50kb seems too small. For example, if each mutation is 1000+ bytes, then with just 5 mutations, you will hit that threshold. In addition, Patrick is saying that he does not recommend more than 100 mutations per batch. So why not warn users just on the # of mutations in a batch? Mohammed *From:* Ryan Svihla [mailto:rsvi...@datastax.com] *Sent:* Thursday, December 11, 2014 12:56 PM *To:* user@cassandra.apache.org *Subject:* Re: batch_size_warn_threshold_in_kb Nothing magic, just put in there based on experience. 
You can find the story behind the original recommendation here https://issues.apache.org/jira/browse/CASSANDRA-6487 Key reasoning for the desire comes from Patrick McFadden: Yes that was in bytes. Just in my own experience, I don't recommend more than ~100 mutations per batch. Doing some quick math I came up with 5k as 100 x 50 byte mutations. Totally up for debate. It's totally changeable, however, it's there in no small part because so many people confuse the BATCH keyword as a performance optimization, this helps flag those cases of misuse. On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller moham...@glassbeam.com wrote: Hi – The cassandra.yaml file has property called *batch_size_warn_threshold_in_kb*. The default size is 5kb and according to the comments in the yaml file, it is used to log WARN on any batch size exceeding this value in kilobytes. It says caution should be taken on increasing the size of this threshold as it can lead to node instability. Does anybody know the significance
Re: batch_size_warn_threshold_in_kb
-over-50-faster) which massively helps performance. It provides the benefit of batches but without the coordinator overhead. Can you post your benchmark code? On Sat Dec 13 2014 at 6:10:36 AM Jonathan Haddad j...@jonhaddad.com wrote: There are cases where it can. For instance, if you batch multiple mutations to the same partition (and talk to a replica for that partition) they can reduce network overhead because they're effectively a single mutation in the eye of the cluster. However, if you're not doing that (and most people aren't!) you end up putting additional pressure on the coordinator because now it has to talk to several other servers. If you have 100 servers, and perform a mutation on 100 partitions, you could have a coordinator that's 1) talking to every machine in the cluster and 2) waiting on a response from a significant portion of them before it can respond success or fail. Any delay, from GC to a bad disk, can affect the performance of the entire batch. On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky j...@basetechnology.com wrote: Jonathan and Ryan, Jonathan says “It is absolutely not going to help you if you're trying to lump queries together to reduce network server overhead - in fact it'll do the opposite”, but I would note that the CQL3 spec says “The BATCH statement ... serves several purposes: 1. It saves network round-trips between the client and the server (and sometimes between the server coordinator and the replicas) when batching multiple updates.” Is the spec inaccurate? I mean, it seems in conflict with your statement. See: https://cassandra.apache.org/doc/cql3/CQL.html I see the spec as gospel – if it’s not accurate, let’s propose a change to make it accurate. The DataStax CQL doc is more nuanced: “Batching multiple statements can save network exchanges between the client/server and server coordinator/replicas.
However, because of the distributed nature of Cassandra, spread requests across nearby nodes as much as possible to optimize performance. Using batches to optimize performance is usually not successful, as described in Using and misusing batches section. For information about the fastest way to load data, see Cassandra: Batch loading without the Batch keyword.” Maybe what we really need is a “client/driver-side batch”, which is simply a way to collect “batches” of operations in the client/driver and then let the driver determine what degree of batching and asynchronous operation is appropriate. It might also be nice to have an inquiry for the cluster as to what batch size is most optimal for the cluster, like number of mutations in a batch and number of simultaneous connections, and to have that be dynamic based on overall cluster load. I would also note that the example in the spec has multiple inserts with different partition key values, which flies in the face of the admonition to refrain from using server-side distribution of requests. At a minimum the CQL spec should make a more clear statement of intent and non-intent for BATCH. -- Jack Krupansky *From:* Jonathan Haddad j...@jonhaddad.com *Sent:* Friday, December 12, 2014 12:58 PM *To:* user@cassandra.apache.org ; Ryan Svihla rsvi...@datastax.com *Subject:* Re: batch_size_warn_threshold_in_kb The really important thing to really take away from Ryan's original post is that batches are not there for performance. The only case I consider batches to be useful for is when you absolutely need to know that several tables all get a mutation (via logged batches). The use case for this is when you've got multiple tables that are serving as different views for data. It is absolutely not going to help you if you're trying to lump queries together to reduce network server overhead - in fact it'll do the opposite. If you're trying to do that, instead perform many async queries.
The overhead of batches in cassandra is significant and you're going to hit a lot of problems if you use them excessively (timeouts / failures). tl;dr: you probably don't want batch, you most likely want many async calls On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller moham...@glassbeam.com wrote: Ryan, Thanks for the quick response. I did see that jira before posting my question on this list. However, I didn’t see any information about why 5kb+ data will cause instability. 5kb or even 50kb seems too small. For example, if each mutation is 1000+ bytes, then with just 5 mutations, you will hit that threshold. In addition, Patrick is saying that he does not recommend more than 100 mutations per batch. So why not warn users just on the # of mutations in a batch? Mohammed *From:* Ryan Svihla [mailto:rsvi...@datastax.com] *Sent:* Thursday, December 11, 2014 12:56 PM *To:* user@cassandra.apache.org *Subject:* Re: batch_size_warn_threshold_in_kb Nothing magic, just put in there based on experience. You can find
Re: batch_size_warn_threshold_in_kb
: Batch loading without the Batch keyword.” Maybe what we really need is a “client/driver-side batch”, which is simply a way to collect “batches” of operations in the client/driver and then let the driver determine what degree of batching and asynchronous operation is appropriate. It might also be nice to have an inquiry for the cluster as to what batch size is most optimal for the cluster, like number of mutations in a batch and number of simultaneous connections, and to have that be dynamic based on overall cluster load. I would also note that the example in the spec has multiple inserts with different partition key values, which flies in the face of the admonition to refrain from using server-side distribution of requests. At a minimum the CQL spec should make a more clear statement of intent and non-intent for BATCH. -- Jack Krupansky *From:* Jonathan Haddad j...@jonhaddad.com *Sent:* Friday, December 12, 2014 12:58 PM *To:* user@cassandra.apache.org ; Ryan Svihla rsvi...@datastax.com *Subject:* Re: batch_size_warn_threshold_in_kb The really important thing to really take away from Ryan's original post is that batches are not there for performance. The only case I consider batches to be useful for is when you absolutely need to know that several tables all get a mutation (via logged batches). The use case for this is when you've got multiple tables that are serving as different views for data. It is absolutely not going to help you if you're trying to lump queries together to reduce network server overhead - in fact it'll do the opposite. If you're trying to do that, instead perform many async queries. The overhead of batches in cassandra is significant and you're going to hit a lot of problems if you use them excessively (timeouts / failures). tl;dr: you probably don't want batch, you most likely want many async calls On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller moham...@glassbeam.com wrote: Ryan, Thanks for the quick response.
I did see that jira before posting my question on this list. However, I didn’t see any information about why 5kb+ data will cause instability. 5kb or even 50kb seems too small. For example, if each mutation is 1000+ bytes, then with just 5 mutations, you will hit that threshold. In addition, Patrick is saying that he does not recommend more than 100 mutations per batch. So why not warn users just on the # of mutations in a batch? Mohammed *From:* Ryan Svihla [mailto:rsvi...@datastax.com] *Sent:* Thursday, December 11, 2014 12:56 PM *To:* user@cassandra.apache.org *Subject:* Re: batch_size_warn_threshold_in_kb Nothing magic, just put in there based on experience. You can find the story behind the original recommendation here https://issues.apache.org/jira/browse/CASSANDRA-6487 Key reasoning for the desire comes from Patrick McFadden: Yes that was in bytes. Just in my own experience, I don't recommend more than ~100 mutations per batch. Doing some quick math I came up with 5k as 100 x 50 byte mutations. Totally up for debate. It's totally changeable, however, it's there in no small part because so many people confuse the BATCH keyword as a performance optimization, this helps flag those cases of misuse. On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller moham...@glassbeam.com wrote: Hi – The cassandra.yaml file has property called *batch_size_warn_threshold_in_kb*. The default size is 5kb and according to the comments in the yaml file, it is used to log WARN on any batch size exceeding this value in kilobytes. It says caution should be taken on increasing the size of this threshold as it can lead to node instability. Does anybody know the significance of this magic number 5kb? Why would a higher number (say 10kb) lead to node instability?
Mohammed -- Ryan Svihla Solution Architect http://www.datastax.com/ https://twitter.com/foundev http://www.linkedin.com/pub/ryan-svihla/12/621/727/ DataStax is the fastest, most scalable distributed database technology, delivering Apache Cassandra to the world’s most innovative enterprises. Datastax is built to be agile, always-on, and predictably scalable to any size. With more than 500 customers in 45 countries, DataStax is the database technology and transactional backbone of choice for the worlds most innovative companies such as Netflix, Adobe, Intuit, and eBay.
Re: batch_size_warn_threshold_in_kb
to resolve our internal dependencies. This may not be today though. Also, @Ryan, I don't think that shuffling would make a difference for my above tests since as Jon observed, all my nodes were already replicas there. On Sat, Dec 13, 2014 at 7:37 AM, Ryan Svihla rsvi...@datastax.com wrote: Also..what happens when you turn on shuffle with token aware? http://www.datastax.com/drivers/java/2.1/com/datastax/driver/core/policies/TokenAwarePolicy.html On Sat, Dec 13, 2014 at 8:21 AM, Jonathan Haddad j...@jonhaddad.com wrote: To add to Ryan's (extremely valid!) point, your test works because the coordinator is always a replica. Try again using 20 (or 50) nodes. Batching works great at RF=N=3 because it always gets to write to local and talk to exactly 2 other servers on every request. Consider what happens when the coordinator needs to talk to 100 servers. It's unnecessary overhead on the server side. To save network overhead, Cassandra 2.1 added support for response grouping (see http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster) which massively helps performance. It provides the benefit of batches but without the coordinator overhead. Can you post your benchmark code? On Sat Dec 13 2014 at 6:10:36 AM Jonathan Haddad j...@jonhaddad.com wrote: There are cases where it can. For instance, if you batch multiple mutations to the same partition (and talk to a replica for that partition) they can reduce network overhead because they're effectively a single mutation in the eye of the cluster. However, if you're not doing that (and most people aren't!) you end up putting additional pressure on the coordinator because now it has to talk to several other servers. If you have 100 servers, and perform a mutation on 100 partitions, you could have a coordinator that's 1) talking to every machine in the cluster and 2) waiting on a response from a significant portion of them before it can respond success or fail.
Any delay, from GC to a bad disk, can affect the performance of the entire batch. On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky j...@basetechnology.com wrote: Jonathan and Ryan, Jonathan says “It is absolutely not going to help you if you're trying to lump queries together to reduce network server overhead - in fact it'll do the opposite”, but I would note that the CQL3 spec says “ The BATCH statement ... serves several purposes: 1. It saves network round-trips between the client and the server (and sometimes between the server coordinator and the replicas) when batching multiple updates.” Is the spec inaccurate? I mean, it seems in conflict with your statement. See: https://cassandra.apache.org/doc/cql3/CQL.html I see the spec as gospel – if it’s not accurate, let’s propose a change to make it accurate. The DataStax CQL doc is more nuanced: “Batching multiple statements can save network exchanges between the client/server and server coordinator/replicas. However, because of the distributed nature of Cassandra, spread requests across nearby nodes as much as possible to optimize performance. Using batches to optimize performance is usually not successful, as described in Using and misusing batches section. For information about the fastest way to load data, see Cassandra: Batch loading without the Batch keyword.” Maybe what we really need is a “client/driver-side batch”, which is simply a way to collect “batches” of operations in the client/driver and then let the driver determine what degree of batching and asynchronous operation is appropriate. It might also be nice to have an inquiry for the cluster as to what batch size is most optimal for the cluster, like number of mutations in a batch and number of simultaneous connections, and to have that be dynamic based on overall cluster load. 
I would also note that the example in the spec has multiple inserts with different partition key values, which flies in the face of the admonition to refrain from using server-side distribution of requests. At a minimum the CQL spec should make a more clear statement of intent and non-intent for BATCH. -- Jack Krupansky *From:* Jonathan Haddad j...@jonhaddad.com *Sent:* Friday, December 12, 2014 12:58 PM *To:* user@cassandra.apache.org ; Ryan Svihla rsvi...@datastax.com *Subject:* Re: batch_size_warn_threshold_in_kb The really important thing to really take away from Ryan's original post is that batches are not there for performance. The only case I consider batches to be useful for is when you absolutely need to know that several tables all get a mutation (via logged batches). The use case for this is when you've got multiple tables that are serving as different views for data. It is absolutely not going to help you if you're trying to lump queries together to reduce network server overhead - in fact it'll do the opposite. If you're trying to do that, instead perform many async
Re: batch_size_warn_threshold_in_kb
).iterator().next } }
val result = Future.traverse(groupByFirstReplica(statements).values)(st => newBatch(st).executeAsync())

Let me get together my test code, it depends on some existing utilities we use elsewhere, such as implicit conversions between Google and Scala native futures. I'll try to put this together in a format that's runnable for you in a Scala REPL console without having to resolve our internal dependencies. This may not be today though. Also, @Ryan, I don't think that shuffling would make a difference for my above tests since as Jon observed, all my nodes were already replicas there. On Sat, Dec 13, 2014 at 7:37 AM, Ryan Svihla rsvi...@datastax.com wrote: Also..what happens when you turn on shuffle with token aware? http://www.datastax.com/drivers/java/2.1/com/datastax/driver/core/policies/TokenAwarePolicy.html On Sat, Dec 13, 2014 at 8:21 AM, Jonathan Haddad j...@jonhaddad.com wrote: To add to Ryan's (extremely valid!) point, your test works because the coordinator is always a replica. Try again using 20 (or 50) nodes. Batching works great at RF=N=3 because it always gets to write to local and talk to exactly 2 other servers on every request. Consider what happens when the coordinator needs to talk to 100 servers. It's unnecessary overhead on the server side. To save network overhead, Cassandra 2.1 added support for response grouping (see http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster) which massively helps performance. It provides the benefit of batches but without the coordinator overhead. Can you post your benchmark code? On Sat Dec 13 2014 at 6:10:36 AM Jonathan Haddad j...@jonhaddad.com wrote: There are cases where it can. For instance, if you batch multiple mutations to the same partition (and talk to a replica for that partition) they can reduce network overhead because they're effectively a single mutation in the eye of the cluster. However, if you're not doing that (and most people aren't!)
you end up putting additional pressure on the coordinator because now it has to talk to several other servers. If you have 100 servers, and perform a mutation on 100 partitions, you could have a coordinator that's 1) talking to every machine in the cluster and b) waiting on a response from a significant portion of them before it can respond success or fail. Any delay, from GC to a bad disk, can affect the performance of the entire batch. On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky j...@basetechnology.com wrote: Jonathan and Ryan, Jonathan says “It is absolutely not going to help you if you're trying to lump queries together to reduce network server overhead - in fact it'll do the opposite”, but I would note that the CQL3 spec says “ The BATCH statement ... serves several purposes: 1. It saves network round-trips between the client and the server (and sometimes between the server coordinator and the replicas) when batching multiple updates.” Is the spec inaccurate? I mean, it seems in conflict with your statement. See: https://cassandra.apache.org/doc/cql3/CQL.html I see the spec as gospel – if it’s not accurate, let’s propose a change to make it accurate. The DataStax CQL doc is more nuanced: “Batching multiple statements can save network exchanges between the client/server and server coordinator/replicas. However, because of the distributed nature of Cassandra, spread requests across nearby nodes as much as possible to optimize performance. Using batches to optimize performance is usually not successful, as described in Using and misusing batches section. For information about the fastest way to load data, see Cassandra: Batch loading without the Batch keyword.” Maybe what we really need is a “client/driver-side batch”, which is simply a way to collect “batches” of operations in the client/driver and then let the driver determine what degree of batching and asynchronous operation is appropriate. 
It might also be nice to have an inquiry for the cluster as to what batch size is most optimal for the cluster, like number of mutations in a batch and number of simultaneous connections, and to have that be dynamic based on overall cluster load. I would also note that the example in the spec has multiple inserts with different partition key values, which flies in the face of the admonition to to refrain from using server-side distribution of requests. At a minimum the CQL spec should make a more clear statement of intent and non-intent for BATCH. -- Jack Krupansky *From:* Jonathan Haddad j...@jonhaddad.com *Sent:* Friday, December 12, 2014 12:58 PM *To:* user@cassandra.apache.org ; Ryan Svihla rsvi...@datastax.com *Subject:* Re: batch_size_warn_threshold_in_kb The really important thing to really take away from Ryan's original post is that batches are not there for performance. The only case I consider batches to be useful
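For readers following along without the poster's internal utilities, here is a minimal, driver-free sketch of the "group by first replica, then send each group asynchronously" idea from the snippet at the top of this message. Everything in it is a stand-in and an assumption: `Stmt`, `firstReplica`, and `sendBatch` are hypothetical names, the hash-based placement is a toy, and a real client would ask the driver's cluster metadata for replicas and call the driver's async execute instead.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object ReplicaGrouping {
  // Hypothetical stand-in for a bound statement; only the partition key matters here.
  final case class Stmt(partitionKey: String, payload: String)

  // Assumption: toy placement. A real client would look up replicas via driver metadata.
  def firstReplica(key: String, nodes: Vector[String]): String =
    nodes(Math.floorMod(key.hashCode, nodes.size))

  // One group per node, so each group can be sent to a coordinator that is also a replica.
  def groupByFirstReplica(stmts: Seq[Stmt], nodes: Vector[String]): Map[String, Seq[Stmt]] =
    stmts.groupBy(s => firstReplica(s.partitionKey, nodes))

  // Stand-in for "build a batch and execute it asynchronously"; yields the statement count.
  def sendBatch(node: String, batch: Seq[Stmt])(implicit ec: ExecutionContext): Future[Int] =
    Future(batch.size)

  // Send every group concurrently and report how many statements were written in total.
  def writeAll(stmts: Seq[Stmt], nodes: Vector[String])(implicit ec: ExecutionContext): Future[Int] =
    Future.traverse(groupByFirstReplica(stmts, nodes).toSeq) {
      case (node, batch) => sendBatch(node, batch)
    }.map(_.sum)

  // Small driver for the sketch: 30 statements spread over 3 hypothetical nodes.
  def demo(): Int = {
    import ExecutionContext.Implicits.global
    val nodes = Vector("node-a", "node-b", "node-c")
    val stmts = (1 to 30).map(i => Stmt(s"key-$i", "value"))
    Await.result(writeAll(stmts, nodes), 5.seconds)
  }
}
```

The number of groups is also the number of coordinators contacted, which is why this looks cheap at RF=N=3 and says little about a 20 or 50 node cluster.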
Re: batch_size_warn_threshold_in_kb
It's a rough observation and estimate, nothing more. In other words, some clusters can handle more and some can't. It depends on how many writes per second you're doing, cluster sizing, how far over that 5kb limit you are, heap size, disk IO, CPU speed, and many more factors. This is why it's just a warning and not an error, and it's something that's changeable. There is no one perfect answer here, but I can safely say that in practice, with today's hardware, I've not seen many clusters work well with more than 5kb writes.

On Fri, Dec 12, 2014 at 1:12 AM, Mohammed Guller moham...@glassbeam.com wrote:

Ryan,

Thanks for the quick response.

I did see that jira before posting my question on this list. However, I didn’t see any information about why 5kb+ data will cause instability. 5kb or even 50kb seems too small. For example, if each mutation is 1000+ bytes, then with just 5 mutations, you will hit that threshold.

In addition, Patrick is saying that he does not recommend more than 100 mutations per batch. So why not warn users just on the # of mutations in a batch?

Mohammed

*From:* Ryan Svihla [mailto:rsvi...@datastax.com]
*Sent:* Thursday, December 11, 2014 12:56 PM
*To:* user@cassandra.apache.org
*Subject:* Re: batch_size_warn_threshold_in_kb

Nothing magic, just put in there based on experience. You can find the story behind the original recommendation here: https://issues.apache.org/jira/browse/CASSANDRA-6487

Key reasoning for the desire comes from Patrick McFadden:

"Yes that was in bytes. Just in my own experience, I don't recommend more than ~100 mutations per batch. Doing some quick math I came up with 5k as 100 x 50 byte mutations. Totally up for debate."

It's totally changeable; however, it's there in no small part because so many people confuse the BATCH keyword as a performance optimization, and this helps flag those cases of misuse.

On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller moham...@glassbeam.com wrote:

Hi –

The cassandra.yaml file has a property called *batch_size_warn_threshold_in_kb*. The default size is 5kb and, according to the comments in the yaml file, it is used to log WARN on any batch size exceeding this value in kilobytes. It says caution should be taken on increasing the size of this threshold as it can lead to node instability.

Does anybody know the significance of this magic number 5kb? Why would a higher number (say 10kb) lead to node instability?

Mohammed

--
Ryan Svihla
Solution Architect
http://www.datastax.com/
https://twitter.com/foundev
http://www.linkedin.com/pub/ryan-svihla/12/621/727/

DataStax is the fastest, most scalable distributed database technology, delivering Apache Cassandra to the world’s most innovative enterprises. DataStax is built to be agile, always-on, and predictably scalable to any size. With more than 500 customers in 45 countries, DataStax is the database technology and transactional backbone of choice for the world's most innovative companies such as Netflix, Adobe, Intuit, and eBay.
Re: batch_size_warn_threshold_in_kb
Any insert, update, or delete.

On Fri, Dec 12, 2014 at 1:31 AM, Jens Rantil jens.ran...@tink.se wrote:

Maybe slightly off-topic, but what is a mutation? Is it equivalent to a CQL row? Or maybe a column in a row? Does it include tombstones within the selected range?

Thanks,
Jens

On Thu, Dec 11, 2014 at 9:56 PM, Ryan Svihla rsvi...@datastax.com wrote:

Nothing magic, just put in there based on experience. You can find the story behind the original recommendation here: https://issues.apache.org/jira/browse/CASSANDRA-6487

Key reasoning for the desire comes from Patrick McFadden:

"Yes that was in bytes. Just in my own experience, I don't recommend more than ~100 mutations per batch. Doing some quick math I came up with 5k as 100 x 50 byte mutations. Totally up for debate."

It's totally changeable; however, it's there in no small part because so many people confuse the BATCH keyword as a performance optimization, and this helps flag those cases of misuse.

On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller moham...@glassbeam.com wrote:

Hi –

The cassandra.yaml file has a property called *batch_size_warn_threshold_in_kb*. The default size is 5kb and, according to the comments in the yaml file, it is used to log WARN on any batch size exceeding this value in kilobytes. It says caution should be taken on increasing the size of this threshold as it can lead to node instability.

Does anybody know the significance of this magic number 5kb? Why would a higher number (say 10kb) lead to node instability?

Mohammed
Re: batch_size_warn_threshold_in_kb
The really important thing to take away from Ryan's original post is that batches are not there for performance. The only case I consider batches to be useful for is when you absolutely need to know that several tables all get a mutation (via logged batches). The use case for this is when you've got multiple tables that are serving as different views for data. It is absolutely not going to help you if you're trying to lump queries together to reduce network and server overhead - in fact it'll do the opposite. If you're trying to do that, instead perform many async queries. The overhead of batches in Cassandra is significant and you're going to hit a lot of problems if you use them excessively (timeouts / failures).

tl;dr: you probably don't want batch, you most likely want many async calls

On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller moham...@glassbeam.com wrote:

Ryan,

Thanks for the quick response.

I did see that jira before posting my question on this list. However, I didn’t see any information about why 5kb+ data will cause instability. 5kb or even 50kb seems too small. For example, if each mutation is 1000+ bytes, then with just 5 mutations, you will hit that threshold.

In addition, Patrick is saying that he does not recommend more than 100 mutations per batch. So why not warn users just on the # of mutations in a batch?

Mohammed

*From:* Ryan Svihla [mailto:rsvi...@datastax.com]
*Sent:* Thursday, December 11, 2014 12:56 PM
*To:* user@cassandra.apache.org
*Subject:* Re: batch_size_warn_threshold_in_kb

Nothing magic, just put in there based on experience. You can find the story behind the original recommendation here: https://issues.apache.org/jira/browse/CASSANDRA-6487

Key reasoning for the desire comes from Patrick McFadden:

"Yes that was in bytes. Just in my own experience, I don't recommend more than ~100 mutations per batch. Doing some quick math I came up with 5k as 100 x 50 byte mutations. Totally up for debate."

It's totally changeable; however, it's there in no small part because so many people confuse the BATCH keyword as a performance optimization, and this helps flag those cases of misuse.

On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller moham...@glassbeam.com wrote:

Hi –

The cassandra.yaml file has a property called *batch_size_warn_threshold_in_kb*. The default size is 5kb and, according to the comments in the yaml file, it is used to log WARN on any batch size exceeding this value in kilobytes. It says caution should be taken on increasing the size of this threshold as it can lead to node instability.

Does anybody know the significance of this magic number 5kb? Why would a higher number (say 10kb) lead to node instability?

Mohammed
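To make the "many async calls" suggestion above concrete, here is a rough sketch of issuing one request per statement with a cap on how many are in flight at once. It is only an illustration under assumptions: `writeAsync` is a hypothetical stand-in for the driver's async execute call, and the chunked throttling is one simple choice among several, not something prescribed in this thread.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object AsyncWrites {
  // Hypothetical stand-in for an async driver call; echoes back the key it "wrote".
  def writeAsync(key: String)(implicit ec: ExecutionContext): Future[String] =
    Future(key)

  // One request per key, with at most maxInFlight requests outstanding at a time.
  // Chunks run one after another; requests inside a chunk run concurrently.
  def writeAll(keys: Seq[String], maxInFlight: Int)(implicit ec: ExecutionContext): Future[Seq[String]] = {
    require(maxInFlight > 0, "maxInFlight must be positive")
    keys.grouped(maxInFlight).foldLeft(Future.successful(Vector.empty[String])) { (done, chunk) =>
      // The next chunk is only started once the previous chunk has completed.
      done.flatMap(acc => Future.traverse(chunk)(k => writeAsync(k)).map(acc ++ _))
    }
  }

  // Small driver for the sketch: 10 writes, 3 in flight at a time.
  def demo(): Seq[String] = {
    import ExecutionContext.Implicits.global
    val keys = (1 to 10).map(i => s"key-$i")
    Await.result(writeAll(keys, maxInFlight = 3), 5.seconds)
  }
}
```

Each request goes out on its own, so with a token-aware policy each one can land on a replica for its own partition, and no single coordinator has to wait on the whole set.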
Re: batch_size_warn_threshold_in_kb
Nothing magic, just put in there based on experience. You can find the story behind the original recommendation here: https://issues.apache.org/jira/browse/CASSANDRA-6487

Key reasoning for the desire comes from Patrick McFadden:

"Yes that was in bytes. Just in my own experience, I don't recommend more than ~100 mutations per batch. Doing some quick math I came up with 5k as 100 x 50 byte mutations. Totally up for debate."

It's totally changeable; however, it's there in no small part because so many people confuse the BATCH keyword as a performance optimization, and this helps flag those cases of misuse.

On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller moham...@glassbeam.com wrote:

Hi –

The cassandra.yaml file has a property called *batch_size_warn_threshold_in_kb*. The default size is 5kb and, according to the comments in the yaml file, it is used to log WARN on any batch size exceeding this value in kilobytes. It says caution should be taken on increasing the size of this threshold as it can lead to node instability.

Does anybody know the significance of this magic number 5kb? Why would a higher number (say 10kb) lead to node instability?

Mohammed
RE: batch_size_warn_threshold_in_kb
Ryan,

Thanks for the quick response.

I did see that jira before posting my question on this list. However, I didn’t see any information about why 5kb+ data will cause instability. 5kb or even 50kb seems too small. For example, if each mutation is 1000+ bytes, then with just 5 mutations, you will hit that threshold.

In addition, Patrick is saying that he does not recommend more than 100 mutations per batch. So why not warn users just on the # of mutations in a batch?

Mohammed

From: Ryan Svihla [mailto:rsvi...@datastax.com]
Sent: Thursday, December 11, 2014 12:56 PM
To: user@cassandra.apache.org
Subject: Re: batch_size_warn_threshold_in_kb

Nothing magic, just put in there based on experience. You can find the story behind the original recommendation here: https://issues.apache.org/jira/browse/CASSANDRA-6487

Key reasoning for the desire comes from Patrick McFadden:

"Yes that was in bytes. Just in my own experience, I don't recommend more than ~100 mutations per batch. Doing some quick math I came up with 5k as 100 x 50 byte mutations. Totally up for debate."

It's totally changeable; however, it's there in no small part because so many people confuse the BATCH keyword as a performance optimization, and this helps flag those cases of misuse.

On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller moham...@glassbeam.com wrote:

Hi –

The cassandra.yaml file has a property called batch_size_warn_threshold_in_kb. The default size is 5kb and, according to the comments in the yaml file, it is used to log WARN on any batch size exceeding this value in kilobytes. It says caution should be taken on increasing the size of this threshold as it can lead to node instability.

Does anybody know the significance of this magic number 5kb? Why would a higher number (say 10kb) lead to node instability?

Mohammed
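The arithmetic being debated here is easy to check by hand. The sketch below puts the size-based warning next to the count-based alternative raised above. It assumes the threshold is compared in bytes with 1 kb = 1024 bytes and that only batches strictly over the threshold are flagged; both are assumptions for illustration, and `BatchWarn` and its methods are hypothetical names, not Cassandra code.

```scala
object BatchWarn {
  // Assumption: the configured kb value is compared in bytes, with 1 kb = 1024 bytes.
  def thresholdBytes(thresholdKb: Int): Long = thresholdKb.toLong * 1024

  // Total payload of a batch, given the serialized size of each mutation in bytes.
  def batchBytes(mutationSizes: Seq[Int]): Long = mutationSizes.map(_.toLong).sum

  // Size-based check, in the spirit of batch_size_warn_threshold_in_kb (default 5).
  def exceedsSize(mutationSizes: Seq[Int], thresholdKb: Int = 5): Boolean =
    batchBytes(mutationSizes) > thresholdBytes(thresholdKb)

  // Hypothetical count-based check: warn on the number of mutations instead.
  def exceedsCount(mutationSizes: Seq[Int], maxMutations: Int = 100): Boolean =
    mutationSizes.size > maxMutations
}
```

With these definitions, 100 mutations of 50 bytes add up to 5000 bytes, which sits just under 5 x 1024 = 5120. Five mutations of 1100 bytes add up to 5500 bytes and trip the size check while staying far below a 100-mutation count limit, which is the mismatch pointed out in the message above.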
Re: batch_size_warn_threshold_in_kb
Maybe slightly off-topic, but what is a mutation? Is it equivalent to a CQL row? Or maybe a column in a row? Does it include tombstones within the selected range?

Thanks,
Jens

On Thu, Dec 11, 2014 at 9:56 PM, Ryan Svihla rsvi...@datastax.com wrote:

Nothing magic, just put in there based on experience. You can find the story behind the original recommendation here: https://issues.apache.org/jira/browse/CASSANDRA-6487

Key reasoning for the desire comes from Patrick McFadden:

"Yes that was in bytes. Just in my own experience, I don't recommend more than ~100 mutations per batch. Doing some quick math I came up with 5k as 100 x 50 byte mutations. Totally up for debate."

It's totally changeable; however, it's there in no small part because so many people confuse the BATCH keyword as a performance optimization, and this helps flag those cases of misuse.

On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller moham...@glassbeam.com wrote:

Hi –

The cassandra.yaml file has a property called *batch_size_warn_threshold_in_kb*. The default size is 5kb and, according to the comments in the yaml file, it is used to log WARN on any batch size exceeding this value in kilobytes. It says caution should be taken on increasing the size of this threshold as it can lead to node instability.

Does anybody know the significance of this magic number 5kb? Why would a higher number (say 10kb) lead to node instability?

Mohammed