[jira] [Commented] (CASSANDRA-13851) Allow existing nodes to use all peers in shadow round

2022-07-29 Thread Daniel Cranford (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-13851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17573088#comment-17573088
 ] 

Daniel Cranford commented on CASSANDRA-13851:
-

[~brandon.williams] Ticket added; cf. CASSANDRA-17786.

> Allow existing nodes to use all peers in shadow round
> -
>
> Key: CASSANDRA-13851
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13851
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Startup and Shutdown
>Reporter: Kurt Greaves
>Assignee: Kurt Greaves
>Priority: Normal
> Fix For: 3.11.3, 4.0-alpha1, 4.0
>
>
> In CASSANDRA-10134 we made collision checks necessary on every startup. A 
> side effect was introduced that then requires a node's seeds to be contacted 
> on every startup. Prior to this change an existing node could start up 
> regardless of whether it could contact a seed node or not (because 
> checkForEndpointCollision() was only called for bootstrapping nodes). 
> Now if a node's seeds are removed, deleted, or fail, it will no longer be able to 
> start up until live seeds are configured (or it is itself made a seed), even 
> though it already knows about the rest of the ring. This is inconvenient for 
> operators and has the potential to cause some nasty surprises and increase 
> downtime.
> One solution would be to use all of a node's existing peers as seeds in the 
> shadow round. Not being a Gossip guru, though, I'm not sure of the implications.






[jira] [Created] (CASSANDRA-17786) Seed docs are out of date

2022-07-29 Thread Daniel Cranford (Jira)
Daniel Cranford created CASSANDRA-17786:
---

 Summary: Seed docs are out of date
 Key: CASSANDRA-17786
 URL: https://issues.apache.org/jira/browse/CASSANDRA-17786
 Project: Cassandra
  Issue Type: Bug
Reporter: Daniel Cranford


The 
[FAQ|https://cassandra.apache.org/doc/latest/cassandra/faq/index.html#are-seeds-SPOF]
 states
{quote}
The ring can operate or boot without a seed
{quote}

This has not been true since Cassandra 3.6, when CASSANDRA-10134 required nodes 
to complete a "shadow" gossip round or to specify the undocumented 
{{cassandra.allow_unsafe_join}} property. AFAICT this "shadow" round is not 
documented anywhere outside the code implementing it.
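
For reference, JVM system properties like the one above are typically supplied as {{-D}} flags at startup, along these lines (the exact invocation varies by install; the property name is as quoted above):
{noformat}
bin/cassandra -Dcassandra.allow_unsafe_join=true
{noformat}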

CASSANDRA-13851 improved things by allowing peers that are not themselves 
booting to release a booting node from the shadow round so it can complete 
startup. However, this still means a node that is booting must contact a seed or 
a peer that is not itself booting in order to start, making seeds more crucial to 
booting than the docs imply. 

In particular, a full cluster bounce is not supported when there are no 
reachable seeds since the non-seed peers required to release a node from the 
shadow round will themselves be trapped in the shadow round.






[jira] [Commented] (CASSANDRA-13851) Allow existing nodes to use all peers in shadow round

2022-07-22 Thread Daniel Cranford (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-13851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570152#comment-17570152
 ] 

Daniel Cranford commented on CASSANDRA-13851:
-

I hope it is clear that I/we don't mind that the behavior of seeds has changed, 
but rather that this behavior change was not made public (it is "hidden" in 
source code and bug trackers), and that it took us a significant amount of 
skilled man-hours to track down what had changed, why, and what our potential 
workarounds were. Just a simple update to the seed docs would have helped us 
immensely.

> Allow existing nodes to use all peers in shadow round
> -
>
> Key: CASSANDRA-13851
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13851
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Startup and Shutdown
>Reporter: Kurt Greaves
>Assignee: Kurt Greaves
>Priority: Normal
> Fix For: 3.11.3, 4.0-alpha1, 4.0
>
>
> In CASSANDRA-10134 we made collision checks necessary on every startup. A 
> side effect was introduced that then requires a node's seeds to be contacted 
> on every startup. Prior to this change an existing node could start up 
> regardless of whether it could contact a seed node or not (because 
> checkForEndpointCollision() was only called for bootstrapping nodes). 
> Now if a node's seeds are removed, deleted, or fail, it will no longer be able to 
> start up until live seeds are configured (or it is itself made a seed), even 
> though it already knows about the rest of the ring. This is inconvenient for 
> operators and has the potential to cause some nasty surprises and increase 
> downtime.
> One solution would be to use all of a node's existing peers as seeds in the 
> shadow round. Not being a Gossip guru, though, I'm not sure of the implications.






[jira] [Commented] (CASSANDRA-13851) Allow existing nodes to use all peers in shadow round

2022-07-22 Thread Daniel Cranford (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-13851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570118#comment-17570118
 ] 

Daniel Cranford commented on CASSANDRA-13851:
-

Respectfully, the shadow round isn't documented at all (outside the source 
code) and gossip is barely documented. My ops guys are going to see `unable to 
gossip with peers` and assume there's a network issue preventing the node from 
talking to any of its peers, not that "all my peers are also stuck in this 
undocumented thing called the shadow round".

> Allow existing nodes to use all peers in shadow round
> -
>
> Key: CASSANDRA-13851
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13851
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Startup and Shutdown
>Reporter: Kurt Greaves
>Assignee: Kurt Greaves
>Priority: Normal
> Fix For: 3.11.3, 4.0-alpha1, 4.0
>
>
> In CASSANDRA-10134 we made collision checks necessary on every startup. A 
> side effect was introduced that then requires a node's seeds to be contacted 
> on every startup. Prior to this change an existing node could start up 
> regardless of whether it could contact a seed node or not (because 
> checkForEndpointCollision() was only called for bootstrapping nodes). 
> Now if a node's seeds are removed, deleted, or fail, it will no longer be able to 
> start up until live seeds are configured (or it is itself made a seed), even 
> though it already knows about the rest of the ring. This is inconvenient for 
> operators and has the potential to cause some nasty surprises and increase 
> downtime.
> One solution would be to use all of a node's existing peers as seeds in the 
> shadow round. Not being a Gossip guru, though, I'm not sure of the implications.






[jira] [Commented] (CASSANDRA-13851) Allow existing nodes to use all peers in shadow round

2022-07-22 Thread Daniel Cranford (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-13851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570067#comment-17570067
 ] 

Daniel Cranford commented on CASSANDRA-13851:
-

[~samt], sorry, I misspoke. I appreciate the material improvement in behavior 
this ticket has provided. What I intended to say was
{quote}A node will not start unless it can contact a seed node *or* another 
node not also performing the shadow round{quote}

Background: my operations guys routinely perform a full cluster bounce to 
ensure everything is starting from a clean state. Up until Cassandra 3.6 this 
worked fine. Unfortunately, due to the details of our hardware, sometimes nodes 
take longer to come up than usual (eg 5 minutes instead of 30 seconds). If the 
slow nodes happen to be the seed node/nodes, it is game over - the cluster will 
not start.

The only way my ops guys were able to resolve this was to give me the stack 
trace of the error, which I had to correlate with the source code, using `git 
blame` to find CASSANDRA-10134 and this ticket. I would not consider a bug 
tracker to be appropriate documentation for the semantics of a seed node, 
especially when the public docs state
{quote}The ring can operate or boot without a seed; however, you will not be 
able to add new nodes to the cluster.{quote}

My ops guys have worked around this behavior by begrudgingly setting 
`cassandra.allow_unsafe_joins=true` - an undocumented workaround I found by 
inspecting the source code. After we upgraded from 3.9 to 3.11, I was eager to 
see if this ticket allowed us to remove the workaround. Unfortunately it does 
not: a full cluster bounce will still fail, because only seed nodes and nodes 
not themselves in the shadow round can release a node from the shadow round.

If anything, the error message in this version is worse, since it is now 
incorrect. 
{code:java}
if (!isSeed)
    throw new RuntimeException("Unable to gossip with any peers");
{code}

Actually, the node was unable to gossip with any seeds or with any peers not 
themselves in the shadow round; peers may be alive but trapped in the shadow 
round themselves.
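
For illustration only, wording along these lines would have pointed us in the right direction (hypothetical message text, not a proposed patch):
{code:java}
if (!isSeed)
    throw new RuntimeException("Unable to gossip with any seeds or with any peers that have completed their own shadow round");
{code}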

> Allow existing nodes to use all peers in shadow round
> -
>
> Key: CASSANDRA-13851
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13851
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Startup and Shutdown
>Reporter: Kurt Greaves
>Assignee: Kurt Greaves
>Priority: Normal
> Fix For: 3.11.3, 4.0-alpha1, 4.0
>
>
> In CASSANDRA-10134 we made collision checks necessary on every startup. A 
> side effect was introduced that then requires a node's seeds to be contacted 
> on every startup. Prior to this change an existing node could start up 
> regardless of whether it could contact a seed node or not (because 
> checkForEndpointCollision() was only called for bootstrapping nodes). 
> Now if a node's seeds are removed, deleted, or fail, it will no longer be able to 
> start up until live seeds are configured (or it is itself made a seed), even 
> though it already knows about the rest of the ring. This is inconvenient for 
> operators and has the potential to cause some nasty surprises and increase 
> downtime.
> One solution would be to use all of a node's existing peers as seeds in the 
> shadow round. Not being a Gossip guru, though, I'm not sure of the implications.






[jira] [Commented] (CASSANDRA-13851) Allow existing nodes to use all peers in shadow round

2022-07-21 Thread Daniel Cranford (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-13851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569515#comment-17569515
 ] 

Daniel Cranford commented on CASSANDRA-13851:
-

This is still an *undocumented* regression in the definition of a "seed" node. 
A node *will not start* unless it can contact at least one seed node, a detail 
that still hasn't made it into the 
[documentation|https://cassandra.apache.org/doc/latest/cassandra/faq/index.html#what-are-seeds].

> Allow existing nodes to use all peers in shadow round
> -
>
> Key: CASSANDRA-13851
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13851
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Startup and Shutdown
>Reporter: Kurt Greaves
>Assignee: Kurt Greaves
>Priority: Normal
> Fix For: 3.11.3, 4.0-alpha1, 4.0
>
>
> In CASSANDRA-10134 we made collision checks necessary on every startup. A 
> side effect was introduced that then requires a node's seeds to be contacted 
> on every startup. Prior to this change an existing node could start up 
> regardless of whether it could contact a seed node or not (because 
> checkForEndpointCollision() was only called for bootstrapping nodes). 
> Now if a node's seeds are removed, deleted, or fail, it will no longer be able to 
> start up until live seeds are configured (or it is itself made a seed), even 
> though it already knows about the rest of the ring. This is inconvenient for 
> operators and has the potential to cause some nasty surprises and increase 
> downtime.
> One solution would be to use all of a node's existing peers as seeds in the 
> shadow round. Not being a Gossip guru, though, I'm not sure of the implications.






[jira] [Comment Edited] (CASSANDRA-17237) Pathalogical interaction between Cassandra and readahead, particularly on Centos 7 VMs

2022-01-05 Thread Daniel Cranford (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-17237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17469355#comment-17469355
 ] 

Daniel Cranford edited comment on CASSANDRA-17237 at 1/5/22, 2:49 PM:
--

{quote}since mmap was clearly the superior mode, and likely still is with sane 
readahead settings{quote}

Perhaps. In our testing, standard IO has a 10% performance advantage over mmap 
with sane readahead values. Of course, a lot of this is going to boil down to 
hardware specifics, e.g. what the IO seek penalty and bandwidth are, and what 
the syscall latency is versus the page fault latency. It certainly doesn't help 
mmap that half the IO bandwidth is wasted compared to standard IO and that 
there is no way to issue smaller reads for the index file.

It is worth noting that Linus is on record saying mmap is [not necessarily 
always a win|https://marc.info/?l=linux-kernel=95496636207616=2]. The TLB 
miss and page fault mechanism can be more expensive than people realize.

I've looked through the Cassandra git history, and it appears that the standard 
IO path was unfairly penalized by suboptimal behavior, which may have explained 
some of the observed benefit of mmap (e.g. 
https://issues.apache.org/jira/browse/CASSANDRA-8894).

I'm certainly not arguing that standard IO should be the default. But since it 
really is faster in our tests (with sane readahead values), perhaps it should 
still be a documented tunable.


was (Author: daniel.cranford):
{quote}since mmap was clearly the superior mode, and likely still is with sane 
readahead settings{quote}

Perhaps. In our testing, standard IO has a 10% performance advantage over mmap 
with sane readahead values. Of course, a lot of this is going to boil down to 
hardware specifics, eg what is the IO seek penalty and bandwidth, what is the 
syscall latency vs page fault latency. It certainly doesn't help mmap that half 
the IO bandwidth is wasted compared to standard IO and there is no way to issue 
smaller reads for the index file.

It is worth noting that Linus is on record saying mmap is [not necessarily 
always a win|https://marc.info/?l=linux-kernel=95496636207616=2] The TLB 
miss and page fault mechanism can be more expensive than people realize.

I've looked through the Cassandra git history, and there is the appearance that 
the standard IO path was unfairly penalized by suboptimal behavior which may 
have explained some of the observed benefit to mmap.

I'm certainly not arguing that standard IO should be the default. But since it 
really is faster in our tests (with sane readahead values), perhaps it should 
still be a documented tunable.

> Pathalogical interaction between Cassandra and readahead, particularly on 
> Centos 7 VMs
> --
>
> Key: CASSANDRA-17237
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17237
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Config
>Reporter: Daniel Cranford
>Priority: Normal
> Fix For: 4.x
>
>
> Cassandra defaults to using mmap for IO, except on 32 bit systems. The config 
> value `disk_access_mode` that controls this isn't even included in or 
> documented in cassandra.yml.
> While this may be a reasonable default config for Cassandra, we've noticed a 
> pathalogical interplay between the way Linux implements readahead for mmap, 
> and Cassandra's IO patterns, particularly on vanilla Centos 7 VMs.
> A read that misses all levels of cache in Cassandra is (typically) going to 
> involve 2 IOs: 1 into the index file and one into the data file. These IOs 
> will both be effectively random given the nature the mummer3 hash partitioner.
> The amount of data read from the index file IO will be relatively small, 
> perhaps 4-8kb, compared to the data file IO which (assuming the entire 
> partition fits in a single compressed chunk and a compression ratio of 1/2) 
> will require 32kb.
> However, applications using `mmap()` have no way to tell the OS the desired 
> IO size - they can only tell the OS the desired IO location - by reading from 
> the mapped address and triggering a page fault. This is unlike `read()` where 
> the application provides both the size and location to the OS. So for 
> `mmap()` the OS has to guess how large the IO submitted to the backing device 
> should be and whether the application is performing sequential or random IO 
> unless the application provides hints (eg `fadvise()`, `madvise()`, 
> `readahead()`).
> This is how Linux determines the size of IO for mmap during a page fault:
>  * Outside of hints (eg FADV_RANDOM) default IO size is maximum readahead 
> value with the faulting address in the middle of the IO, eg IO requested for 
> range [fault_addr - max_readahead / 2, fault_addr + 

[jira] [Commented] (CASSANDRA-17237) Pathalogical interaction between Cassandra and readahead, particularly on Centos 7 VMs

2022-01-05 Thread Daniel Cranford (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-17237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17469355#comment-17469355
 ] 

Daniel Cranford commented on CASSANDRA-17237:
-

{quote}since mmap was clearly the superior mode, and likely still is with sane 
readahead settings{quote}

Perhaps. In our testing, standard IO has a 10% performance advantage over mmap 
with sane readahead values. Of course, a lot of this is going to boil down to 
hardware specifics, e.g. what the IO seek penalty and bandwidth are, and what 
the syscall latency is versus the page fault latency. It certainly doesn't help 
mmap that half the IO bandwidth is wasted compared to standard IO and that 
there is no way to issue smaller reads for the index file.

It is worth noting that Linus is on record saying mmap is [not necessarily 
always a win|https://marc.info/?l=linux-kernel=95496636207616=2]. The TLB 
miss and page fault mechanism can be more expensive than people realize.

I've looked through the Cassandra git history, and it appears that the standard 
IO path was unfairly penalized by suboptimal behavior, which may have explained 
some of the observed benefit of mmap.

I'm certainly not arguing that standard IO should be the default. But since it 
really is faster in our tests (with sane readahead values), perhaps it should 
still be a documented tunable.

> Pathalogical interaction between Cassandra and readahead, particularly on 
> Centos 7 VMs
> --
>
> Key: CASSANDRA-17237
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17237
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Config
>Reporter: Daniel Cranford
>Priority: Normal
> Fix For: 4.x
>
>
> Cassandra defaults to using mmap for IO, except on 32 bit systems. The config 
> value `disk_access_mode` that controls this isn't even included in or 
> documented in cassandra.yml.
> While this may be a reasonable default config for Cassandra, we've noticed a 
> pathalogical interplay between the way Linux implements readahead for mmap, 
> and Cassandra's IO patterns, particularly on vanilla Centos 7 VMs.
> A read that misses all levels of cache in Cassandra is (typically) going to 
> involve 2 IOs: 1 into the index file and one into the data file. These IOs 
> will both be effectively random given the nature the mummer3 hash partitioner.
> The amount of data read from the index file IO will be relatively small, 
> perhaps 4-8kb, compared to the data file IO which (assuming the entire 
> partition fits in a single compressed chunk and a compression ratio of 1/2) 
> will require 32kb.
> However, applications using `mmap()` have no way to tell the OS the desired 
> IO size - they can only tell the OS the desired IO location - by reading from 
> the mapped address and triggering a page fault. This is unlike `read()` where 
> the application provides both the size and location to the OS. So for 
> `mmap()` the OS has to guess how large the IO submitted to the backing device 
> should be and whether the application is performing sequential or random IO 
> unless the application provides hints (eg `fadvise()`, `madvise()`, 
> `readahead()`).
> This is how Linux determines the size of IO for mmap during a page fault:
>  * Outside of hints (eg FADV_RANDOM) default IO size is maximum readahead 
> value with the faulting address in the middle of the IO, eg IO requested for 
> range [fault_addr - max_readahead / 2, fault_addr + max_readahead / 2] This 
> is sometimes referred to as "read around" (ie read around the faulting 
> address). See 
> [here|https://github.com/torvalds/linux/blob/0c941cf30b913d4a684d3f8d9eee60e0daffdc79/mm/filemap.c#L2989]
>   * The kernel maintains a cache miss counter for the file. Every time the 
> kernel submits an IO for a page fault, this counts as a miss. Every time the 
> application faults in a page that is already in the pages cache (presumably 
> from a previous page fault's IO) is a cache hit and decrements the counter. 
> If the miss counter exceeds a threshold, the kernel stops inflating the IOs 
> to the max readahead and falls back to reading a *single* 4k page for each 
> page fault. See summary 
> [here|https://www.quora.com/What-heuristics-does-the-adaptive-readahead-implementation-in-the-Linux-kernel-use/answer/Robert-Love-1]
>  and implementation 
> [here|https://github.com/torvalds/linux/blob/0c941cf30b913d4a684d3f8d9eee60e0daffdc79/mm/filemap.c#L2955]
>  and 
> [here|https://github.com/torvalds/linux/blob/0c941cf30b913d4a684d3f8d9eee60e0daffdc79/mm/filemap.c#L3005]
>   * This means an application that, on average, references more than one 4k 
> page around the initial page fault will consistently have page fault IOs 
> inflated to the maximum readahead value. Note, there 

[jira] [Updated] (CASSANDRA-17237) Pathalogical interaction between Cassandra and readahead, particularly on Centos 7 VMs

2022-01-04 Thread Daniel Cranford (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-17237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Cranford updated CASSANDRA-17237:

Description: 
Cassandra defaults to using mmap for IO, except on 32-bit systems. The config 
value `disk_access_mode` that controls this isn't even included or documented 
in cassandra.yaml.

While this may be a reasonable default config for Cassandra, we've noticed a 
pathological interplay between the way Linux implements readahead for mmap and 
Cassandra's IO patterns, particularly on vanilla Centos 7 VMs.

A read that misses all levels of cache in Cassandra is (typically) going to 
involve two IOs: one into the index file and one into the data file. These IOs 
will both be effectively random given the nature of the Murmur3 hash partitioner.

The amount of data read by the index file IO will be relatively small, perhaps 
4-8kb, compared to the data file IO, which (assuming the entire partition fits 
in a single compressed chunk and a compression ratio of 1/2) will require 32kb.

However, applications using `mmap()` have no way to tell the OS the desired IO 
size - they can only tell the OS the desired IO location, by reading from the 
mapped address and triggering a page fault. This is unlike `read()`, where the 
application provides both the size and the location to the OS. So for `mmap()` 
the OS has to guess how large the IO submitted to the backing device should be, 
and whether the application is performing sequential or random IO, unless the 
application provides hints (e.g. `fadvise()`, `madvise()`, `readahead()`).
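
To illustrate the asymmetry with plain Java NIO (a sketch, not Cassandra code; the path, offset, and sizes are made up):
{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ReadVsMmap
{
    public static void main(String[] args) throws IOException
    {
        long offset = 4096; // hypothetical position of the entry we want

        try (FileChannel ch = FileChannel.open(Paths.get("/tmp/example-Index.db"),
                                               StandardOpenOption.READ))
        {
            // Standard IO: the application states both the location and the size,
            // so the kernel can issue an appropriately small request.
            ByteBuffer buf = ByteBuffer.allocate(8 * 1024);
            ch.read(buf, offset);

            // mmap: the application can only express a location, by touching an
            // address; the kernel has to guess the IO size (readahead / read-around).
            long mapLen = Math.min(ch.size(), 1 << 20);
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, mapLen);
            byte first = map.get((int) offset); // the page fault triggers a kernel-sized IO

            System.out.println("read() returned " + buf.position() + " bytes; first mapped byte: " + first);
        }
    }
}
{code}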

This is how Linux determines the size of IO for mmap during a page fault:
 * Outside of hints (e.g. FADV_RANDOM), the default IO size is the maximum 
readahead value, with the faulting address in the middle of the IO; i.e. IO is 
requested for the range [fault_addr - max_readahead / 2, fault_addr + 
max_readahead / 2]. This is sometimes referred to as "read around" (i.e. reading 
around the faulting address). See 
[here|https://github.com/torvalds/linux/blob/0c941cf30b913d4a684d3f8d9eee60e0daffdc79/mm/filemap.c#L2989]
  * The kernel maintains a cache miss counter for the file. Every time the 
kernel submits an IO for a page fault, this counts as a miss. Every time the 
application faults in a page that is already in the page cache (presumably 
from a previous page fault's IO), that counts as a cache hit and decrements the 
counter. If the miss counter exceeds a threshold, the kernel stops inflating 
the IOs to the max readahead and falls back to reading a *single* 4k page for 
each page fault. 
See summary 
[here|https://www.quora.com/What-heuristics-does-the-adaptive-readahead-implementation-in-the-Linux-kernel-use/answer/Robert-Love-1]
 and implementation 
[here|https://github.com/torvalds/linux/blob/0c941cf30b913d4a684d3f8d9eee60e0daffdc79/mm/filemap.c#L2955]
 and 
[here|https://github.com/torvalds/linux/blob/0c941cf30b913d4a684d3f8d9eee60e0daffdc79/mm/filemap.c#L3005]
  * This means an application that, on average, references more than one 4k 
page around the initial page fault will consistently have its page fault IOs 
inflated to the maximum readahead value. Note that there is no ramping up of a 
window the way there is with standard IO: as far as I can tell, the kernel only 
submits IOs of one page or of max_readahead (see the arithmetic sketch below).
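
To put numbers on the read-around behaviour described above (an arithmetic sketch; the fault offset and readahead value are assumed, and the page alignment is a simplification):
{code:java}
public class ReadAround
{
    public static void main(String[] args)
    {
        long pageSize = 4096;
        long maxReadahead = 64 * 1024;     // e.g. the recommended 64kb setting
        long faultAddr = 1_234_567_890L;   // hypothetical faulting file offset

        // "Read around": a max_readahead-sized request centred on the fault.
        long start = (faultAddr - maxReadahead / 2) & ~(pageSize - 1);
        long end = start + maxReadahead;

        System.out.printf("kernel requests [%d, %d) = %d kb for an index read that may only need 4-8 kb%n",
                          start, end, maxReadahead / 1024);
    }
}
{code}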

Observations:
* mmap'ed IO on Linux wastes half the IO bandwidth. This may or may not be a big 
deal depending on your setup.
* Cassandra will always have IOs inflated to the maximum readahead, because more 
than one page is referenced for the data file and (depending on the size and 
cardinality of your keys) more than one page is referenced from the index file.
* The device's readahead is a crude system-wide knob for controlling IO size. 
Cassandra cannot perform smaller IOs for the index file (unless your keyset is 
such that only one page from the index file needs to be referenced).

Centos 7 VMs:
* The default readahead for Centos 7 VMs is 4MB (as opposed to the default 
readahead for non-VM Centos 7 which is 128kb).
* Even though this is reduced by the kernel (cf `max_sane_readahead()`) to 
something around 450k, it is still far too large for an average Cassandra read.
* Even once this readahead is reduced to the recommended 64kb (see the example 
commands below), standard IO still has a 10% performance advantage in our tests, 
likely because the readahead algorithm for standard IO is more flexible and 
converges on smaller reads from the index file and larger reads from the data 
file.
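
For reference, the device readahead discussed above can be inspected and changed with blockdev; values are in 512-byte sectors, so 128 sectors is 64kb (the device name is illustrative):
{noformat}
blockdev --getra /dev/sda
blockdev --setra 128 /dev/sda
{noformat}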

  was:
Cassandra defaults to using mmap for IO, except on 32 bit systems. The config 
value `disk_access_mode` that controls this isn't even included in or 
documented in cassandra.yml.

While this may be a reasonable default config for Cassandra, we've noticed a 
pathalogical interplay between the way Linux implements readahead for mmap, and 
Cassandra's IO patterns, particularly on vanilla Centos 7 VMs.

A read that misses all levels of cache in Cassandra is 

[jira] [Updated] (CASSANDRA-17237) Pathalogical interaction between Cassandra and readahead, particularly on Centos 7 VMs

2022-01-04 Thread Daniel Cranford (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-17237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Cranford updated CASSANDRA-17237:

Description: 
Cassandra defaults to using mmap for IO, except on 32-bit systems. The config 
value `disk_access_mode` that controls this isn't even included or documented 
in cassandra.yaml.

While this may be a reasonable default config for Cassandra, we've noticed a 
pathological interplay between the way Linux implements readahead for mmap and 
Cassandra's IO patterns, particularly on vanilla Centos 7 VMs.

A read that misses all levels of cache in Cassandra is (typically) going to 
involve two IOs: one into the index file and one into the data file. These IOs 
will both be effectively random given the nature of the Murmur3 hash partitioner.

The amount of data read by the index file IO will be relatively small, perhaps 
4-8kb, compared to the data file IO, which (assuming the entire partition fits 
in a single compressed chunk and a compression ratio of 1/2) will require 32kb.

However, applications using `mmap()` have no way to tell the OS the desired IO 
size - they can only tell the OS the desired IO location - by reading from the 
mapped address and triggering a page fault. This is unlike `read()` where the 
application provides both the size and location to the OS. So for `mmap()` the 
OS has to guess how large the IO submitted to the backing device should be and 
whether the application is performing sequential or random IO unless the 
application provides hints (eg `fadvise()`, `madvise()`, `readahead()`).

This is how Linux determines the size of IO for mmap during a page fault:
 * Outside of hints (eg FADV_RANDOM) default IO size is maximum readahead value 
with the faulting address in the middle of the IO, eg IO requested for range 
[fault_addr - max_readahead / 2, fault_addr + max_readahead / 2] This is 
sometimes referred to as "read around" (ie read around the faulting address). 
See 
[here](https://github.com/torvalds/linux/blob/0c941cf30b913d4a684d3f8d9eee60e0daffdc79/mm/filemap.c#L2989)
  * The kernel maintains a cache miss counter for the file. Every time the 
kernel submits an IO for a page fault, this counts as a miss. Every time the 
application faults in a page that is already in the page cache (presumably 
from a previous page fault's IO), that counts as a cache hit and decrements the 
counter. If the miss counter exceeds a threshold, the kernel stops inflating 
the IOs to the max readahead and falls back to reading a *single* 4k page for 
each page fault. 
See summary 
[here|https://www.quora.com/What-heuristics-does-the-adaptive-readahead-implementation-in-the-Linux-kernel-use/answer/Robert-Love-1]
 and implementation 
[here|https://github.com/torvalds/linux/blob/0c941cf30b913d4a684d3f8d9eee60e0daffdc79/mm/filemap.c#L2955]
 and 
[here|https://github.com/torvalds/linux/blob/0c941cf30b913d4a684d3f8d9eee60e0daffdc79/mm/filemap.c#L3005]
  * This means an application that, on average, references more than one 4k 
page around the initial page fault will consistently have its page fault IOs 
inflated to the maximum readahead value. Note that there is no ramping up of a 
window the way there is with standard IO: as far as I can tell, the kernel only 
submits IOs of one page or of max_readahead.

Observations:
* mmap'ed IO on Linux wastes half the IO bandwidth. This may or may not be a big 
deal depending on your setup.
* Cassandra will always have IOs inflated to the maximum readahead, because more 
than one page is referenced for the data file and (depending on the size and 
cardinality of your keys) more than one page is referenced from the index file.
* The device's readahead is a crude system-wide knob for controlling IO size. 
Cassandra cannot perform smaller IOs for the index file (unless your keyset is 
such that only one page from the index file needs to be referenced).

Centos 7 VMs:
* The default readahead for Centos 7 VMs is 4MB (as opposed to the default 
readahead for non-VM Centos 7 which is 128kb).
* Even though this is reduced by the kernel (cf `max_sane_readahead()`) to 
something around 450k, it is still far too large for an average Cassandra read.
* Even once this readahead is reduced to the recommended 64kb, standard IO 
still has a 10% performance advantage in our tests, likely because the 
readahead algorithm for standard IO is more flexible and converges on smaller 
reads from the index file and larger reads from the data file.

  was:
Cassandra defaults to using mmap for IO, except on 32 bit systems. The config 
value `disk_access_mode` that controls this isn't even included in or 
documented in cassandra.yml.

While this may be a reasonable default config for Cassandra, we've noticed a 
pathalogical interplay between the way Linux implements readahead for mmap, and 
Cassandra's IO patterns, particularly on vanilla Centos 7 VMs.

A read that misses all levels of cache in Cassandra is 

[jira] [Created] (CASSANDRA-17237) Pathalogical interaction between Cassandra and readahead, particularly on Centos 7 VMs

2022-01-04 Thread Daniel Cranford (Jira)
Daniel Cranford created CASSANDRA-17237:
---

 Summary: Pathalogical interaction between Cassandra and readahead, 
particularly on Centos 7 VMs
 Key: CASSANDRA-17237
 URL: https://issues.apache.org/jira/browse/CASSANDRA-17237
 Project: Cassandra
  Issue Type: Bug
Reporter: Daniel Cranford


Cassandra defaults to using mmap for IO, except on 32-bit systems. The config 
value `disk_access_mode` that controls this isn't even included or documented 
in cassandra.yaml.

While this may be a reasonable default config for Cassandra, we've noticed a 
pathological interplay between the way Linux implements readahead for mmap and 
Cassandra's IO patterns, particularly on vanilla Centos 7 VMs.

A read that misses all levels of cache in Cassandra is (typically) going to 
involve two IOs: one into the index file and one into the data file. These IOs 
will both be effectively random given the nature of the Murmur3 hash partitioner.

The amount of data read by the index file IO will be relatively small, perhaps 
4-8kb, compared to the data file IO, which (assuming the entire partition fits 
in a single compressed chunk and a compression ratio of 1/2) will require 32kb.

However, applications using `mmap()` have no way to tell the OS the desired IO 
size - they can only tell the OS the desired IO location, by reading from the 
mapped address and triggering a page fault. This is unlike `read()`, where the 
application provides both the size and the location to the OS. So for `mmap()` 
the OS has to guess how large the IO submitted to the backing device should be, 
and whether the application is performing sequential or random IO, unless the 
application provides hints (e.g. `fadvise()`, `madvise()`, `readahead()`).

This is how Linux determines the size of IO for mmap during a page fault:
 * Outside of hints (e.g. FADV_RANDOM), the default IO size is the maximum 
readahead value, with the faulting address in the middle of the IO; i.e. IO is 
requested for the range [fault_addr - max_readahead / 2, fault_addr + 
max_readahead / 2]. This is sometimes referred to as "read around" (i.e. reading 
around the faulting address). See 
[here|https://github.com/torvalds/linux/blob/0c941cf30b913d4a684d3f8d9eee60e0daffdc79/mm/filemap.c#L2989]
  * The kernel maintains a cache miss counter for the file. Every time the 
kernel submits an IO for a page fault, this counts as a miss. Every time the 
application faults in a page that is already in the page cache (presumably 
from a previous page fault's IO), that counts as a cache hit and decrements the 
counter. If the miss counter exceeds a threshold, the kernel stops inflating 
the IOs to the max readahead and falls back to reading a *single* 4k page for 
each page fault. 
See summary 
[here|https://www.quora.com/What-heuristics-does-the-adaptive-readahead-implementation-in-the-Linux-kernel-use/answer/Robert-Love-1]
 and implementation 
[here|https://github.com/torvalds/linux/blob/0c941cf30b913d4a684d3f8d9eee60e0daffdc79/mm/filemap.c#L2955]
 and 
[here|https://github.com/torvalds/linux/blob/0c941cf30b913d4a684d3f8d9eee60e0daffdc79/mm/filemap.c#L3005]
  * This means an application that, on average, references more than one 4k 
page around the initial page fault will consistently have its page fault IOs 
inflated to the maximum readahead value. Note that there is no ramping up of a 
window the way there is with standard IO: as far as I can tell, the kernel only 
submits IOs of one page or of max_readahead.

Observations:
* mmap'ed IO on Linux wastes half the IO bandwidth. This may or may not be a big 
deal depending on your setup.
* Cassandra will always have IOs inflated to the maximum readahead, because more 
than one page is referenced for the data file and (depending on the size and 
cardinality of your keys) more than one page is referenced from the index file.
* The device's readahead is a crude system-wide knob for controlling IO size. 
Cassandra cannot perform smaller IOs for the index file (unless your keyset is 
such that only one page from the index file needs to be referenced).

Centos 7 VMs:
* The default readahead for Centos 7 VMs is 4MB (as opposed to the default 
readahead for non-VM Centos 7 which is 128kb).
* Even though this is reduced by the kernel (cf `max_sane_readahead()`) to 
something around 450k, it is still far too large for an average Cassandra read.
* Even once this readahead is reduced to the recommended 64kb, standard IO 
still has a 10% performance advantage in our tests, likely because the 
readahead algorithm for standard IO is more flexible and converges on smaller 
reads from the index file and larger reads from the data file.






[jira] [Commented] (CASSANDRA-10134) Always require replace_address to replace existing address

2017-10-19 Thread Daniel Cranford (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211198#comment-16211198
 ] 

Daniel Cranford commented on CASSANDRA-10134:
-

This fix changes the definition of a "seed" node and invalidates the 
description provided in the [FAQ 
page|http://cassandra.apache.org/doc/latest/faq/index.html#what-are-seeds]. 
Because the shadow round only talks to seeds and the shadow round is now 
performed on every startup (not just bootstrap), a node will not boot unless at 
least one seed is alive.


> Always require replace_address to replace existing address
> --
>
> Key: CASSANDRA-10134
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10134
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Distributed Metadata
>Reporter: Tyler Hobbs
>Assignee: Sam Tunnicliffe
>  Labels: docs-impacting
> Fix For: 3.6
>
>
> Normally, when a node is started from a clean state with the same address as 
> an existing down node, it will fail to start with an error like this:
> {noformat}
> ERROR [main] 2015-08-19 15:07:51,577 CassandraDaemon.java:554 - Exception 
> encountered during startup
> java.lang.RuntimeException: A node with address /127.0.0.3 already exists, 
> cancelling join. Use cassandra.replace_address if you want to replace this 
> node.
>   at 
> org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:543)
>  ~[main/:na]
>   at 
> org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:783)
>  ~[main/:na]
>   at 
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:720)
>  ~[main/:na]
>   at 
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:611)
>  ~[main/:na]
>   at 
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:378) 
> [main/:na]
>   at 
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:537)
>  [main/:na]
>   at 
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:626) 
> [main/:na]
> {noformat}
> However, if {{auto_bootstrap}} is set to false or the node is in its own seed 
> list, it will not throw this error and will start normally.  The new node 
> then takes over the host ID of the old node (even if the tokens are 
> different), and the only message you will see is a warning in the other 
> nodes' logs:
> {noformat}
> logger.warn("Changing {}'s host ID from {} to {}", endpoint, storedId, 
> hostId);
> {noformat}
> This could cause an operator to accidentally wipe out the token information 
> for a down node without replacing it.  To fix this, we should check for an 
> endpoint collision even if {{auto_bootstrap}} is false or the node is a seed.






[jira] [Commented] (CASSANDRA-13940) Fix stress seed multiplier

2017-10-05 Thread Daniel Cranford (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192830#comment-16192830
 ] 

Daniel Cranford commented on CASSANDRA-13940:
-

See [this 
comment|https://issues.apache.org/jira/browse/CASSANDRA-12744?focusedCommentId=16192820=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16192820]
 for a full explanation of the problem.

> Fix stress seed multiplier
> --
>
> Key: CASSANDRA-13940
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13940
> Project: Cassandra
>  Issue Type: Bug
>  Components: Stress
>Reporter: Daniel Cranford
> Attachments: 0001-Fixing-seed-multiplier.patch
>
>
> CASSANDRA-12744 attempted to fix a problem with partition key generation, but 
> the fix is generally broken. E.g.
> {noformat}
> cassandra-stress -insert visits=fixed\(100\) revisit=uniform\(1..100\) ...
> {noformat}
> sends cassandra-stress into an infinite loop. Here's a better fix.






[jira] [Commented] (CASSANDRA-12744) Randomness of stress distributions is not good

2017-10-05 Thread Daniel Cranford (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192827#comment-16192827
 ] 

Daniel Cranford commented on CASSANDRA-12744:
-

Created CASSANDRA-13940 to fix this.

> Randomness of stress distributions is not good
> --
>
> Key: CASSANDRA-12744
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12744
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: T Jake Luciani
>Assignee: Ben Slater
>Priority: Minor
>  Labels: stress
> Fix For: 4.0
>
> Attachments: CASSANDRA_12744_SeedManager_changes-trunk.patch
>
>
> The randomness of our distributions is pretty bad.  We are using the 
> JDKRandomGenerator() but in testing of uniform(1..3) we see for 100 
> iterations it's only outputting 3.  If you bump it to 10k it hits all 3 
> values. 
> I made a change to just use the default commons math random generator and now 
> see all 3 values for n=10






[jira] [Created] (CASSANDRA-13940) Fix stress seed multiplier

2017-10-05 Thread Daniel Cranford (JIRA)
Daniel Cranford created CASSANDRA-13940:
---

 Summary: Fix stress seed multiplier
 Key: CASSANDRA-13940
 URL: https://issues.apache.org/jira/browse/CASSANDRA-13940
 Project: Cassandra
  Issue Type: Bug
  Components: Stress
Reporter: Daniel Cranford
 Attachments: 0001-Fixing-seed-multiplier.patch

CASSANDRA-12744 attempted to fix a problem with partition key generation, but 
the fix is generally broken. E.g.

{noformat}
cassandra-stress -insert visits=fixed\(100\) revisit=uniform\(1..100\) ...
{noformat}

sends cassandra-stress into an infinite loop. Here's a better fix.






[jira] [Commented] (CASSANDRA-12744) Randomness of stress distributions is not good

2017-10-05 Thread Daniel Cranford (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192820#comment-16192820
 ] 

Daniel Cranford commented on CASSANDRA-12744:
-

Some more thoughts:

The generation of partition keys has been "broken" since CASSANDRA-7519.

The Linear Congruential Generators (LCGs) used in java.util.Random and by 
extension JDKRandomGenerator generate good random number sequences, but similar 
seeds result in similar sequences. Using the lcg update function {{lcg\(x) = 
a*x + c}} like

random ~1~ = lcg(1)
random ~2~ = lcg(2)
random ~3~ = lcg(3)
...
random ~n~ = lcg\(n)

does not generate a good random sequence; this is a misuse of the LCG. LCGs are 
supposed to be used like

random ~1~ = lcg(1)
random ~2~ = lcg(lcg(1))
random ~3~ = lcg(lcg(lcg(1)))
...
random ~n~ = lcg ^n^ (1)
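
A small sketch of the difference, using java.util.Random's LCG constants (the constants and the 2^48 modulus are the JDK's; the two usage patterns are the ones described above):
{code:java}
public class LcgUsage
{
    // java.util.Random's LCG: next = (A * prev + C) mod 2^48
    static final long A = 0x5DEECE66DL, C = 0xBL, MASK = (1L << 48) - 1;

    static long lcg(long x) { return (A * x + C) & MASK; }

    public static void main(String[] args)
    {
        // Misuse: one LCG step applied to consecutive seeds.
        // Consecutive outputs differ by exactly A - highly correlated.
        System.out.println(lcg(2) - lcg(1)); // prints 25214903917 (= A)
        System.out.println(lcg(3) - lcg(2)); // prints 25214903917 again

        // Intended use: iterate the LCG on its own output.
        long x = lcg(1);
        x = lcg(x);
        x = lcg(x);
        System.out.println(x); // successive values have no such simple relationship
    }
}
{code}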

I say "broken" in quotes because the misuse of LCGs ends up not mattering. 
{{new java.util.Random(seed).nextDouble()}} will always differ from {{new 
java.util.Random(seed + 1).nextDouble()}} by more than 1/100,000,000,000 Thus 
with the default partition key population (=UNIFORM(1..100B)), seeds that 
differ by 1 will generate distinct partition keys.

The only thing that matters about partition keys is how many distinct values 
there are (and how large their lexical value is). The number of partition key 
components doesn't matter. The cardinality of each partition key component 
doesn't matter. The distribution of values in the lexical partition key space 
doesn't matter.

At the end of the day, all the partition key components get concatenated and 
the resulting bit vector is hashed resulting in a uniformly distributed 64 bit 
token that determines where the data will be stored.

The easiest "fix" is to not use the partition key population to define the 
number of partition keys. Take advantage of the fact that the only thing that 
matters from a performance standpoint is the number of distinct partitions. 
Leave the partition key distribution at uniform(1..100B), and use the n= 
parameter to define the number of partitions.
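
For example (figures are illustrative), rather than narrowing the key population:
{noformat}
cassandra-stress write n=250M
cassandra-stress read n=250M
{noformat}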

An ideal fix would update the way partition keys are generated to use the LCG 
generator properly. However, this seems difficult since LCGs don't support 
random access (i.e., the only way to calculate the nth item in an LCG sequence 
is to first calculate the n-1 preceding items), and all three seed generation 
modes rely on the ability to randomly jump around in the seed sequence. This 
could be worked around by using a PRNG that supports random access to the n'th 
item in the sequence (e.g. something like JDK 1.8's SplittableRandom could be 
easily extended to support this).
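
As a sketch of that random-access idea (an illustration, not the proposed change): derive the n'th value independently from its index, e.g. via the JDK's SplittableRandom, which mixes its seed thoroughly when producing values:
{code:java}
import java.util.SplittableRandom;

public class RandomAccessSeeds
{
    // The n'th value of a reproducible pseudo-random sequence, computable in O(1)
    // because each value is derived independently from (base + n).
    static long nth(long base, long n)
    {
        return new SplittableRandom(base + n).nextLong();
    }

    public static void main(String[] args)
    {
        System.out.println(nth(42, 1_000_000)); // jump straight to the millionth value
        System.out.println(nth(42, 0));         // or back to the first
    }
}
{code}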

A more workable fix is to spread the generated seeds (typically drawn from a 
smallish range of integers) across the 2 ^64^ values a long can take before 
seeding the LCG. An additional caveat: whatever function is used for spreading 
the seeds needs to be invertible, since LookbackableWriteGenerator's 
implementation relies on the properties of the sequence it generates to perform 
internal bookkeeping.

Multiplication by an odd integer happens to be an invertible function (although 
integer division is NOT the inverse operation; multiplication by the modular 
inverse is). So the initial implementation (although broken) is not actually 
that bad an idea. I propose fixing things by picking a static integer as the 
multiplier and using multiplication by its modular inverse to invert it for 
LookbackableWriteGenerator.
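
A sketch of that round trip (the multiplier below is an arbitrary odd constant chosen for illustration, not a proposed value):
{code:java}
public class OddMultiplierInverse
{
    // Modular inverse of an odd 64-bit multiplier (mod 2^64) via Newton's iteration;
    // each step doubles the number of correct low bits.
    static long inverse(long a)
    {
        long x = a; // correct to at least 3 low bits for any odd a
        for (int i = 0; i < 5; i++)
            x *= 2 - a * x;
        return x;
    }

    public static void main(String[] args)
    {
        long multiplier = 0x9E3779B97F4A7C15L; // arbitrary odd constant (illustrative)
        long inv = inverse(multiplier);

        long seed = 123456789L;
        long spread = seed * multiplier; // spread the seed across the 64-bit range
        long back = spread * inv;        // multiplying by the inverse recovers it

        System.out.println(back == seed); // prints true
    }
}
{code}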


> Randomness of stress distributions is not good
> --
>
> Key: CASSANDRA-12744
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12744
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: T Jake Luciani
>Assignee: Ben Slater
>Priority: Minor
>  Labels: stress
> Fix For: 4.0
>
> Attachments: CASSANDRA_12744_SeedManager_changes-trunk.patch
>
>
> The randomness of our distributions is pretty bad.  We are using the 
> JDKRandomGenerator() but in testing of uniform(1..3) we see for 100 
> iterations it's only outputting 3.  If you bump it to 10k it hits all 3 
> values. 
> I made a change to just use the default commons math random generator and now 
> see all 3 values for n=10






[jira] [Updated] (CASSANDRA-13932) Stress write order and seed order should be different

2017-10-03 Thread Daniel Cranford (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Cranford updated CASSANDRA-13932:

Summary: Stress write order and seed order should be different  (was: Write 
order and seed order should be different)

> Stress write order and seed order should be different
> -
>
> Key: CASSANDRA-13932
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13932
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Daniel Cranford
>  Labels: stress
> Attachments: 0001-Initial-implementation-cassandra-3.11.patch, 
> vmtouch-after.txt, vmtouch-before.txt
>
>
> Read tests get an unrealistic boost in performance because they read data 
> from a set of partitions that was written sequentially.
> I ran into this while running a timed read test against a large data set (250 
> million partition keys) {noformat}cassandra-stress read 
> duration=30m{noformat} While the test was running, I noticed one node was 
> performing zero IO after an initial period.
> I discovered each node in the cluster only had blocks from a single SSTable 
> loaded in the FS cache. {noformat}vmtouch -v /path/to/sstables{noformat}
> For the node that was performing zero IO, the SSTable in question was small 
> enough to fit into the FS cache.
> I realized that when a read test is run for a duration or until rate 
> convergence, the default population for the seeds is a GAUSSIAN distribution 
> over the first million seeds. Because of the way compaction works, partitions 
> that are written sequentially will (with high probability) always live in the 
> same SSTable. That means that while the first million seeds will generate 
> partition keys that will be randomly distributed in the token space, they 
> will most likely all live in the same SSTable. When this SSTable is small 
> enough to fit into the FS cache, you get unbelievably good results for a read 
> test. Consider that a dataset 4x the size of the FS cache will have almost 
> 1/2 the data in SSTables small enough to fit into the FS cache.
> Adjusting the population of seeds used during the read test to be the entire 
> 250 million seeds used to load the cluster does not fix the 
> problem.{noformat}cassandra-stress read duration=30m -pop 
> dist=gaussian(1..250M){noformat}
> or (same population, larger sample) {noformat}cassandra-stress read 
> n=250M{noformat}
> Any distribution other than the uniform distribution has one or more modes, 
> and the mode(s) of such a distribution will cluster reads around a certain 
> seed range which corresponds to a certain set of sequential writes which 
> corresponds to (with high probability) a single SSTable.
> My patch against cassandra-3.11 fixes this by shuffling the sequence of 
> generated seeds. Each seed value will still be generated once and only once. 
> The old behavior of sequential seed generation (i.e. seed(n+1) = seed(n) + 1) 
> may be selected by using the no-shuffle flag. e.g. {noformat}cassandra-stress 
> read duration=30m -pop no-shuffle{noformat}
> Results: In [^vmtouch-before.txt] only pages from a single SSTable are 
> present in the FS cache while in [^vmtouch-after.txt] an equal proportion of 
> all SSTables are present in the FS cache.






[jira] [Created] (CASSANDRA-13932) Write order and seed order should be different

2017-10-03 Thread Daniel Cranford (JIRA)
Daniel Cranford created CASSANDRA-13932:
---

 Summary: Write order and seed order should be different
 Key: CASSANDRA-13932
 URL: https://issues.apache.org/jira/browse/CASSANDRA-13932
 Project: Cassandra
  Issue Type: Bug
  Components: Tools
Reporter: Daniel Cranford
 Attachments: 0001-Initial-implementation-cassandra-3.11.patch, 
vmtouch-after.txt, vmtouch-before.txt

Read tests get an unrealistic boost in performance because they read data from 
a set of partitions that was written sequentially.

I ran into this while running a timed read test against a large data set (250 
million partition keys) {noformat}cassandra-stress read duration=30m{noformat} 
While the test was running, I noticed one node was performing zero IO after an 
initial period.

I discovered each node in the cluster only had blocks from a single SSTable 
loaded in the FS cache. {noformat}vmtouch -v /path/to/sstables{noformat}

For the node that was performing zero IO, the SSTable in question was small 
enough to fit into the FS cache.

I realized that when a read test is run for a duration or until rate 
convergence, the default population for the seeds is a GAUSSIAN distribution 
over the first million seeds. Because of the way compaction works, partitions 
that are written sequentially will (with high probability) always live in the 
same SSTable. That means that while the first million seeds will generate 
partition keys that will be randomly distributed in the token space, they will 
most likely all live in the same SSTable. When this SSTable is small enough to 
fit into the FS cache, you get unbelievably good results for a read test. 
Consider that a dataset 4x the size of the FS cache will have almost 1/2 the 
data in SSTables small enough to fit into the FS cache.

Adjusting the population of seeds used during the read test to be the entire 
250 million seeds used to load the cluster does not fix the 
problem.{noformat}cassandra-stress read duration=30m -pop 
dist=gaussian(1..250M){noformat}
or (same population, larger sample) {noformat}cassandra-stress read 
n=250M{noformat}

Any distribution other than the uniform distribution has one or more modes, and 
the mode(s) of such a distribution will cluster reads around a certain seed 
range which corresponds to a certain set of sequential writes which corresponds 
to (with high probability) a single SSTable.

My patch against cassandra-3.11 fixes this by shuffling the sequence of 
generated seeds. Each seed value will still be generated once and only once. 
The old behavior of sequential seed generation (i.e. seed(n+1) = seed(n) + 1) 
may be selected by using the no-shuffle flag. e.g. {noformat}cassandra-stress 
read duration=30m -pop no-shuffle{noformat}

Results: In [^vmtouch-before.txt] only pages from a single SSTable are present 
in the FS cache while in [^vmtouch-after.txt] an equal proportion of all 
SSTables are present in the FS cache.








[jira] [Commented] (CASSANDRA-12744) Randomness of stress distributions is not good

2017-10-03 Thread Daniel Cranford (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16190085#comment-16190085
 ] 

Daniel Cranford commented on CASSANDRA-12744:
-

As I've thought about how to fix the seed multiplier, I've come to the 
conclusion that it is impossible to use an adaptive multiplier without breaking 
existing functionality or changing the command line interface.

One of the key reasons you can specify how the seeds get generated is so that 
you can partition the seed space and run multiple cassandra-stress processes on 
different machines in parallel so the cassandra-stress client doesn't become 
the bottleneck. E.g. to write 2 million partitions from two client machines, 
you'd run {noformat}cassandra-stress write n=1000000 -pop 
seq=1..1000000{noformat} on one client machine and {noformat}cassandra-stress 
write n=1000000 -pop seq=1000001..2000000{noformat} on the other client machine.

An adaptive multiplier that attempts to scale the seed sequence so that its 
range is 10^22 (or better, Long.MAX_VALUE since seeds are 64 bit longs) would 
generate the same multiplier for both client processes resulting in seed 
sequence overlaps.

To correctly generate an adaptive multiplier, you need global knowledge of the 
entire range of seeds being generated by all cassandra-stress processes. This 
information cannot be supplied via the current command line interface. The 
command line interface would have to be updated in a breaking fashion to 
support an adaptive multiplier.

Using a hardcoded static multiplier is safe, but would reduce the allowable 
range of seed values (and thus reduce the maximum number of distinct partition 
keys). This probably isn't a big deal since nobody wants to write 2^64 
partitions. But it would need to be chosen with care so that the number of 
distinct seeds (and thus the number of distinct partitions) doesn't become too 
small.
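
(For a rough sense of that tradeoff, the arithmetic is just a division: with a 
fixed multiplier M, at most Long.MAX_VALUE / M seeds can be scaled before 
seed * M overflows a signed 64-bit long. A standalone sketch; the multiplier 
value here is only an example, not a proposal.)
{code:java}
public class StaticMultiplierBudget
{
    public static void main(String[] args)
    {
        // hypothetical hardcoded multiplier; any fixed value trades maximum seed count for spread
        long multiplier = 1L << 20;                          // ~1.05 million
        long maxDistinctSeeds = Long.MAX_VALUE / multiplier; // ~8.8e12 seeds before seed * multiplier overflows
        System.out.printf("multiplier=%d -> up to %d distinct seeds%n", multiplier, maxDistinctSeeds);
    }
}
{code}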



> Randomness of stress distributions is not good
> --
>
> Key: CASSANDRA-12744
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12744
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: T Jake Luciani
>Assignee: Ben Slater
>Priority: Minor
>  Labels: stress
> Fix For: 4.0
>
> Attachments: CASSANDRA_12744_SeedManager_changes-trunk.patch
>
>
> The randomness of our distributions is pretty bad.  We are using the 
> JDKRandomGenerator() but in testing of uniform(1..3) we see for 100 
> iterations it's only outputting 3.  If you bump it to 10k it hits all 3 
> values. 
> I made a change to just use the default commons math random generator and now 
> see all 3 values for n=10



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-12744) Randomness of stress distributions is not good

2017-09-28 Thread Daniel Cranford (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16184418#comment-16184418
 ] 

Daniel Cranford commented on CASSANDRA-12744:
-

I think the math on this is broken slightly. The seed multiplier is intended to 
scale all seeds to the 10^22 magnitude. However, seeds (and the multiplier) are 
all stored in 64-bit integers, and the math performed on them is 64-bit math.

10^22 is not representable as a long which has range {noformat}[-(2^63) : 2^63 
- 1] = [-9,223,372,036,854,775,808 : 9,223,372,036,854,775,807]{noformat}

Consider that for sample sizes under 1084, the line that calculates the sample 
multiplier 
{noformat}this.sampleMultiplier = 1 + Math.round(Math.pow(10D, 22 - 
Math.log10(sampleSize)));{noformat}
will result in a multiplier of Long.MIN_VALUE which when multiplied by any long 
will result in 0 or Long.MIN_VALUE reducing your seeds to two distinct values.

I think using 18 instead of 22 as the target exponent should resolve this issue.
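
The overflow is easy to reproduce in isolation; a minimal standalone sketch 
mirroring the expression quoted above (not the stress code itself):
{code:java}
public class MultiplierOverflowDemo
{
    public static void main(String[] args)
    {
        long sampleSize = 1000;                                     // anything under ~1084
        double scaled = Math.pow(10D, 22 - Math.log10(sampleSize)); // 1.0e19, larger than Long.MAX_VALUE
        long multiplier = 1 + Math.round(scaled);                   // Math.round clamps to Long.MAX_VALUE,
                                                                    // then +1 wraps to Long.MIN_VALUE
        System.out.println(multiplier == Long.MIN_VALUE);           // true

        // Long.MIN_VALUE times any long collapses to just two values
        System.out.println(multiplier * 2);                         // 0
        System.out.println(multiplier * 3);                         // -9223372036854775808 (Long.MIN_VALUE)
    }
}
{code}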

Additionally, I think the seed population size is being incorrectly calculated 
as the range of the revisit distribution (which defaults to uniform(1..1M)). 
However, when running in the default sequential seed mode (without revisits), 
eg {noformat}cassandra-stress write n=100{noformat}, the size of the seed 
population is actually the length of the seed sequence (in this case 100).

And when running with seeds generated from a distribution, eg 
{noformat}cassandra-stress read -pop dist=gaussian(1..250M){noformat} the size 
of the seed population is actually the range of the seed distribution (in this 
case 250 million).


> Randomness of stress distributions is not good
> --
>
> Key: CASSANDRA-12744
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12744
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: T Jake Luciani
>Assignee: Ben Slater
>Priority: Minor
>  Labels: stress
> Fix For: 4.0
>
> Attachments: CASSANDRA_12744_SeedManager_changes-trunk.patch
>
>
> The randomness of our distributions is pretty bad.  We are using the 
> JDKRandomGenerator() but in testing of uniform(1..3) we see for 100 
> iterations it's only outputting 3.  If you bump it to 10k it hits all 3 
> values. 
> I made a change to just use the default commons math random generator and now 
> see all 3 values for n=10



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-13879) cassandra-stress sleeps for entire duration even when errors halt progress

2017-09-15 Thread Daniel Cranford (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Cranford updated CASSANDRA-13879:

Attachment: 0001-Fixing-bug.patch

Here's a patch that fixes this bug.

> cassandra-stress sleeps for entire duration even when errors halt progress
> --
>
> Key: CASSANDRA-13879
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13879
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Daniel Cranford
>Priority: Minor
> Attachments: 0001-Fixing-bug.patch
>
>
> If cassandra-stress is run with a duration parameter, eg
> {noformat}
> cassandra-stress read duration=30s
> {noformat}
> then, the process will sleep for the entire duration, even when errors have 
> killed all the Consumer threads responsible for executing queries.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-13879) cassandra-stress sleeps for entire duration even when errors halt progress

2017-09-15 Thread Daniel Cranford (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Cranford updated CASSANDRA-13879:

Priority: Minor  (was: Major)

> cassandra-stress sleeps for entire duration even when errors halt progress
> --
>
> Key: CASSANDRA-13879
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13879
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Daniel Cranford
>Priority: Minor
>
> If cassandra-stress is run with a duration parameter, eg
> {noformat}
> cassandra-stress read duration=30s
> {noformat}
> then, the process will sleep for the entire duration, even when errors have 
> killed all the Consumer threads responsible for executing queries.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (CASSANDRA-13879) cassandra-stress sleeps for entire duration even when errors halt progress

2017-09-15 Thread Daniel Cranford (JIRA)
Daniel Cranford created CASSANDRA-13879:
---

 Summary: cassandra-stress sleeps for entire duration even when 
errors halt progress
 Key: CASSANDRA-13879
 URL: https://issues.apache.org/jira/browse/CASSANDRA-13879
 Project: Cassandra
  Issue Type: Bug
Reporter: Daniel Cranford


If cassandra-stress is run with a duration parameter, eg
{noformat}
cassandra-stress read duration=30s
{noformat}

then, the process will sleep for the entire duration, even when errors have 
killed all the Consumer threads responsible for executing queries.
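
A sketch of one way to avoid the blind sleep: instead of Thread.sleep(duration), 
wait on something the consumer threads can signal, so the run ends as soon as 
they all die. This illustrates the pattern only, it is not the attached patch; 
the method and variable names are hypothetical.
{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class BoundedRunSketch
{
    /** Block until either the requested duration elapses or every consumer thread has exited. */
    static void awaitDurationOrFailure(long duration, TimeUnit unit, Thread[] consumers)
            throws InterruptedException
    {
        CountDownLatch allDead = new CountDownLatch(consumers.length);
        for (Thread consumer : consumers)
        {
            Thread watcher = new Thread(() -> {
                try { consumer.join(); }                 // returns when the consumer dies for any reason
                catch (InterruptedException e) { Thread.currentThread().interrupt(); }
                finally { allDead.countDown(); }
            });
            watcher.setDaemon(true);
            watcher.start();
        }
        allDead.await(duration, unit);                   // returns early if all consumers are gone
    }
}
{code}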



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-13871) cassandra-stress user command misbehaves when retrying operations

2017-09-14 Thread Daniel Cranford (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Cranford updated CASSANDRA-13871:

Attachment: 0001-Fixing-cassandra-stress-user-operations-retry.patch

Here's a patch that fixes the problem in trunk.

> cassandra-stress user command misbehaves when retrying operations
> -
>
> Key: CASSANDRA-13871
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13871
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Daniel Cranford
> Attachments: 0001-Fixing-cassandra-stress-user-operations-retry.patch
>
>
> o.a.c.stress.Operation will retry queries a configurable number of times. 
> When the "user" command is invoked the o.a.c.stress.operations.userdefined 
> SchemaInsert and SchemaQuery operations are used.
> When SchemaInsert and SchemaQuery are retried (eg after a Read/WriteTimeout 
> exception), they advance the PartitionIterator used to generate the keys to 
> insert/query (SchemaInsert.java:85 SchemaQuery.java:129) This means each 
> retry will use a different set of keys.
> The predefined set of operations avoid this problem by packaging up the 
> arguments to bind to the query into the RunOp object so that retrying the 
> operation results in exactly the same query with the same arguments being run.
> This problem was introduced by CASSANDRA-7964. Prior to CASSANDRA-7964 the 
> PartitionIterator (Partition.RowIterator before the change) was reinitialized 
> prior to each query retry, thus generating the same set of keys each time.
> This problem is reported rather confusingly. The only error that shows up in 
> a log file (specified with -log file=foo.log) is the unhelpful
> {noformat}
> java.io.IOException Operation x10 on key(s) [foobarkey]: Error executing: 
> (NoSuchElementException)
> at org.apache.cassandra.stress.Operation.error(Operation.java:136)
> at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:114)
> at 
> org.apache.cassandra.stress.userdefined.SchemaQuery.run(SchemaQuery.java:158)
> at 
> org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:459)
> {noformat}
> Standard error is only slightly more helpful, displaying the ignorable 
> initial read/write error, and confusing java.util.NoSuchElementException 
> lines (caused by PartitionIterator exhaustion) followed by the above 
> IOException with stack trace, eg
> {noformat}
> com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout 
> during read query
> java.util.NoSuchElementException
> java.util.NoSuchElementException
> java.util.NoSuchElementException
> java.util.NoSuchElementException
> java.util.NoSuchElementException
> java.util.NoSuchElementException
> java.util.NoSuchElementException
> java.util.NoSuchElementException
> java.util.NoSuchElementException
> java.io.IOException Operation x10 on key(s) [foobarkey]: Error executing: 
> (NoSuchElementException)
> at org.apache.cassandra.stress.Operation.error(Operation.java:136)
> at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:114)
> at 
> org.apache.cassandra.stress.userdefined.SchemaQuery.run(SchemaQuery.java:158)
> at 
> org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:459)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13871) cassandra-stress user command misbehaves when retrying operations

2017-09-14 Thread Daniel Cranford (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166886#comment-16166886
 ] 

Daniel Cranford commented on CASSANDRA-13871:
-

*Note*
This problem can be worked around by ignoring errors and disabling retries 
{noformat}-errors ignore retries=0{noformat}

> cassandra-stress user command misbehaves when retrying operations
> -
>
> Key: CASSANDRA-13871
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13871
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Daniel Cranford
>
> o.a.c.stress.Operation will retry queries a configurable number of times. 
> When the "user" command is invoked the o.a.c.stress.operations.userdefined 
> SchemaInsert and SchemaQuery operations are used.
> When SchemaInsert and SchemaQuery are retried (eg after a Read/WriteTimeout 
> exception), they advance the PartitionIterator used to generate the keys to 
> insert/query (SchemaInsert.java:85 SchemaQuery.java:129) This means each 
> retry will use a different set of keys.
> The predefined set of operations avoid this problem by packaging up the 
> arguments to bind to the query into the RunOp object so that retrying the 
> operation results in exactly the same query with the same arguments being run.
> This problem was introduced by CASSANDRA-7964. Prior to CASSANDRA-7964 the 
> PartitionIterator (Partition.RowIterator before the change) was reinitialized 
> prior to each query retry, thus generating the same set of keys each time.
> This problem is reported rather confusingly. The only error that shows up in 
> a log file (specified with -log file=foo.log) is the unhelpful
> {noformat}
> java.io.IOException Operation x10 on key(s) [foobarkey]: Error executing: 
> (NoSuchElementException)
> at org.apache.cassandra.stress.Operation.error(Operation.java:136)
> at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:114)
> at 
> org.apache.cassandra.stress.userdefined.SchemaQuery.run(SchemaQuery.java:158)
> at 
> org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:459)
> {noformat}
> Standard error is only slightly more helpful, displaying the ignorable 
> initial read/write error, and confusing java.util.NoSuchElementException 
> lines (caused by PartitionIterator exhaustion) followed by the above 
> IOException with stack trace, eg
> {noformat}
> com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout 
> during read query
> java.util.NoSuchElementException
> java.util.NoSuchElementException
> java.util.NoSuchElementException
> java.util.NoSuchElementException
> java.util.NoSuchElementException
> java.util.NoSuchElementException
> java.util.NoSuchElementException
> java.util.NoSuchElementException
> java.util.NoSuchElementException
> java.io.IOException Operation x10 on key(s) [foobarkey]: Error executing: 
> (NoSuchElementException)
> at org.apache.cassandra.stress.Operation.error(Operation.java:136)
> at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:114)
> at 
> org.apache.cassandra.stress.userdefined.SchemaQuery.run(SchemaQuery.java:158)
> at 
> org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:459)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-13871) cassandra-stress user command misbehaves when retrying operations

2017-09-14 Thread Daniel Cranford (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Cranford updated CASSANDRA-13871:

Description: 
o.a.c.stress.Operation will retry queries a configurable number of times. When 
the "user" command is invoked the o.a.c.stress.operations.userdefined 
SchemaInsert and SchemaQuery operations are used.

When SchemaInsert and SchemaQuery are retried (eg after a Read/WriteTimeout 
exception), they advance the PartitionIterator used to generate the keys to 
insert/query (SchemaInsert.java:85 SchemaQuery.java:129) This means each retry 
will use a different set of keys.

The predefined set of operations avoid this problem by packaging up the 
arguments to bind to the query into the RunOp object so that retrying the 
operation results in exactly the same query with the same arguments being run.

This problem was introduced by CASSANDRA-7964. Prior to CASSANDRA-7964 the 
PartitionIterator (Partition.RowIterator before the change) was reinitialized 
prior to each query retry, thus generating the same set of keys each time.

This problem is reported rather confusingly. The only error that shows up in a 
log file (specified with -log file=foo.log) is the unhelpful
{noformat}
java.io.IOException Operation x10 on key(s) [foobarkey]: Error executing: 
(NoSuchElementException)
at org.apache.cassandra.stress.Operation.error(Operation.java:136)
at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:114)
at 
org.apache.cassandra.stress.userdefined.SchemaQuery.run(SchemaQuery.java:158)
at 
org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:459)
{noformat}

Standard error is only slightly more helpful, displaying the ignorable initial 
read/write error, and confusing java.util.NoSuchElementException lines (caused 
by PartitionIterator exhaustion) followed by the above IOException with stack 
trace, eg
{noformat}
com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout 
during read query
java.util.NoSuchElementException
java.util.NoSuchElementException
java.util.NoSuchElementException
java.util.NoSuchElementException
java.util.NoSuchElementException
java.util.NoSuchElementException
java.util.NoSuchElementException
java.util.NoSuchElementException
java.util.NoSuchElementException
java.io.IOException Operation x10 on key(s) [foobarkey]: Error executing: 
(NoSuchElementException)
at org.apache.cassandra.stress.Operation.error(Operation.java:136)
at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:114)
at 
org.apache.cassandra.stress.userdefined.SchemaQuery.run(SchemaQuery.java:158)
at 
org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:459)
{noformat}



  was:
o.a.c.stress.Operation will retry queries a configurable number of times. When 
the "user" command is invoked the o.a.c.stress.operations.userdefined 
SchemaInsert and SchemaQuery operations are used.

When SchemaInsert and SchemaQuery are retried (eg after a Read/WriteTimeout 
exception), they advance the PartitionIterator used to generate the keys to 
insert/query (SchemaInsert.java:85 SchemaQuery.java:129) This means each retry 
will use a different set of keys.

The predefined set of operations avoid this problem by packaging up the 
arguments to bind to the query into the RunOp object so that retrying the 
operation results in exactly the same query with the same arguments being run.

This problem was introduced by CASSANDRA-7964. Prior to CASSANDRA-7964 the 
PartitionIterator (Partition.RowIterator before the change) was reinitialized 
prior to each query retry, thus generating the same set of keys each time.

This problem is reported rather confusingly. The only error that shows up in a 
log file (specified with -log file=foo.log) is the unhelpful
{{{
java.io.IOException Operation x10 on key(s) [foobarkey]: Error executing: 
(NoSuchElementException)
at org.apache.cassandra.stress.Operation.error(Operation.java:136)
at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:114)
at 
org.apache.cassandra.stress.userdefined.SchemaQuery.run(SchemaQuery.java:158)
at 
org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:459)
}}}

Standard error is only slightly more helpful, displaying the ignorable initial 
read/write error, and confusing java.util.NoSuchElementException lines (caused 
by PartitionIterator exhaustion) followed by the above IOException with stack 
trace, eg
{{{
com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout 
during read query
java.util.NoSuchElementException
java.util.NoSuchElementException
java.util.NoSuchElementException
java.util.NoSuchElementException
java.util.NoSuchElementException
java.util.NoSuchElementException
java.util.NoSuchElementException
java.util.NoSuchElementException
java.util.NoSuchElementException

[jira] [Created] (CASSANDRA-13871) cassandra-stress user command misbehaves when retrying operations

2017-09-14 Thread Daniel Cranford (JIRA)
Daniel Cranford created CASSANDRA-13871:
---

 Summary: cassandra-stress user command misbehaves when retrying 
operations
 Key: CASSANDRA-13871
 URL: https://issues.apache.org/jira/browse/CASSANDRA-13871
 Project: Cassandra
  Issue Type: Bug
Reporter: Daniel Cranford


o.a.c.stress.Operation will retry queries a configurable number of times. When 
the "user" command is invoked the o.a.c.stress.operations.userdefined 
SchemaInsert and SchemaQuery operations are used.

When SchemaInsert and SchemaQuery are retried (eg after a Read/WriteTimeout 
exception), they advance the PartitionIterator used to generate the keys to 
insert/query (SchemaInsert.java:85 SchemaQuery.java:129) This means each retry 
will use a different set of keys.

The predefined set of operations avoid this problem by packaging up the 
arguments to bind to the query into the RunOp object so that retrying the 
operation results in exactly the same query with the same arguments being run.

This problem was introduced by CASSANDRA-7964. Prior to CASSANDRA-7964 the 
PartitionIterator (Partition.RowIterator before the change) was reinitialized 
prior to each query retry, thus generating the same set of keys each time.

This problem is reported rather confusingly. The only error that shows up in a 
log file (specified with -log file=foo.log) is the unhelpful
{{{
java.io.IOException Operation x10 on key(s) [foobarkey]: Error executing: 
(NoSuchElementException)
at org.apache.cassandra.stress.Operation.error(Operation.java:136)
at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:114)
at 
org.apache.cassandra.stress.userdefined.SchemaQuery.run(SchemaQuery.java:158)
at 
org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:459)
}}}

Standard error is only slightly more helpful, displaying the ignorable initial 
read/write error, and confusing java.util.NoSuchElementException lines (caused 
by PartitionIterator exhaustion) followed by the above IOException with stack 
trace, eg
{{{
com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout 
during read query
java.util.NoSuchElementException
java.util.NoSuchElementException
java.util.NoSuchElementException
java.util.NoSuchElementException
java.util.NoSuchElementException
java.util.NoSuchElementException
java.util.NoSuchElementException
java.util.NoSuchElementException
java.util.NoSuchElementException
java.io.IOException Operation x10 on key(s) [foobarkey]: Error executing: 
(NoSuchElementException)
at org.apache.cassandra.stress.Operation.error(Operation.java:136)
at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:114)
at 
org.apache.cassandra.stress.userdefined.SchemaQuery.run(SchemaQuery.java:158)
at 
org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:459)
}}}
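
A sketch of the argument-capture pattern described above (generate the bind 
values once, outside the retried body, so every attempt executes the identical 
query). This is an illustration only; the helper below is hypothetical and is 
not the stress Operation class.
{code:java}
import java.util.concurrent.Callable;

public class RetrySameBindSketch
{
    /**
     * The caller draws its keys from the PartitionIterator once, before calling this,
     * and passes a closure that only binds and executes those captured values.
     * Nothing inside the retried body advances the iterator, so retries reuse the same keys.
     * Assumes maxTries >= 1.
     */
    static <T> T retrySameBind(Callable<T> boundOperation, int maxTries) throws Exception
    {
        Exception last = null;
        for (int attempt = 0; attempt < maxTries; attempt++)
        {
            try
            {
                return boundOperation.call();   // same captured keys on every attempt
            }
            catch (Exception e)
            {
                last = e;                       // e.g. a read or write timeout
            }
        }
        throw last;
    }
}
{code}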





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-8735) Batch log replication is not randomized when there are only 2 racks

2017-08-14 Thread Daniel Cranford (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119979#comment-16119979
 ] 

Daniel Cranford edited comment on CASSANDRA-8735 at 8/14/17 2:24 PM:
-

[~iamaleksey] Great, I didn't see any activity yet on CASSANDRA-12884, so I 
attached a patch there.


was (Author: daniel.cranford):
[~iamaleksey] Great, I didn't see any activity yet on CASSANDRA-12844, so I 
attached a patch there.

> Batch log replication is not randomized when there are only 2 racks
> ---
>
> Key: CASSANDRA-8735
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8735
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Yuki Morishita
>Assignee: Mihai Suteu
>Priority: Minor
> Fix For: 2.1.9, 2.2.1, 3.0 alpha 1
>
> Attachments: 8735-v2.patch, CASSANDRA-8735.patch
>
>
> Batch log replication is not randomized and the same 2 nodes can be picked up 
> when there are only 2 racks in the cluster.
> https://github.com/apache/cassandra/blob/cassandra-2.0.11/src/java/org/apache/cassandra/service/BatchlogEndpointSelector.java#L72-73



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-12884) Batch logic can lead to unbalanced use of system.batches

2017-08-14 Thread Daniel Cranford (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16125719#comment-16125719
 ] 

Daniel Cranford commented on CASSANDRA-12884:
-

Oh, and not to nitpick, but any reason to prefer
{{otherRack.subList(2, otherRack.size()).clear(); return otherRack;}} to 
{{return otherRack.subList(0, 2);}} ?

> Batch logic can lead to unbalanced use of system.batches
> 
>
> Key: CASSANDRA-12884
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12884
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Adam Hattrell
>Assignee: Daniel Cranford
> Fix For: 3.0.x, 3.11.x
>
> Attachments: 0001-CASSANDRA-12884.patch
>
>
> It looks as though there are some odd edge cases in how we distribute the 
> copies in system.batches.
> The main issue is in the filter method for 
> org.apache.cassandra.batchlog.BatchlogManager
> {code:java}
>  if (validated.size() - validated.get(localRack).size() >= 2)
>  {
> // we have enough endpoints in other racks
> validated.removeAll(localRack);
>   }
>  if (validated.keySet().size() == 1)
>  {
>// we have only 1 `other` rack
>Collection<InetAddress> otherRack = 
> Iterables.getOnlyElement(validated.asMap().values());
>
> return Lists.newArrayList(Iterables.limit(otherRack, 2));
>  }
> {code}
> So with one or two racks we just return the first 2 entries in the list.  
> There's no shuffle or randomisation here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-12884) Batch logic can lead to unbalanced use of system.batches

2017-08-14 Thread Daniel Cranford (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16125710#comment-16125710
 ] 

Daniel Cranford commented on CASSANDRA-12884:
-

I had originally considered using subList to avoid creating a second ArrayList, 
but decided against it because the subList version throws an exception in the 
degenerate case where there is only 1 element in otherRack.

But now that I trace through the code, I think that otherRack is guaranteed to 
have at least 2 elements. If otherRack is the local rack and only has 1 
element, {{if(validated.size() <= 2)}} would have been true, and the filter() 
function would have already returned. If otherRack was the single non-local 
rack, and had size 1, then {{if(validated.size() - 
validated.get(localRack).size() >= 2)}} would be false and the whole 
single-other-rack block wouldn't run. It's probably worth a comment stating 
that otherRack is guaranteed to have at least 2 elements.

Looks good!
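
For reference, the degenerate case under discussion is easy to see in isolation 
(a standalone snippet, not the BatchlogManager code):
{code:java}
import java.util.Arrays;
import java.util.List;
import com.google.common.collect.Iterables;
import com.google.common.collect.Lists;

public class LimitVsSubList
{
    public static void main(String[] args)
    {
        List<String> oneEndpoint = Arrays.asList("replica-1");

        // Iterables.limit is forgiving: it simply yields the single element
        System.out.println(Lists.newArrayList(Iterables.limit(oneEndpoint, 2)));  // [replica-1]

        // subList(0, 2) is not: toIndex exceeds the list size
        try
        {
            System.out.println(oneEndpoint.subList(0, 2));
        }
        catch (IndexOutOfBoundsException e)
        {
            System.out.println("subList throws: " + e);
        }
    }
}
{code}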

> Batch logic can lead to unbalanced use of system.batches
> 
>
> Key: CASSANDRA-12884
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12884
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Adam Hattrell
>Assignee: Daniel Cranford
> Fix For: 3.0.x, 3.11.x
>
> Attachments: 0001-CASSANDRA-12884.patch
>
>
> It looks as though there are some odd edge cases in how we distribute the 
> copies in system.batches.
> The main issue is in the filter method for 
> org.apache.cassandra.batchlog.BatchlogManager
> {code:java}
>  if (validated.size() - validated.get(localRack).size() >= 2)
>  {
> // we have enough endpoints in other racks
> validated.removeAll(localRack);
>   }
>  if (validated.keySet().size() == 1)
>  {
>// we have only 1 `other` rack
>Collection<InetAddress> otherRack = 
> Iterables.getOnlyElement(validated.asMap().values());
>
> return Lists.newArrayList(Iterables.limit(otherRack, 2));
>  }
> {code}
> So with one or two racks we just return the first 2 entries in the list.  
> There's no shuffle or randomisation here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-12884) Batch logic can lead to unbalanced use of system.batches

2017-08-10 Thread Daniel Cranford (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122335#comment-16122335
 ] 

Daniel Cranford commented on CASSANDRA-12884:
-

Technically, if efficiency is key, we could implement something like a 
Durstenfeld/Knuth shuffle, eg https://stackoverflow.com/a/35278327
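
For the record, a partial Durstenfeld/Fisher-Yates shuffle only has to touch the 
first k slots when only k elements will be taken. A minimal standalone sketch 
(not a patch against BatchlogManager):
{code:java}
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class PartialShuffle
{
    /** Randomize only the first k positions of 'list': O(k) swaps instead of O(n). */
    static <T> void shuffleFirstK(List<T> list, int k)
    {
        ThreadLocalRandom random = ThreadLocalRandom.current();
        int n = list.size();
        for (int i = 0; i < Math.min(k, n - 1); i++)
        {
            // swap a uniformly chosen element from the not-yet-fixed tail [i, n) into slot i
            Collections.swap(list, i, i + random.nextInt(n - i));
        }
    }
}
{code}
Taking the first two elements afterwards gives a uniform sample of two distinct 
endpoints.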

> Batch logic can lead to unbalanced use of system.batches
> 
>
> Key: CASSANDRA-12884
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12884
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Adam Hattrell
>Assignee: Daniel Cranford
> Fix For: 3.0.x, 3.11.x
>
> Attachments: 0001-CASSANDRA-12884.patch
>
>
> It looks as though there are some odd edge cases in how we distribute the 
> copies in system.batches.
> The main issue is in the filter method for 
> org.apache.cassandra.batchlog.BatchlogManager
> {code:java}
>  if (validated.size() - validated.get(localRack).size() >= 2)
>  {
> // we have enough endpoints in other racks
> validated.removeAll(localRack);
>   }
>  if (validated.keySet().size() == 1)
>  {
>// we have only 1 `other` rack
>Collection<InetAddress> otherRack = 
> Iterables.getOnlyElement(validated.asMap().values());
>
> return Lists.newArrayList(Iterables.limit(otherRack, 2));
>  }
> {code}
> So with one or two racks we just return the first 2 entries in the list.  
> There's no shuffle or randomisation here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-12884) Batch logic can lead to unbalanced use of system.batches

2017-08-10 Thread Daniel Cranford (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122288#comment-16122288
 ] 

Daniel Cranford commented on CASSANDRA-12884:
-

1) BatchlogManager::shuffle is stubbed out so the unit test can provide a 
deterministic override. The unit test has been expanded to provide a test that 
catches this regression. (The existing code used the same pattern for 
getRandomInt, which is overridden to be non-random in the unit test.)

2) getRandomInt could return the same value twice (sampling with replacement) 
resulting in the same replica being chosen. The existing code uses the 
shuffle+take head pattern, eg in BatchlogManager.java line 545 
{{shuffle((List) racks);}} and line 550 {{for (String rack : 
Iterables.limit(racks, 2))}} to perform sampling without replacement.
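
A toy illustration of the difference between the two sampling approaches 
(standalone code, not the BatchlogManager implementation):
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SamplingComparison
{
    public static void main(String[] args)
    {
        List<String> endpoints = new ArrayList<>(Arrays.asList("a", "b", "c", "d"));
        Random random = new Random();

        // Sampling with replacement: two independent draws can land on the same endpoint
        String first = endpoints.get(random.nextInt(endpoints.size()));
        String second = endpoints.get(random.nextInt(endpoints.size()));   // may equal 'first'

        // Shuffle, then take the head: duplicates are impossible
        Collections.shuffle(endpoints, random);
        List<String> picked = endpoints.subList(0, 2);

        System.out.println(first + " " + second + " vs " + picked);
    }
}
{code}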


> Batch logic can lead to unbalanced use of system.batches
> 
>
> Key: CASSANDRA-12884
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12884
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Adam Hattrell
>Assignee: Daniel Cranford
> Fix For: 3.0.x, 3.11.x
>
> Attachments: 0001-CASSANDRA-12884.patch
>
>
> It looks as though there are some odd edge cases in how we distribute the 
> copies in system.batches.
> The main issue is in the filter method for 
> org.apache.cassandra.batchlog.BatchlogManager
> {code:java}
>  if (validated.size() - validated.get(localRack).size() >= 2)
>  {
> // we have enough endpoints in other racks
> validated.removeAll(localRack);
>   }
>  if (validated.keySet().size() == 1)
>  {
>// we have only 1 `other` rack
>Collection<InetAddress> otherRack = 
> Iterables.getOnlyElement(validated.asMap().values());
>
> return Lists.newArrayList(Iterables.limit(otherRack, 2));
>  }
> {code}
> So with one or two racks we just return the first 2 entries in the list.  
> There's no shuffle or randomisation here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-12884) Batch logic can lead to unbalanced use of system.batches

2017-08-09 Thread Daniel Cranford (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119980#comment-16119980
 ] 

Daniel Cranford commented on CASSANDRA-12884:
-

Same bug. Regression.

> Batch logic can lead to unbalanced use of system.batches
> 
>
> Key: CASSANDRA-12884
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12884
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Adam Hattrell
>Assignee: Joshua McKenzie
> Fix For: 3.0.x, 3.11.x
>
> Attachments: 0001-CASSANDRA-12884.patch
>
>
> It looks as though there are some odd edge cases in how we distribute the 
> copies in system.batches.
> The main issue is in the filter method for 
> org.apache.cassandra.batchlog.BatchlogManager
> {code:java}
>  if (validated.size() - validated.get(localRack).size() >= 2)
>  {
> // we have enough endpoints in other racks
> validated.removeAll(localRack);
>   }
>  if (validated.keySet().size() == 1)
>  {
>// we have only 1 `other` rack
>Collection<InetAddress> otherRack = 
> Iterables.getOnlyElement(validated.asMap().values());
>
> return Lists.newArrayList(Iterables.limit(otherRack, 2));
>  }
> {code}
> So with one or two racks we just return the first 2 entries in the list.  
> There's no shuffle or randomisation here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-12884) Batch logic can lead to unbalanced use of system.batches

2017-08-09 Thread Daniel Cranford (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119980#comment-16119980
 ] 

Daniel Cranford edited comment on CASSANDRA-12884 at 8/9/17 2:26 PM:
-

Same bug as CASSANDRA-8735. Regression.


was (Author: daniel.cranford):
Same bug. Regression.

> Batch logic can lead to unbalanced use of system.batches
> 
>
> Key: CASSANDRA-12884
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12884
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Adam Hattrell
>Assignee: Joshua McKenzie
> Fix For: 3.0.x, 3.11.x
>
> Attachments: 0001-CASSANDRA-12884.patch
>
>
> It looks as though there are some odd edge cases in how we distribute the 
> copies in system.batches.
> The main issue is in the filter method for 
> org.apache.cassandra.batchlog.BatchlogManager
> {code:java}
>  if (validated.size() - validated.get(localRack).size() >= 2)
>  {
> // we have enough endpoints in other racks
> validated.removeAll(localRack);
>   }
>  if (validated.keySet().size() == 1)
>  {
>// we have only 1 `other` rack
>Collection<InetAddress> otherRack = 
> Iterables.getOnlyElement(validated.asMap().values());
>
> return Lists.newArrayList(Iterables.limit(otherRack, 2));
>  }
> {code}
> So with one or two racks we just return the first 2 entries in the list.  
> There's no shuffle or randomisation here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-8735) Batch log replication is not randomized when there are only 2 racks

2017-08-09 Thread Daniel Cranford (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119979#comment-16119979
 ] 

Daniel Cranford commented on CASSANDRA-8735:


[~iamaleksey] Great, I didn't see any activity yet on CASSANDRA-12844, so I 
attached a patch there.

> Batch log replication is not randomized when there are only 2 racks
> ---
>
> Key: CASSANDRA-8735
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8735
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Yuki Morishita
>Assignee: Mihai Suteu
>Priority: Minor
> Fix For: 2.1.9, 2.2.1, 3.0 alpha 1
>
> Attachments: 8735-v2.patch, CASSANDRA-8735.patch
>
>
> Batch log replication is not randomized and the same 2 nodes can be picked up 
> when there are only 2 racks in the cluster.
> https://github.com/apache/cassandra/blob/cassandra-2.0.11/src/java/org/apache/cassandra/service/BatchlogEndpointSelector.java#L72-73



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-12884) Batch logic can lead to unbalanced use of system.batches

2017-08-09 Thread Daniel Cranford (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Cranford updated CASSANDRA-12884:

Attachment: 0001-CASSANDRA-12884.patch

Fix + improved unit tests.

> Batch logic can lead to unbalanced use of system.batches
> 
>
> Key: CASSANDRA-12884
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12884
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Adam Hattrell
>Assignee: Joshua McKenzie
> Fix For: 3.0.x, 3.11.x
>
> Attachments: 0001-CASSANDRA-12884.patch
>
>
> It looks as though there are some odd edge cases in how we distribute the 
> copies in system.batches.
> The main issue is in the filter method for 
> org.apache.cassandra.batchlog.BatchlogManager
> {code:java}
>  if (validated.size() - validated.get(localRack).size() >= 2)
>  {
> // we have enough endpoints in other racks
> validated.removeAll(localRack);
>   }
>  if (validated.keySet().size() == 1)
>  {
>// we have only 1 `other` rack
>Collection<InetAddress> otherRack = 
> Iterables.getOnlyElement(validated.asMap().values());
>
> return Lists.newArrayList(Iterables.limit(otherRack, 2));
>  }
> {code}
> So with one or two racks we just return the first 2 entries in the list.  
> There's no shuffle or randomisation here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-8735) Batch log replication is not randomized when there are only 2 racks

2017-08-04 Thread Daniel Cranford (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16114984#comment-16114984
 ] 

Daniel Cranford commented on CASSANDRA-8735:


Looks like it was reverted/overwritten (seemingly unintentionally) by the fix 
for CASSANDRA-7237

We've seen this in production with Cassandra 3.9. One or two rack DCs always 
select the same two hosts for batch log replication.

> Batch log replication is not randomized when there are only 2 racks
> ---
>
> Key: CASSANDRA-8735
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8735
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Yuki Morishita
>Assignee: Mihai Suteu
>Priority: Minor
> Fix For: 2.1.9, 2.2.1, 3.0 alpha 1
>
> Attachments: 8735-v2.patch, CASSANDRA-8735.patch
>
>
> Batch log replication is not randomized and the same 2 nodes can be picked up 
> when there are only 2 racks in the cluster.
> https://github.com/apache/cassandra/blob/cassandra-2.0.11/src/java/org/apache/cassandra/service/BatchlogEndpointSelector.java#L72-73



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org