[jira] [Commented] (CASSANDRA-16213) Cannot replace_address /X because it doesn't exist in gossip
[ https://issues.apache.org/jira/browse/CASSANDRA-16213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231934#comment-17231934 ] David Capwell commented on CASSANDRA-16213: --- Finished assassinate and made sure to flesh out the different cases I could see. org.apache.cassandra.gms.EndpointState#isEmpty does need to check for status in order for assassinate to work with this patch. If you stop all nodes and bring up all but the host to remove, then assassinate the node to remove, it will still be "empty" based off version, but will have a status. If we do not check the status when we check for empty, we would then treat this endpoint as normal and move on, which isn't correct as it's in the LEFT state. [~paulo] I added org.apache.cassandra.distributed.test.hostreplacement.AssassinatedEmptyNodeTest to flesh this case out if you want to take a closer look. EndpointState.isEmpty is only used in one spot now since we removed the filter, so I feel it's still best to check the state to make sure it is this specific case. > Cannot replace_address /X because it doesn't exist in gossip > > > Key: CASSANDRA-16213 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16213 > Project: Cassandra > Issue Type: Bug > Components: Cluster/Gossip, Cluster/Membership >Reporter: David Capwell >Assignee: David Capwell >Priority: Normal > Fix For: 4.0-beta > > > We see this exception around nodes crashing and trying to do a host > replacement; this error appears to be correlated with multiple node > failures. 
> A simplified case to trigger this is the following: > *) Have an N node cluster > *) Shut down all N nodes > *) Bring up N-1 nodes (at least 1 seed, else replace seed) > *) Host replace the N-1th node -> this will fail with the above > The reason this happens is that the N-1th node isn’t gossiping anymore, and > the existing nodes do not have its details in gossip (but have the details in > the peers table), so the host replacement fails as the node isn’t known in > gossip. > This affects all versions (tested 3.0 and trunk, assume 2.2 as well) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
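The isEmpty behavior described in the comment above can be sketched as follows. This is an illustrative stand-in, not the actual Cassandra class: the field names, the nested AppState enum, and the LEFT value are assumptions chosen to mirror the scenario in the comment.

```java
// Illustrative stand-in for the idea in the comment above: an endpoint can be
// "empty" by heartbeat version yet still carry a STATUS (e.g. LEFT after an
// assassinate), so an emptiness check based on version alone is not enough.
import java.util.EnumMap;
import java.util.Map;

public class EndpointStateSketch
{
    public enum AppState { STATUS, LOAD }       // hypothetical subset of gossip application states

    private final int heartbeatVersion;         // 0 stands in for "never generated any versions"
    private final Map<AppState, String> applicationState = new EnumMap<>(AppState.class);

    public EndpointStateSketch(int heartbeatVersion) { this.heartbeatVersion = heartbeatVersion; }

    public void addApplicationState(AppState state, String value) { applicationState.put(state, value); }

    // Version-only emptiness: what a pre-patch check amounts to.
    public boolean isEmptyByVersion() { return heartbeatVersion == 0; }

    // Patched emptiness: also require that no STATUS has been published,
    // otherwise an assassinated (LEFT) node would be treated as a normal empty one.
    public boolean isEmpty() { return isEmptyByVersion() && !applicationState.containsKey(AppState.STATUS); }
}
```

With this shape, a node that was assassinated while down reports isEmptyByVersion() as true but isEmpty() as false, so replacement logic would not treat a LEFT endpoint as a fresh one.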
[jira] [Commented] (CASSANDRA-16181) 4.0 Quality: Replication Test Audit
[ https://issues.apache.org/jira/browse/CASSANDRA-16181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231885#comment-17231885 ] Caleb Rackliffe commented on CASSANDRA-16181: - [~adelapena] I think I've gotten to the point of diminishing returns on the summary doc, so take a look when you have a chance. Aside from some minor filling out of the unit tests, I think the biggest thing I'd want to do outside of CASSANDRA-16262 is creating a more comprehensive upgrade test along the lines of what you did for the mixed mode read repair tests, "scenario modeled" after {{TestReplication}} in {{replication_test.py}}. It probably won't be too hard to dynamically execute local reads across a fairly small dataset to verify that things are being replicated to the right places. WDYT? > 4.0 Quality: Replication Test Audit > --- > > Key: CASSANDRA-16181 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16181 > Project: Cassandra > Issue Type: Task > Components: Test/unit >Reporter: Andres de la Peña >Assignee: Caleb Rackliffe >Priority: Normal > Fix For: 4.0 > > Time Spent: 10m > Remaining Estimate: 0h > > This is a subtask of CASSANDRA-15579 focusing on replication. > I think that the main reference dtest for this is > [replication_test.py|https://github.com/apache/cassandra-dtest/blob/master/replication_test.py]. > We should identify which other tests cover this and identify what should be > extended, similarly to what has been done with CASSANDRA-15977. > The doc > [here|https://docs.google.com/document/d/1yPbquhAALIkkTRMmyOv5cceD5N5sPFMB1O4iOd3O7FM/edit?usp=sharing] > describes the existing state of testing around replication.
[jira] [Updated] (CASSANDRA-16181) 4.0 Quality: Replication Test Audit
[ https://issues.apache.org/jira/browse/CASSANDRA-16181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Caleb Rackliffe updated CASSANDRA-16181: Description: This is a subtask of CASSANDRA-15579 focusing on replication. I think that the main reference dtest for this is [replication_test.py|https://github.com/apache/cassandra-dtest/blob/master/replication_test.py]. We should identify which other tests cover this and identify what should be extended, similarly to what has been done with CASSANDRA-15977. The doc [here|https://docs.google.com/document/d/1yPbquhAALIkkTRMmyOv5cceD5N5sPFMB1O4iOd3O7FM/edit?usp=sharing] describes the existing state of testing around replication. was: This is a subtask of CASSANDRA-15579 focusing on replication. I think that the main reference dtest for this is [replication_test.py|https://github.com/apache/cassandra-dtest/blob/master/replication_test.py]. We should identify which other tests cover this and identify what should be extended, similarly to what has been done with CASSANDRA-15977. The (WIP) doc [here|https://docs.google.com/document/d/1yPbquhAALIkkTRMmyOv5cceD5N5sPFMB1O4iOd3O7FM/edit?usp=sharing] describes the existing state of testing around replication.
[jira] [Updated] (CASSANDRA-16261) Prevent unbounded number of flushing tasks
[ https://issues.apache.org/jira/browse/CASSANDRA-16261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ekaterina Dimitrova updated CASSANDRA-16261: Test and Documentation Plan: https://issues.apache.org/jira/browse/CASSANDRA-16261?focusedCommentId=17231874=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17231874 Status: Patch Available (was: In Progress) > Prevent unbounded number of flushing tasks > -- > > Key: CASSANDRA-16261 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16261 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths >Reporter: Ekaterina Dimitrova >Assignee: Ekaterina Dimitrova >Priority: Normal > Fix For: 3.11.x, 4.0-beta4 > > > The cleaner thread is not prevented from queueing an unbounded number of > flushing tasks for memtables that are almost empty.
[jira] [Commented] (CASSANDRA-16261) Prevent unbounded number of flushing tasks
[ https://issues.apache.org/jira/browse/CASSANDRA-16261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231874#comment-17231874 ] Ekaterina Dimitrova commented on CASSANDRA-16261: - This patch puts a cap on the maximum number of flushing tasks that can be enqueued by the memtable cleaner thread. This will have the consequence of creating larger sstables if flushing cannot keep up, so we must choose the maximum number of pending tasks carefully. At the moment it is [configurable |https://github.com/ekaterinadimitrova2/cassandra/blob/CASSANDRA-16261-trunk/src/java/org/apache/cassandra/db/Memtable.java#L76] and set to twice the number of flush writers. When a memtable gets into the discarding state, all pending updates update both the used and reclaiming memory. [trunk|https://github.com/ekaterinadimitrova2/cassandra/pull/76/commits/6ffa1802a1cfb8420db8a253ae5312fcffddfd6a] | [JAVA8 CI |https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/506/workflows/4411817d-bd4d-449a-b28e-8f9616eaf1f4] | [JAVA 11 CI |https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/506/workflows/9646169b-26bd-4d5b-aa5b-ba7d6522786d] [3.11|https://github.com/ekaterinadimitrova2/cassandra/commit/1f0d02d0a8d04524a99574dc60c0f4b215520591] | [CI |https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/508/workflows/78295ba3-6a08-4e04-a39e-cf46a7913d02] No new test failures; one more test class was added - [MemtableCleanerThreadTest |https://github.com/ekaterinadimitrova2/cassandra/pull/76/commits/6ffa1802a1cfb8420db8a253ae5312fcffddfd6a#diff-ef05bf02f6f0b3ab7db707faeb3a6c0e69f67d330de07ca0876cccaf1ea9395fR44]. [~adelapena] do you mind reviewing it? 
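The cap described in the comment above can be sketched like this. It is a minimal illustration, not the actual patch: the names (MAX_PENDING, tryScheduleFlush) and the counter mechanics are assumptions; only the "twice the number of flush writers" sizing comes from the comment.

```java
// Sketch of bounding the cleaner's flush submissions: a flush may only be
// enqueued while fewer than MAX_PENDING tasks are outstanding, so churn on
// almost-empty memtables cannot queue an unbounded amount of work.
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedFlushSketch
{
    static final int FLUSH_WRITERS = 2;                 // assumed example value
    static final int MAX_PENDING = 2 * FLUSH_WRITERS;   // the comment's "twice the number of flush writers"

    private final AtomicInteger pending = new AtomicInteger();

    /** Returns true if a flush task was enqueued, false if the cap was hit. */
    public boolean tryScheduleFlush()
    {
        while (true)
        {
            int current = pending.get();
            if (current >= MAX_PENDING)
                return false;                           // cap reached: skip; the memtable keeps accumulating
            if (pending.compareAndSet(current, current + 1))
                return true;
        }
    }

    public void onFlushCompleted() { pending.decrementAndGet(); }

    public int pendingTasks() { return pending.get(); }
}
```

Skipping a flush when the cap is hit is exactly what produces larger sstables under backpressure, which is why the comment stresses choosing the maximum number of pending tasks carefully.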
[jira] [Assigned] (CASSANDRA-4938) CREATE INDEX can block for creation now that schema changes may be concurrent
[ https://issues.apache.org/jira/browse/CASSANDRA-4938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kirk True reassigned CASSANDRA-4938: Assignee: Kirk True > CREATE INDEX can block for creation now that schema changes may be concurrent > - > > Key: CASSANDRA-4938 > URL: https://issues.apache.org/jira/browse/CASSANDRA-4938 > Project: Cassandra > Issue Type: Improvement > Components: Feature/2i Index >Reporter: Krzysztof Cieslinski Cognitum >Assignee: Kirk True >Priority: Low > Labels: lhf > Fix For: 4.x > > > The response from the CREATE INDEX command comes back faster than the creation of the > secondary index. So the code below: > {code:xml} > CREATE INDEX ON tab(name); > SELECT * FROM tab WHERE name = 'Chris'; > {code} > doesn't return any rows (of course, in column family "tab" there are some > records with a "name" value of 'Chris') or any errors (I would expect > something like ??"Bad Request: No indexed columns present in by-columns > clause with Equal operator"??). > Inserting some timeout between those two commands resolves the problem, so: > {code:xml} > CREATE INDEX ON tab(name); > Sleep(timeout); // for a column family with 2000 rows the timeout had to be set > to ~1 second > SELECT * FROM tab WHERE name = 'Chris'; > {code} > will return all rows with values as specified. > I'm using a single node cluster.
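The fixed Sleep(timeout) workaround in the report above is brittle; polling with a deadline is a less fragile client-side pattern. The sketch below is a hypothetical helper, not a real driver API: the BooleanSupplier stands in for whatever mechanism reports that the secondary index finished building.

```java
// Sketch: rather than a fixed Sleep(timeout) after CREATE INDEX, poll a
// readiness check until a deadline passes. The supplier is a stand-in for an
// actual "is the index built yet?" query.
import java.util.function.BooleanSupplier;

public class AwaitIndexSketch
{
    public static boolean awaitIndexBuilt(BooleanSupplier indexIsBuilt, long timeoutMillis, long pollMillis)
    {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline)
        {
            if (indexIsBuilt.getAsBoolean())
                return true;                 // index ready: safe to run the indexed SELECT
            try { Thread.sleep(pollMillis); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); return false; }
        }
        return false;                        // caller decides whether to fail or keep waiting
    }
}
```

Compared with a fixed sleep, this returns as soon as the index is ready and degrades to a bounded wait (rather than a wrong answer) when the build is slower than expected.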
[jira] [Commented] (CASSANDRA-16181) 4.0 Quality: Replication Test Audit
[ https://issues.apache.org/jira/browse/CASSANDRA-16181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231864#comment-17231864 ] Caleb Rackliffe commented on CASSANDRA-16181: - [~adelapena] Created a PR [here|https://github.com/apache/cassandra/pull/821] to track the little things I've been tinkering w/ here and there. (No urgent need to review...)
[jira] [Commented] (CASSANDRA-16213) Cannot replace_address /X because it doesn't exist in gossip
[ https://issues.apache.org/jira/browse/CASSANDRA-16213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231832#comment-17231832 ] David Capwell commented on CASSANDRA-16213: --- [~paulo] added the test and rebased to latest trunk, as 2 recent commits impact this logic. I am going to run the tests in a loop to make sure they are not flaky; if they are, I will split the class files or change the bootstrap schema properties. The last thing on my plate is to validate assassinate; I forgot to do this earlier.
[jira] [Commented] (CASSANDRA-15158) Wait for schema agreement rather than in flight schema requests when bootstrapping
[ https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231814#comment-17231814 ] David Capwell commented on CASSANDRA-15158: --- Committed https://github.com/apache/cassandra/commit/7d6f9b94dd0d00bfd29374d7a645e650f451023d > Wait for schema agreement rather than in flight schema requests when > bootstrapping > -- > > Key: CASSANDRA-15158 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15158 > Project: Cassandra > Issue Type: Bug > Components: Cluster/Gossip, Cluster/Schema >Reporter: Vincent White >Assignee: Blake Eggleston >Priority: Normal > Fix For: 3.0.x, 3.11.x, 4.0-beta > > Time Spent: 10m > Remaining Estimate: 0h > > Currently when a node is bootstrapping we use a set of latches > (org.apache.cassandra.service.MigrationTask#inflightTasks) to keep track of > in-flight schema pull requests, and we don't proceed with > bootstrapping/streaming until all the latches are released (or we timeout > waiting for each one). One issue with this is that if we have a large schema, > or the retrieval of the schema from the other nodes was unexpectedly slow, > then we have no explicit check in place to ensure we have actually received a > schema before we proceed. > While it's possible to increase "migration_task_wait_in_seconds" to force the > node to wait on each latch longer, there are cases where this doesn't help > because the callbacks for the schema pull requests have expired off the > messaging service's callback map > (org.apache.cassandra.net.MessagingService#callbacks) after > request_timeout_in_ms (default 10 seconds) before the other nodes were able > to respond to the new node. > This patch checks for schema agreement between the bootstrapping node and the > rest of the live nodes before proceeding with bootstrapping. It also adds a > check to prevent the new node from flooding existing nodes with simultaneous > schema pull requests as can happen in large clusters. 
> Removing the latch system should also prevent new nodes in large clusters > getting stuck for extended amounts of time as they wait > `migration_task_wait_in_seconds` on each of the latches left orphaned by the > timed out callbacks. > > ||3.11|| > |[PoC|https://github.com/apache/cassandra/compare/cassandra-3.11...vincewhite:check_for_schema]| > |[dtest|https://github.com/apache/cassandra-dtest/compare/master...vincewhite:wait_for_schema_agreement]| >
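The "wait for schema agreement instead of per-request latches" approach described in the ticket can be sketched as follows. This is an illustrative simulation under assumed names: the node-to-schema-version map is a stand-in for gossip state, and the method name is hypothetical.

```java
// Sketch of the approach described above: instead of waiting on one latch per
// in-flight schema pull, poll until every live node reports the same schema
// version as the local node, or a deadline passes.
import java.util.Map;
import java.util.UUID;
import java.util.function.Supplier;

public class SchemaAgreementSketch
{
    public static boolean awaitSchemaAgreement(UUID localVersion,
                                               Supplier<Map<String, UUID>> liveNodeVersions,
                                               long timeoutMillis, long pollMillis)
    {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        do
        {
            // Agreement: every live node's reported schema version matches ours.
            boolean agreed = liveNodeVersions.get().values().stream()
                                             .allMatch(localVersion::equals);
            if (agreed)
                return true;          // safe to proceed with bootstrap/streaming
            try { Thread.sleep(pollMillis); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); return false; }
        }
        while (System.currentTimeMillis() < deadline);
        return false;                 // no agreement within the wait budget
    }
}
```

Unlike the latch scheme, this check cannot be defeated by callbacks expiring off the messaging service's callback map: it compares observed end state (schema versions) rather than waiting on the in-flight requests themselves.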
[jira] [Comment Edited] (CASSANDRA-15158) Wait for schema agreement rather than in flight schema requests when bootstrapping
[ https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231782#comment-17231782 ] David Capwell edited comment on CASSANDRA-15158 at 11/13/20, 8:38 PM: -- Starting commit CI Results: Yellow. 3.11 org.apache.cassandra.service.MigrationCoordinatorTest fails but passes locally; -trunk org.apache.cassandra.distributed.test.ring.BootstrapTest fails frequently due to schemas not present; added a commit which increases the timeout from 30s to 90s-; and other expected issues.
||Branch||Source||Circle CI||Jenkins||
|cassandra-3.0|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-15158-cassandra-3.0-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-15158-cassandra-3.0-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/200/]|
|cassandra-3.11|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-15158-cassandra-3.11-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-15158-cassandra-3.11-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/201/]|
|trunk|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-15158-trunk-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-15158-trunk-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/202/]|
[cassandra] branch cassandra-3.11 updated (8ffa79f -> 50d8245)
This is an automated email from the ASF dual-hosted git repository. dcapwell pushed a change to branch cassandra-3.11 in repository https://gitbox.apache.org/repos/asf/cassandra.git.
 from 8ffa79f  Merge branch 'cassandra-3.0' into cassandra-3.11
  new 17ebee3  CASSANDRA-15158 fixed SCHEMA_DELAY to use getSchemaDelay and no longer convert it from seconds to millis (since it's already millis)
  new 50d8245  Merge branch 'cassandra-3.0' into cassandra-3.11
The 2 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference.
Summary of changes:
 src/java/org/apache/cassandra/service/StorageService.java | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
[cassandra] branch trunk updated (94663c3 -> e4fac35)
This is an automated email from the ASF dual-hosted git repository. dcapwell pushed a change to branch trunk in repository https://gitbox.apache.org/repos/asf/cassandra.git.
 from 94663c3  Relax < check to <= for NodeToolGossipInfoTest
  new 17ebee3  CASSANDRA-15158 fixed SCHEMA_DELAY to use getSchemaDelay and no longer convert it from seconds to millis (since it's already millis)
  new 50d8245  Merge branch 'cassandra-3.0' into cassandra-3.11
  new e4fac35  Merge branch 'cassandra-3.11' into trunk
The 3 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference.
Summary of changes:
 .../apache/cassandra/config/CassandraRelevantProperties.java | 11 +++
 src/java/org/apache/cassandra/service/StorageService.java    | 12 +++-
 .../apache/cassandra/distributed/action/GossipHelper.java    |  7 ++-
 .../cassandra/distributed/test/ring/BootstrapTest.java       |  6 --
 4 files changed, 28 insertions(+), 8 deletions(-)
[cassandra] 01/01: Merge branch 'cassandra-3.0' into cassandra-3.11
This is an automated email from the ASF dual-hosted git repository. dcapwell pushed a commit to branch cassandra-3.11 in repository https://gitbox.apache.org/repos/asf/cassandra.git

commit 50d8245d76aa76747f8bd6ae3947d22e5a02d290
Merge: 8ffa79f 17ebee3
Author: David Capwell
AuthorDate: Fri Nov 13 12:36:31 2020 -0800

    Merge branch 'cassandra-3.0' into cassandra-3.11

 src/java/org/apache/cassandra/service/StorageService.java | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --cc src/java/org/apache/cassandra/service/StorageService.java
index 734f176,a72b530..eb13df1
--- a/src/java/org/apache/cassandra/service/StorageService.java
+++ b/src/java/org/apache/cassandra/service/StorageService.java
@@@ -149,7 -143,7 +149,7 @@@ public class StorageService extends Not
          String newdelay = System.getProperty("cassandra.schema_delay_ms");
          if (newdelay != null)
          {
--            logger.info("Overriding SCHEMA_DELAY to {}ms", newdelay);
++            logger.info("Overriding SCHEMA_DELAY_MILLIS to {}ms", newdelay);
              return Integer.parseInt(newdelay);
          }
          else
[cassandra] 01/01: Merge branch 'cassandra-3.11' into trunk
This is an automated email from the ASF dual-hosted git repository. dcapwell pushed a commit to branch trunk in repository https://gitbox.apache.org/repos/asf/cassandra.git

commit e4fac3582e0a9dda182313a3aa784be35d965f4e
Merge: 94663c3 50d8245
Author: David Capwell
AuthorDate: Fri Nov 13 12:37:46 2020 -0800

    Merge branch 'cassandra-3.11' into trunk

 .../apache/cassandra/config/CassandraRelevantProperties.java | 11 +++
 src/java/org/apache/cassandra/service/StorageService.java    | 12 +++-
 .../apache/cassandra/distributed/action/GossipHelper.java    |  7 ++-
 .../cassandra/distributed/test/ring/BootstrapTest.java       |  6 --
 4 files changed, 28 insertions(+), 8 deletions(-)

diff --cc src/java/org/apache/cassandra/config/CassandraRelevantProperties.java
index 881b7d9,000..7402aa1
mode 100644,00..100644
--- a/src/java/org/apache/cassandra/config/CassandraRelevantProperties.java
+++ b/src/java/org/apache/cassandra/config/CassandraRelevantProperties.java
@@@ -1,240 -1,0 +1,251 @@@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.cassandra.config;
+
+import org.apache.cassandra.exceptions.ConfigurationException;
+
+/** A class that extracts system properties for the cassandra node it runs within. */
+public enum CassandraRelevantProperties
+{
+    //base JVM properties
+    JAVA_HOME("java.home"),
+    CASSANDRA_PID_FILE ("cassandra-pidfile"),
+
+    /**
+     * Indicates the temporary directory used by the Java Virtual Machine (JVM)
+     * to create and store temporary files.
+     */
+    JAVA_IO_TMPDIR ("java.io.tmpdir"),
+
+    /**
+     * Path from which to load native libraries.
+     * Default is absolute path to lib directory.
+     */
+    JAVA_LIBRARY_PATH ("java.library.path"),
+
+    JAVA_SECURITY_EGD ("java.security.egd"),
+
+    /** Java Runtime Environment version */
+    JAVA_VERSION ("java.version"),
+
+    /** Java Virtual Machine implementation name */
+    JAVA_VM_NAME ("java.vm.name"),
+
+    /** Line separator ("\n" on UNIX). */
+    LINE_SEPARATOR ("line.separator"),
+
+    /** Java class path. */
+    JAVA_CLASS_PATH ("java.class.path"),
+
+    /** Operating system architecture. */
+    OS_ARCH ("os.arch"),
+
+    /** Operating system name. */
+    OS_NAME ("os.name"),
+
+    /** User's home directory. */
+    USER_HOME ("user.home"),
+
+    /** Platform word size sun.arch.data.model. Examples: "32", "64", "unknown"*/
+    SUN_ARCH_DATA_MODEL ("sun.arch.data.model"),
+
+    //JMX properties
+    /**
+     * The value of this property represents the host name string
+     * that should be associated with remote stubs for locally created remote objects,
+     * in order to allow clients to invoke methods on the remote object.
+     */
+    JAVA_RMI_SERVER_HOSTNAME ("java.rmi.server.hostname"),
+
+    /**
+     * If this value is true, object identifiers for remote objects exported by this VM will be generated by using
+     * a cryptographically secure random number generator. The default value is false.
+     */
+    JAVA_RMI_SERVER_RANDOM_ID ("java.rmi.server.randomIDs"),
+
+    /**
+     * This property indicates whether password authentication for remote monitoring is
+     * enabled. By default it is disabled - com.sun.management.jmxremote.authenticate
+     */
+    COM_SUN_MANAGEMENT_JMXREMOTE_AUTHENTICATE ("com.sun.management.jmxremote.authenticate"),
+
+    /**
+     * The port number to which the RMI connector will be bound - com.sun.management.jmxremote.rmi.port.
+     * An Integer object that represents the value of the second argument is returned
+     * if there is no port specified, if the port does not have the correct numeric format,
+     * or if the specified name is empty or null.
+     */
+    COM_SUN_MANAGEMENT_JMXREMOTE_RMI_PORT ("com.sun.management.jmxremote.rmi.port", "0"),
+
+    /** Cassandra jmx remote port */
+    CASSANDRA_JMX_REMOTE_PORT("cassandra.jmx.remote.port"),
+
+    /** This property indicates whether SSL is enabled for monitoring remotely. Default is set to false. */
+    COM_SUN_MANAGEMENT_JMXREMOTE_SSL
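The enum in the diff above wraps JVM system properties behind named constants. A minimal sketch of the pattern follows; the class name and the getString/getBoolean accessors are assumptions based on the visible snippet (the diff is truncated before the accessor methods), not the full Cassandra class.

```java
// Sketch of the CassandraRelevantProperties pattern shown in the diff: each
// enum constant names a system property, optionally with a default value, and
// exposes typed accessors instead of raw System.getProperty calls scattered
// around the codebase.
public enum RelevantPropertySketch
{
    JAVA_VERSION("java.version"),
    JMX_RMI_PORT("com.sun.management.jmxremote.rmi.port", "0");

    private final String key;
    private final String defaultValue;

    RelevantPropertySketch(String key)                      { this(key, null); }
    RelevantPropertySketch(String key, String defaultValue) { this.key = key; this.defaultValue = defaultValue; }

    public String getKey() { return key; }

    /** Returns the property value, falling back to the declared default. */
    public String getString()
    {
        String value = System.getProperty(key);
        return value == null ? defaultValue : value;
    }

    public boolean getBoolean() { return Boolean.parseBoolean(getString()); }
}
```

Centralizing the property names this way makes it possible to audit every JVM flag the node reads, which is presumably why the CASSANDRA-15158 commit routes its new schema-delay property through this class.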
[cassandra] branch cassandra-3.0 updated: CASSANDRA-15158 fixed SCHEMA_DELAY to use getSchemaDelay and no longer convert it from secones to millis (since its already millis)
This is an automated email from the ASF dual-hosted git repository.

dcapwell pushed a commit to branch cassandra-3.0
in repository https://gitbox.apache.org/repos/asf/cassandra.git

The following commit(s) were added to refs/heads/cassandra-3.0 by this push:
     new 17ebee3  CASSANDRA-15158 fixed SCHEMA_DELAY to use getSchemaDelay and no longer convert it from seconds to millis (since it's already millis)
17ebee3 is described below

commit 17ebee3186d1bfdee9a2b355cb8f139492d144e8
Author: David Capwell
AuthorDate: Fri Nov 13 11:18:55 2020 -0800

    CASSANDRA-15158 fixed SCHEMA_DELAY to use getSchemaDelay and no longer convert it from seconds to millis (since it's already millis)

    patch by David Capwell; reviewed by Blake Eggleston for CASSANDRA-15158
---
 src/java/org/apache/cassandra/service/StorageService.java | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/java/org/apache/cassandra/service/StorageService.java b/src/java/org/apache/cassandra/service/StorageService.java
index 3718e8c..a72b530 100644
--- a/src/java/org/apache/cassandra/service/StorageService.java
+++ b/src/java/org/apache/cassandra/service/StorageService.java
@@ -113,7 +113,7 @@ public class StorageService extends NotificationBroadcasterSupport implements IE
     private static final Logger logger = LoggerFactory.getLogger(StorageService.class);

     public static final int RING_DELAY = getRingDelay(); // delay after which we assume ring has stablized
-    public static final int SCHEMA_DELAY = getRingDelay(); // delay after which we assume ring has stablized
+    public static final int SCHEMA_DELAY_MILLIS = getSchemaDelay();

     private static final boolean REQUIRE_SCHEMAS = !Boolean.getBoolean("cassandra.skip_schema_check");
@@ -873,7 +873,7 @@ public class StorageService extends NotificationBroadcasterSupport implements IE
             Uninterruptibles.sleepUninterruptibly(1, TimeUnit.SECONDS);
         }

-        boolean schemasReceived = MigrationCoordinator.instance.awaitSchemaRequests(TimeUnit.SECONDS.toMillis(SCHEMA_DELAY));
+        boolean schemasReceived = MigrationCoordinator.instance.awaitSchemaRequests(SCHEMA_DELAY_MILLIS);

         if (schemasReceived)
             return;
---
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
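To make the fixed bug concrete: SCHEMA_DELAY already held a millisecond value, so passing it through TimeUnit.SECONDS.toMillis multiplied the wait by 1000, turning the intended 30-second schema wait into roughly 8h20m. A minimal standalone sketch of the arithmetic (the constant here is illustrative, not read from Cassandra's config):

```java
import java.util.concurrent.TimeUnit;

public class SchemaDelayBug {
    // Illustrative value: a 30 second schema wait, already stored in milliseconds.
    static final long SCHEMA_DELAY_MILLIS = 30_000;

    // Buggy path: treats the millisecond value as if it were seconds,
    // multiplying the wait by 1000.
    static long buggyWaitMillis() {
        return TimeUnit.SECONDS.toMillis(SCHEMA_DELAY_MILLIS);
    }

    // Fixed path: the value is already in milliseconds, so use it directly.
    static long fixedWaitMillis() {
        return SCHEMA_DELAY_MILLIS;
    }

    public static void main(String[] args) {
        long buggyMinutes = TimeUnit.MILLISECONDS.toMinutes(buggyWaitMillis());
        System.out.println("buggy wait: " + buggyMinutes + " minutes");  // 500 minutes, i.e. 8h20m
        System.out.println("fixed wait: " + TimeUnit.MILLISECONDS.toSeconds(fixedWaitMillis()) + " seconds");  // 30 seconds
    }
}
```

The 8h20m figure quoted later in this thread falls out directly: 30,000 ms misread as 30,000 s is 500 minutes.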
[jira] [Commented] (CASSANDRA-15158) Wait for schema agreement rather than in flight schema requests when bootstrapping
[ https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231810#comment-17231810 ] Blake Eggleston commented on CASSANDRA-15158: - Thanks David, +1 > Wait for schema agreement rather than in flight schema requests when > bootstrapping > -- > > Key: CASSANDRA-15158 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15158 > Project: Cassandra > Issue Type: Bug > Components: Cluster/Gossip, Cluster/Schema >Reporter: Vincent White >Assignee: Blake Eggleston >Priority: Normal > Fix For: 3.0.x, 3.11.x, 4.0-beta > > Time Spent: 10m > Remaining Estimate: 0h > > Currently when a node is bootstrapping we use a set of latches > (org.apache.cassandra.service.MigrationTask#inflightTasks) to keep track of > in-flight schema pull requests, and we don't proceed with > bootstrapping/stream until all the latches are released (or we timeout > waiting for each one). One issue with this is that if we have a large schema, > or the retrieval of the schema from the other nodes was unexpectedly slow > then we have no explicit check in place to ensure we have actually received a > schema before we proceed. > While it's possible to increase "migration_task_wait_in_seconds" to force the > node to wait on each latche longer, there are cases where this doesn't help > because the callbacks for the schema pull requests have expired off the > messaging service's callback map > (org.apache.cassandra.net.MessagingService#callbacks) after > request_timeout_in_ms (default 10 seconds) before the other nodes were able > to respond to the new node. > This patch checks for schema agreement between the bootstrapping node and the > rest of the live nodes before proceeding with bootstrapping. It also adds a > check to prevent the new node from flooding existing nodes with simultaneous > schema pull requests as can happen in large clusters. 
> Removing the latch system should also prevent new nodes in large clusters > getting stuck for extended amounts of time as they wait > `migration_task_wait_in_seconds` on each of the latches left orphaned by the > timed out callbacks. > > ||3.11|| > |[PoC|https://github.com/apache/cassandra/compare/cassandra-3.11...vincewhite:check_for_schema]| > |[dtest|https://github.com/apache/cassandra-dtest/compare/master...vincewhite:wait_for_schema_agreement]| > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
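The agreement-based wait described in the ticket can be sketched roughly as follows. This is an illustrative reimplementation, not Cassandra's actual MigrationCoordinator API: it polls the schema versions reported by live endpoints until they all match the local version or a deadline passes, instead of blocking on per-request latches that timed-out callbacks can leave orphaned.

```java
import java.util.Map;
import java.util.UUID;
import java.util.function.Supplier;

// Illustrative only: wait until every live endpoint reports the same schema
// version as the local node, polling rather than blocking on per-request
// latches (which can be orphaned when message callbacks time out).
public class SchemaAgreementWait {
    static boolean awaitAgreement(UUID localVersion,
                                  Supplier<Map<String, UUID>> liveEndpointVersions,
                                  long timeoutMillis,
                                  long pollMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            // Snapshot the versions currently reported by live endpoints.
            Map<String, UUID> versions = liveEndpointVersions.get();
            if (versions.values().stream().allMatch(localVersion::equals))
                return true; // every live node agrees with us
            if (System.currentTimeMillis() >= deadline)
                return false; // timed out without reaching agreement
            Thread.sleep(pollMillis);
        }
    }
}
```

Unlike the latch approach, a slow or re-sent schema pull doesn't matter here: only the observed end state (agreement) is checked.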
[jira] [Comment Edited] (CASSANDRA-15158) Wait for schema agreement rather than in flight schema requests when bootstrapping
[ https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231782#comment-17231782 ] David Capwell edited comment on CASSANDRA-15158 at 11/13/20, 8:20 PM:
--
Starting commit

CI Results: Yellow. On 3.1, org.apache.cassandra.service.MigrationCoordinatorTest fails in CI but passes locally; on trunk, org.apache.cassandra.distributed.test.ring.BootstrapTest fails frequently due to schemas not being present, so I added a commit which increases the timeout from 30s to 90s; the remaining failures are other expected issues.

||Branch||Source||Circle CI||Jenkins||
|cassandra-3.0|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-15158-cassandra-3.0-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-15158-cassandra-3.0-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/200/]|
|cassandra-3.11|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-15158-cassandra-3.11-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-15158-cassandra-3.11-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/201/]|
|trunk|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-15158-trunk-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-15158-trunk-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/202/]|

was (Author: dcapwell):
Starting commit

CI Results (pending):
||Branch||Source||Circle CI||Jenkins||
|cassandra-3.0|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-15158-cassandra-3.0-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-15158-cassandra-3.0-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/200/]| |cassandra-3.11|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-15158-cassandra-3.11-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-15158-cassandra-3.11-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/201/]| |trunk|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-15158-trunk-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-15158-trunk-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/202/]| > Wait for schema agreement rather than in flight schema requests when > bootstrapping > -- > > Key: CASSANDRA-15158 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15158 > Project: Cassandra > Issue Type: Bug > Components: Cluster/Gossip, Cluster/Schema >Reporter: Vincent White >Assignee: Blake Eggleston >Priority: Normal > Fix For: 3.0.x, 3.11.x, 4.0-beta > > Time Spent: 10m > Remaining Estimate: 0h > > Currently when a node is bootstrapping we use a set of latches > (org.apache.cassandra.service.MigrationTask#inflightTasks) to keep track of > in-flight schema pull requests, and we don't proceed with > bootstrapping/stream until all the latches are released (or we timeout > waiting for each one). 
One issue with this is that if we have a large schema, > or the retrieval of the schema from the other nodes was unexpectedly slow > then we have no explicit check in place to ensure we have actually received a > schema before we proceed. > While it's possible to increase "migration_task_wait_in_seconds" to force the > node to wait on each latche longer, there are cases where this doesn't help > because the callbacks for the schema pull requests have expired off the > messaging service's callback map > (org.apache.cassandra.net.MessagingService#callbacks) after > request_timeout_in_ms (default 10 seconds) before the other nodes were able > to respond to the new node. > This patch checks for schema agreement between the bootstrapping node and the > rest of the live nodes before proceeding with bootstrapping. It also adds a > check to prevent the new node from flooding existing nodes with simultaneous > schema pull requests as can happen in large clusters. > Removing the latch system
[jira] [Updated] (CASSANDRA-16262) 4.0 Quality: Coordination & Replication Fuzz Testing
[ https://issues.apache.org/jira/browse/CASSANDRA-16262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Caleb Rackliffe updated CASSANDRA-16262: Description: CASSANDRA-16180, CASSANDRA-16181, and CASSANDRA-15977 have largely focused on auditing the existing tests around coordination, replication, and read-repair, respectively. We've expanded existing test cases, added coverage around components that we've refactored along the way, and added in-JVM dtest upgrade tests where possible. What remains is verifying the distributed read and write paths in the face of common operational events, namely node restarts, bootstrapping, decommission, and cleanup. If we can find a way to simulate these events, [Harry|https://github.com/apache/cassandra-harry] seems like a good candidate to host the verification logic itself. To keep things simple initially, I would propose that we start by testing simple read-only and write-only workloads (the former without read repair). was: CASSANDRA-16180, CASSANDRA-16181, and CASSANDRA-15977 have largely focused on auditing the existing tests around coordination, replication, and read-repair, respectively. We've expanded existing test cases, added coverage around components that we've refactored along the way, and added in-JVM dtest upgrade tests where possible. What remains is verifying the distributed read and write paths in the face of common operational events, namely node restarts, bootstrapping, and decommission. If we can find a way to simulate these events, [Harry|https://github.com/apache/cassandra-harry] seems like a good candidate to host the verification logic itself. To keep things simple initially, I would propose that we start by testing simple read-only and write-only workloads (the former without read repair). 
> 4.0 Quality: Coordination & Replication Fuzz Testing > > > Key: CASSANDRA-16262 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16262 > Project: Cassandra > Issue Type: Task > Components: Test/fuzz >Reporter: Caleb Rackliffe >Priority: Normal > Fix For: 4.0-rc > > > CASSANDRA-16180, CASSANDRA-16181, and CASSANDRA-15977 have largely focused on > auditing the existing tests around coordination, replication, and > read-repair, respectively. We've expanded existing test cases, added coverage > around components that we've refactored along the way, and added in-JVM dtest > upgrade tests where possible. > What remains is verifying the distributed read and write paths in the face of > common operational events, namely node restarts, bootstrapping, decommission, > and cleanup. If we can find a way to simulate these events, > [Harry|https://github.com/apache/cassandra-harry] seems like a good candidate > to host the verification logic itself. > To keep things simple initially, I would propose that we start by testing > simple read-only and write-only workloads (the former without read repair).
[jira] [Updated] (CASSANDRA-16185) Add tests to cover CommitLog metrics
[ https://issues.apache.org/jira/browse/CASSANDRA-16185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam Holmberg updated CASSANDRA-16185: -- Status: Patch Available (was: Review In Progress) > Add tests to cover CommitLog metrics > > > Key: CASSANDRA-16185 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16185 > Project: Cassandra > Issue Type: Improvement > Components: Test/unit >Reporter: Benjamin Lerer >Assignee: Yasar Arafath Baigh >Priority: Normal > Fix For: 4.0-beta > > Attachments: 0001-Unit-Test-cases-for-CommitLogMetrics.patch > > > The only metrics that seems to be covered by unit test for the CommitLog > metrics is {{oversizedMutations}}. We should add testing the other ones.
[jira] [Commented] (CASSANDRA-15158) Wait for schema agreement rather than in flight schema requests when bootstrapping
[ https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231782#comment-17231782 ] David Capwell commented on CASSANDRA-15158: --- Starting commit CI Results (pending): ||Branch||Source||Circle CI||Jenkins|| |cassandra-3.0|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-15158-cassandra-3.0-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-15158-cassandra-3.0-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/200/]| |cassandra-3.11|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-15158-cassandra-3.11-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-15158-cassandra-3.11-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/201/]| |trunk|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-15158-trunk-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-15158-trunk-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/202/]| > Wait for schema agreement rather than in flight schema requests when > bootstrapping > -- > > Key: CASSANDRA-15158 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15158 > Project: Cassandra > Issue Type: Bug > Components: Cluster/Gossip, Cluster/Schema >Reporter: Vincent White >Assignee: Blake Eggleston >Priority: Normal > Fix For: 3.0.x, 3.11.x, 4.0-beta > > Time Spent: 10m > Remaining Estimate: 0h > > Currently when a node is bootstrapping we use a set of latches > (org.apache.cassandra.service.MigrationTask#inflightTasks) to keep track of > in-flight 
schema pull requests, and we don't proceed with > bootstrapping/stream until all the latches are released (or we timeout > waiting for each one). One issue with this is that if we have a large schema, > or the retrieval of the schema from the other nodes was unexpectedly slow > then we have no explicit check in place to ensure we have actually received a > schema before we proceed. > While it's possible to increase "migration_task_wait_in_seconds" to force the > node to wait on each latche longer, there are cases where this doesn't help > because the callbacks for the schema pull requests have expired off the > messaging service's callback map > (org.apache.cassandra.net.MessagingService#callbacks) after > request_timeout_in_ms (default 10 seconds) before the other nodes were able > to respond to the new node. > This patch checks for schema agreement between the bootstrapping node and the > rest of the live nodes before proceeding with bootstrapping. It also adds a > check to prevent the new node from flooding existing nodes with simultaneous > schema pull requests as can happen in large clusters. > Removing the latch system should also prevent new nodes in large clusters > getting stuck for extended amounts of time as they wait > `migration_task_wait_in_seconds` on each of the latches left orphaned by the > timed out callbacks. > > ||3.11|| > |[PoC|https://github.com/apache/cassandra/compare/cassandra-3.11...vincewhite:check_for_schema]| > |[dtest|https://github.com/apache/cassandra-dtest/compare/master...vincewhite:wait_for_schema_agreement]| >
[jira] [Commented] (CASSANDRA-15158) Wait for schema agreement rather than in flight schema requests when bootstrapping
[ https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231777#comment-17231777 ] David Capwell commented on CASSANDRA-15158:
---
Patches:
3.0: https://github.com/dcapwell/cassandra/tree/patchfix/CASSANDRA-15158-3.0
3.11: https://github.com/dcapwell/cassandra/tree/patchfix/CASSANDRA-15158-3.11
trunk: https://github.com/dcapwell/cassandra/tree/patchfix/CASSANDRA-15158-trunk

A test was added in CASSANDRA-16213 which shows this issue; it's only stable once this patch is applied (and is disabled while failing).

> Wait for schema agreement rather than in flight schema requests when > bootstrapping > -- > > Key: CASSANDRA-15158 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15158 > Project: Cassandra > Issue Type: Bug > Components: Cluster/Gossip, Cluster/Schema >Reporter: Vincent White >Assignee: Blake Eggleston >Priority: Normal > Fix For: 3.0.x, 3.11.x, 4.0-beta > > Time Spent: 10m > Remaining Estimate: 0h > > Currently when a node is bootstrapping we use a set of latches > (org.apache.cassandra.service.MigrationTask#inflightTasks) to keep track of > in-flight schema pull requests, and we don't proceed with > bootstrapping/stream until all the latches are released (or we timeout > waiting for each one). One issue with this is that if we have a large schema, > or the retrieval of the schema from the other nodes was unexpectedly slow > then we have no explicit check in place to ensure we have actually received a > schema before we proceed. > While it's possible to increase "migration_task_wait_in_seconds" to force the > node to wait on each latche longer, there are cases where this doesn't help > because the callbacks for the schema pull requests have expired off the > messaging service's callback map > (org.apache.cassandra.net.MessagingService#callbacks) after > request_timeout_in_ms (default 10 seconds) before the other nodes were able > to respond to the new node.
> This patch checks for schema agreement between the bootstrapping node and the > rest of the live nodes before proceeding with bootstrapping. It also adds a > check to prevent the new node from flooding existing nodes with simultaneous > schema pull requests as can happen in large clusters. > Removing the latch system should also prevent new nodes in large clusters > getting stuck for extended amounts of time as they wait > `migration_task_wait_in_seconds` on each of the latches left orphaned by the > timed out callbacks. > > ||3.11|| > |[PoC|https://github.com/apache/cassandra/compare/cassandra-3.11...vincewhite:check_for_schema]| > |[dtest|https://github.com/apache/cassandra-dtest/compare/master...vincewhite:wait_for_schema_agreement]| >
[jira] [Commented] (CASSANDRA-16213) Cannot replace_address /X because it doesn't exist in gossip
[ https://issues.apache.org/jira/browse/CASSANDRA-16213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231767#comment-17231767 ] David Capwell commented on CASSANDRA-16213: --- I plan to fix the schema wait logic in https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231757#comment-17231757 to keep this patch clean of it, but for the moment the logic is in this branch to get a stable test. > Cannot replace_address /X because it doesn't exist in gossip > > > Key: CASSANDRA-16213 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16213 > Project: Cassandra > Issue Type: Bug > Components: Cluster/Gossip, Cluster/Membership >Reporter: David Capwell >Assignee: David Capwell >Priority: Normal > Fix For: 4.0-beta > > > We see this exception around nodes crashing and trying to do a host > replacement; this error appears to be correlated around multiple node > failures. > A simplified case to trigger this is the following > *) Have a N node cluster > *) Shutdown all N nodes > *) Bring up N-1 nodes (at least 1 seed, else replace seed) > *) Host replace the N-1th node -> this will fail with the above > The reason this happens is that the N-1th node isn’t gossiping anymore, and > the existing nodes do not have its details in gossip (but have the details in > the peers table), so the host replacement fails as the node isn’t known in > gossip. > This affects all versions (tested 3.0 and trunk, assume 2.2 as well)
[jira] [Commented] (CASSANDRA-15158) Wait for schema agreement rather than in flight schema requests when bootstrapping
[ https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231757#comment-17231757 ] David Capwell commented on CASSANDRA-15158: --- Found a small set of typos which cause us to wait for schemas for 8h20m rather than 30s, going to submit a patch here and fix in all 3 branches... > Wait for schema agreement rather than in flight schema requests when > bootstrapping > -- > > Key: CASSANDRA-15158 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15158 > Project: Cassandra > Issue Type: Bug > Components: Cluster/Gossip, Cluster/Schema >Reporter: Vincent White >Assignee: Blake Eggleston >Priority: Normal > Fix For: 3.0.x, 3.11.x, 4.0-beta > > Time Spent: 10m > Remaining Estimate: 0h > > Currently when a node is bootstrapping we use a set of latches > (org.apache.cassandra.service.MigrationTask#inflightTasks) to keep track of > in-flight schema pull requests, and we don't proceed with > bootstrapping/stream until all the latches are released (or we timeout > waiting for each one). One issue with this is that if we have a large schema, > or the retrieval of the schema from the other nodes was unexpectedly slow > then we have no explicit check in place to ensure we have actually received a > schema before we proceed. > While it's possible to increase "migration_task_wait_in_seconds" to force the > node to wait on each latche longer, there are cases where this doesn't help > because the callbacks for the schema pull requests have expired off the > messaging service's callback map > (org.apache.cassandra.net.MessagingService#callbacks) after > request_timeout_in_ms (default 10 seconds) before the other nodes were able > to respond to the new node. > This patch checks for schema agreement between the bootstrapping node and the > rest of the live nodes before proceeding with bootstrapping. 
It also adds a > check to prevent the new node from flooding existing nodes with simultaneous > schema pull requests as can happen in large clusters. > Removing the latch system should also prevent new nodes in large clusters > getting stuck for extended amounts of time as they wait > `migration_task_wait_in_seconds` on each of the latches left orphaned by the > timed out callbacks. > > ||3.11|| > |[PoC|https://github.com/apache/cassandra/compare/cassandra-3.11...vincewhite:check_for_schema]| > |[dtest|https://github.com/apache/cassandra-dtest/compare/master...vincewhite:wait_for_schema_agreement]| >
[jira] [Comment Edited] (CASSANDRA-16213) Cannot replace_address /X because it doesn't exist in gossip
[ https://issues.apache.org/jira/browse/CASSANDRA-16213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231736#comment-17231736 ] David Capwell edited comment on CASSANDRA-16213 at 11/13/20, 6:30 PM:
--
Found the issue: it was caused by CASSANDRA-15158, which creates a config value in milliseconds and calls a delay which takes milliseconds, but converts the millis as if they were seconds, causing a much longer delay than expected.

Once I fixed that I hit the next issue: we now block waiting on schema, which will fail since the cluster has a downed node.

{code}
case SCHEMA:
    SystemKeyspace.updatePeerInfo(endpoint, "schema_version", UUID.fromString(value.value));
    MigrationCoordinator.instance.reportEndpointVersion(endpoint, UUID.fromString(value.value));
    break;
{code}

{code}
boolean schemasReceived = MigrationCoordinator.instance.awaitSchemaRequests(SCHEMA_DELAY_MILLIS);

if (schemasReceived)
    return;

logger.warn(String.format("There are nodes in the cluster with a different schema version than us we did not merged schemas from, " +
                          "our version : (%s), outstanding versions -> endpoints : %s",
                          Schema.instance.getVersion(),
                          MigrationCoordinator.instance.outstandingVersions()));

if (REQUIRE_SCHEMAS)
    throw new RuntimeException("Didn't receive schemas for all known versions within the timeout");
{code}

When we get the gossip info from the peers it will have node2 (the node that crashed abruptly), and we wait until we get its schema, but this won't happen since node2 is down and we are replacing it. This looks unrelated to this patch, but it is also a bad condition, as any schema change while a node is down will cause nodes to fail to start up...

was (Author: dcapwell):
Found the issue: it was caused by CASSANDRA-15158, which creates a config value in milliseconds and calls a delay which takes milliseconds, but converts the millis as if they were seconds, causing a much longer delay than expected.

Once I fixed that I hit the next issue: we now block waiting on schema, which will fail since the cluster has a downed node.

{code}
case SCHEMA:
    SystemKeyspace.updatePeerInfo(endpoint, "schema_version", UUID.fromString(value.value));
    MigrationCoordinator.instance.reportEndpointVersion(endpoint, UUID.fromString(value.value));
    break;
{code}

When we get the gossip info from the peers it will have node2 (the node that crashed abruptly), and we wait until we get its schema, but this won't happen since node2 is down and we are replacing it. This looks unrelated to this patch, but it is also a bad condition, as any schema change while a node is down will cause nodes to fail to start up...

> Cannot replace_address /X because it doesn't exist in gossip > > > Key: CASSANDRA-16213 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16213 > Project: Cassandra > Issue Type: Bug > Components: Cluster/Gossip, Cluster/Membership >Reporter: David Capwell >Assignee: David Capwell >Priority: Normal > Fix For: 4.0-beta > > > We see this exception around nodes crashing and trying to do a host > replacement; this error appears to be correlated around multiple node > failures. > A simplified case to trigger this is the following > *) Have a N node cluster > *) Shutdown all N nodes > *) Bring up N-1 nodes (at least 1 seed, else replace seed) > *) Host replace the N-1th node -> this will fail with the above > The reason this happens is that the N-1th node isn’t gossiping anymore, and > the existing nodes do not have its details in gossip (but have the details in > the peers table), so the host replacement fails as the node isn’t known in > gossip. > This affects all versions (tested 3.0 and trunk, assume 2.2 as well)
[jira] [Commented] (CASSANDRA-16213) Cannot replace_address /X because it doesn't exist in gossip
[ https://issues.apache.org/jira/browse/CASSANDRA-16213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231736#comment-17231736 ] David Capwell commented on CASSANDRA-16213:
---
Found the issue: it was caused by CASSANDRA-15158, which creates a config value in milliseconds and calls a delay which takes milliseconds, but converts the millis as if they were seconds, causing a much longer delay than expected.

Once I fixed that I hit the next issue: we now block waiting on schema, which will fail since the cluster has a downed node.

{code}
case SCHEMA:
    SystemKeyspace.updatePeerInfo(endpoint, "schema_version", UUID.fromString(value.value));
    MigrationCoordinator.instance.reportEndpointVersion(endpoint, UUID.fromString(value.value));
    break;
{code}

When we get the gossip info from the peers it will have node2 (the node that crashed abruptly), and we wait until we get its schema, but this won't happen since node2 is down and we are replacing it. This looks unrelated to this patch, but it is also a bad condition, as any schema change while a node is down will cause nodes to fail to start up...

> Cannot replace_address /X because it doesn't exist in gossip > > > Key: CASSANDRA-16213 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16213 > Project: Cassandra > Issue Type: Bug > Components: Cluster/Gossip, Cluster/Membership >Reporter: David Capwell >Assignee: David Capwell >Priority: Normal > Fix For: 4.0-beta > > > We see this exception around nodes crashing and trying to do a host > replacement; this error appears to be correlated around multiple node > failures.
> A simplified case to trigger this is the following > *) Have a N node cluster > *) Shutdown all N nodes > *) Bring up N-1 nodes (at least 1 seed, else replace seed) > *) Host replace the N-1th node -> this will fail with the above > The reason this happens is that the N-1th node isn’t gossiping anymore, and > the existing nodes do not have its details in gossip (but have the details in > the peers table), so the host replacement fails as the node isn’t known in > gossip. > This affects all versions (tested 3.0 and trunk, assume 2.2 as well)
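The failure mode quoted above amounts to a precondition on replace_address: the operation is rejected when the target has no gossip state, even if the peers table still knows the address. A simplified, hypothetical sketch of such a check (ReplaceCheck and its set arguments are illustrative, not Cassandra's real types):

```java
import java.util.Set;

// Hypothetical sketch of the precondition behind the error message: a
// replace_address attempt is rejected when the target address has no gossip
// state, even though the peers table may still contain it.
public class ReplaceCheck {
    static void validateReplace(String address, Set<String> gossipEndpoints, Set<String> peersTable) {
        if (!gossipEndpoints.contains(address))
            throw new RuntimeException("Cannot replace_address " + address
                                       + " because it doesn't exist in gossip"
                                       + (peersTable.contains(address) ? " (but it is still in the peers table)" : ""));
    }
}
```

In the scenario above, the N-1th node sits exactly in that gap: present in the peers table on the surviving nodes, absent from their gossip state.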
[jira] [Commented] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta
[ https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231691#comment-17231691 ] Sam Tunnicliffe commented on CASSANDRA-15299: - I found 1 dtest which broke when updating the driver to current master (e1fc528a0d), and opened CASSANDRA-16275 to fix it. > CASSANDRA-13304 follow-up: improve checksumming and compression in protocol > v5-beta > --- > > Key: CASSANDRA-15299 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15299 > Project: Cassandra > Issue Type: Improvement > Components: Messaging/Client >Reporter: Aleksey Yeschenko >Assignee: Sam Tunnicliffe >Priority: Normal > Labels: protocolv5 > Fix For: 4.0-alpha > > Attachments: Process CQL Frame.png, V5 Flow Chart.png > > > CASSANDRA-13304 made an important improvement to our native protocol: it > introduced checksumming/CRC32 to request and response bodies. It’s an > important step forward, but it doesn’t cover the entire stream. In > particular, the message header is not covered by a checksum or a crc, which > poses a correctness issue if, for example, {{streamId}} gets corrupted. > Additionally, we aren’t quite using CRC32 correctly, in two ways: > 1. We are calculating the CRC32 of the *decompressed* value instead of > computing the CRC32 on the bytes written on the wire - losing the properties > of the CRC32. In some cases, due to this sequencing, attempting to decompress > a corrupt stream can cause a segfault by LZ4. > 2. When using CRC32, the CRC32 value is written in the incorrect byte order, > also losing some of the protections. > See https://users.ece.cmu.edu/~koopman/pubs/KoopmanCRCWebinar9May2012.pdf for > explanation for the two points above. > Separately, there are some long-standing issues with the protocol - since > *way* before CASSANDRA-13304. Importantly, both checksumming and compression > operate on individual message bodies rather than frames of multiple complete > messages. 
In reality, this has several important additional downsides. To > name a couple: > # For compression, we are getting poor compression ratios for smaller > messages - when operating on tiny sequences of bytes. In reality, for most > small requests and responses we are discarding the compressed value as it’d > be smaller than the uncompressed one - incurring both redundant allocations > and compressions. > # For checksumming and CRC32 we pay a high overhead price for small messages. > 4 bytes extra is *a lot* for an empty write response, for example. > To address the correctness issue of {{streamId}} not being covered by the > checksum/CRC32 and the inefficiency in compression and checksumming/CRC32, we > should switch to a framing protocol with multiple messages in a single frame. > I suggest we reuse the framing protocol recently implemented for internode > messaging in CASSANDRA-15066 to the extent that its logic can be borrowed, > and that we do it before native protocol v5 graduates from beta. See > https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderCrc.java > and > https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderLZ4.java.
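The first checksumming point in the ticket (compute the CRC over the bytes actually written to the wire, not over the decompressed payload) can be illustrated with java.util.zip.CRC32. This is a simplified sketch, not the actual v5 frame layout: because the receiver verifies the CRC on the raw wire bytes first, corruption is caught before decompression can feed garbage to LZ4.

```java
import java.util.zip.CRC32;

// Simplified illustration: the sender computes the CRC over the exact bytes
// written to the wire, so the receiver can verify them before attempting any
// decompression.
public class WireCrc {
    static long crcOfWireBytes(byte[] wireBytes) {
        CRC32 crc = new CRC32();
        crc.update(wireBytes, 0, wireBytes.length);
        return crc.getValue();
    }

    static boolean verify(byte[] received, long expectedCrc) {
        return crcOfWireBytes(received) == expectedCrc;
    }
}
```

CRC32 detects any single corrupted byte in a message of the same length, which is exactly the property lost when the checksum is computed over the decompressed value instead.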
[jira] [Updated] (CASSANDRA-16275) Update python driver used by cassandra-dtest
[ https://issues.apache.org/jira/browse/CASSANDRA-16275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sam Tunnicliffe updated CASSANDRA-16275: Change Category: Quality Assurance Complexity: Low Hanging Fruit Assignee: Sam Tunnicliffe Status: Open (was: Triage Needed) > Update python driver used by cassandra-dtest > > > Key: CASSANDRA-16275 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16275 > Project: Cassandra > Issue Type: Task > Components: Test/dtest/python >Reporter: Sam Tunnicliffe >Assignee: Sam Tunnicliffe >Priority: Normal > > In order to commit CASSANDRA-15299, the python driver used by the dtests > needs to include PYTHON-1258, support for V5 framing. > Updating the python driver's cassandra-test branch to latest trunk causes 1 > additional dtest failure in > {{auth_test.py::TestAuth::test_handle_corrupt_role_data}} because the > {{ServerError}} response is now subject to the configured {{retry_policy}}. > This means the error ultimately returned from the driver is > {{NoHostAvailable}}, rather than {{ServerError}}. > I'll open a dtest pr to change the expectation in the test and we can commit > that when the cassandra-test branch is updated. > cc [~aholmber] [~aboudreault] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
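The behaviour change described in the ticket — a retried {{ServerError}} ultimately surfacing as {{NoHostAvailable}} — can be illustrated with a minimal sketch. The names and control flow below are hypothetical and greatly simplified; this is not the python driver's real retry machinery:

```python
# Hypothetical sketch: once ServerError responses go through the retry
# policy, each host that returns one is recorded as failed and the next host
# is tried; when every host is exhausted, the caller sees NoHostAvailable
# rather than the underlying ServerError.

class ServerError(Exception):
    pass

class NoHostAvailable(Exception):
    def __init__(self, errors):
        self.errors = errors
        super().__init__("no host was available: %s" % errors)

def execute_with_retry(hosts, query_one_host):
    errors = {}
    for host in hosts:
        try:
            return query_one_host(host)
        except ServerError as exc:
            errors[host] = exc  # record the failure, retry on the next host
    raise NoHostAvailable(errors)  # all hosts exhausted
```

This is why the dtest expectation has to change: every host returns the same corrupt-role-data error, so the retries exhaust the host list and the test observes {{NoHostAvailable}}.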
[jira] [Commented] (CASSANDRA-16275) Update python driver used by cassandra-dtest
[ https://issues.apache.org/jira/browse/CASSANDRA-16275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231689#comment-17231689 ] Sam Tunnicliffe commented on CASSANDRA-16275: - https://github.com/apache/cassandra-dtest/pull/103 > Update python driver used by cassandra-dtest > > > Key: CASSANDRA-16275 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16275 > Project: Cassandra > Issue Type: Task > Components: Test/dtest/python >Reporter: Sam Tunnicliffe >Priority: Normal > > In order to commit CASSANDRA-15299, the python driver used by the dtests > needs to include PYTHON-1258, support for V5 framing. > Updating the python driver's cassandra-test branch to latest trunk causes 1 > additional dtest failure in > {{auth_test.py::TestAuth::test_handle_corrupt_role_data}} because the > {{ServerError}} response is now subject to the configured {{retry_policy}}. > This means the error ultimately returned from the driver is > {{NoHostAvailable}}, rather than {{ServerError}}. > I'll open a dtest pr to change the expectation in the test and we can commit > that when the cassandra-test branch is updated. > cc [~aholmber] [~aboudreault] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[cassandra-dtest] branch 16275 deleted (was e3c4f69)
This is an automated email from the ASF dual-hosted git repository. samt pushed a change to branch 16275 in repository https://gitbox.apache.org/repos/asf/cassandra-dtest.git. was e3c4f69 Change expected response when role data is corrupt This change permanently discards the following revisions: discard e3c4f69 Change expected response when role data is corrupt - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[cassandra-dtest] branch 16275 created (now e3c4f69)
This is an automated email from the ASF dual-hosted git repository. samt pushed a change to branch 16275 in repository https://gitbox.apache.org/repos/asf/cassandra-dtest.git. at e3c4f69 Change expected response when role data is corrupt This branch includes the following new commits: new e3c4f69 Change expected response when role data is corrupt The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[cassandra-dtest] 01/01: Change expected response when role data is corrupt
This is an automated email from the ASF dual-hosted git repository. samt pushed a commit to branch 16275 in repository https://gitbox.apache.org/repos/asf/cassandra-dtest.git commit e3c4f695abccb3ac37bf8c08d2fa304029372b50 Author: Sam Tunnicliffe AuthorDate: Fri Nov 13 17:36:22 2020 + Change expected response when role data is corrupt --- auth_test.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/auth_test.py b/auth_test.py index 1061b17..6d92ff5 100644 --- a/auth_test.py +++ b/auth_test.py @@ -216,7 +216,7 @@ class TestAuth(Tester): self.fixture_dtest_setup.ignore_log_patterns = list(self.fixture_dtest_setup.ignore_log_patterns) + [ r'Invalid metadata has been detected for role bob'] -assert_exception(session, "LIST USERS", "Invalid metadata has been detected for role", expected=(ServerError)) +assert_exception(session, "LIST USERS", "Invalid metadata has been detected for role", expected=(NoHostAvailable)) try: self.get_session(user='bob', password='12345') except NoHostAvailable as e: - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-16275) Update python driver used by cassandra-dtest
Sam Tunnicliffe created CASSANDRA-16275: --- Summary: Update python driver used by cassandra-dtest Key: CASSANDRA-16275 URL: https://issues.apache.org/jira/browse/CASSANDRA-16275 Project: Cassandra Issue Type: Task Components: Test/dtest/python Reporter: Sam Tunnicliffe In order to commit CASSANDRA-15299, the python driver used by the dtests needs to include PYTHON-1258, support for V5 framing. Updating the python driver's cassandra-test branch to latest trunk causes 1 additional dtest failure in {{auth_test.py::TestAuth::test_handle_corrupt_role_data}} because the {{ServerError}} response is now subject to the configured {{retry_policy}}. This means the error ultimately returned from the driver is {{NoHostAvailable}}, rather than {{ServerError}}. I'll open a dtest pr to change the expectation in the test and we can commit that when the cassandra-test branch is updated. cc [~aholmber] [~aboudreault] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-16191) Add tests for Repair metrics
[ https://issues.apache.org/jira/browse/CASSANDRA-16191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam Holmberg updated CASSANDRA-16191: -- Component/s: (was: Test/dtest/java) Test/dtest/python > Add tests for Repair metrics > > > Key: CASSANDRA-16191 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16191 > Project: Cassandra > Issue Type: Improvement > Components: Test/dtest/python >Reporter: Benjamin Lerer >Assignee: Yasar Arafath Baigh >Priority: Normal > Fix For: 4.0-beta > > > We do not seem to have any tests for the {{RepairMetrics.previewFailures}} > counter. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-16247) Fix flaky test testTPStats - org.apache.cassandra.tools.NodeToolGossipInfoTest
[ https://issues.apache.org/jira/browse/CASSANDRA-16247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brandon Williams updated CASSANDRA-16247: - Fix Version/s: 4.0-beta4 Since Version: 4.0-beta2 Source Control Link: https://github.com/apache/cassandra/commit/94663c314a8a2c69a90cc64ac7e60344ba1c60ce Resolution: Fixed Status: Resolved (was: Ready to Commit) Committed w/rename nit. > Fix flaky test testTPStats - org.apache.cassandra.tools.NodeToolGossipInfoTest > -- > > Key: CASSANDRA-16247 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16247 > Project: Cassandra > Issue Type: Bug > Components: Test/unit >Reporter: David Capwell >Assignee: Brandon Williams >Priority: Normal > Fix For: 4.0-beta, 4.0-beta4 > > > https://app.circleci.com/pipelines/github/dcapwell/cassandra/764/workflows/6d7a6adc-59d1-4f3c-baae-1f8329dca9b7/jobs/4363 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.tools.NodeToolGossipInfoTest.testTPStats(NodeToolGossipInfoTest.java:128) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[cassandra] branch trunk updated: Relax < check to <= for NodeToolGossipInfoTest
This is an automated email from the ASF dual-hosted git repository. brandonwilliams pushed a commit to branch trunk in repository https://gitbox.apache.org/repos/asf/cassandra.git The following commit(s) were added to refs/heads/trunk by this push: new 94663c3 Relax < check to <= for NodeToolGossipInfoTest 94663c3 is described below commit 94663c314a8a2c69a90cc64ac7e60344ba1c60ce Author: Brandon Williams AuthorDate: Thu Nov 12 13:45:21 2020 -0600 Relax < check to <= for NodeToolGossipInfoTest Patch by brandonwilliams, reviewed by samt for CASSANDRA-16247 --- test/unit/org/apache/cassandra/tools/NodeToolGossipInfoTest.java | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/test/unit/org/apache/cassandra/tools/NodeToolGossipInfoTest.java b/test/unit/org/apache/cassandra/tools/NodeToolGossipInfoTest.java index e69f860..caca5ae 100644 --- a/test/unit/org/apache/cassandra/tools/NodeToolGossipInfoTest.java +++ b/test/unit/org/apache/cassandra/tools/NodeToolGossipInfoTest.java @@ -97,7 +97,7 @@ public class NodeToolGossipInfoTest extends CQLTester } @Test -public void testTPStats() throws Throwable +public void testGossipInfo() throws Throwable { ToolResult tool = ToolRunner.invokeNodetool("gossipinfo"); Assertions.assertThat(tool.getStdout()).contains("/127.0.0.1"); @@ -125,6 +125,6 @@ public class NodeToolGossipInfoTest extends CQLTester assertTrue(tool.getCleanedStderr().isEmpty()); assertEquals(0, tool.getExitCode()); String newHeartbeatCount = StringUtils.substringBetween(tool.getStdout(), "heartbeat:", "\n"); -assertTrue(Integer.parseInt(origHeartbeatCount) < Integer.parseInt(newHeartbeatCount)); +assertTrue(Integer.parseInt(origHeartbeatCount) <= Integer.parseInt(newHeartbeatCount)); } } - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-16247) Fix flaky test testTPStats - org.apache.cassandra.tools.NodeToolGossipInfoTest
[ https://issues.apache.org/jira/browse/CASSANDRA-16247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brandon Williams updated CASSANDRA-16247: - Status: Ready to Commit (was: Review In Progress) > Fix flaky test testTPStats - org.apache.cassandra.tools.NodeToolGossipInfoTest > -- > > Key: CASSANDRA-16247 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16247 > Project: Cassandra > Issue Type: Bug > Components: Test/unit >Reporter: David Capwell >Assignee: Brandon Williams >Priority: Normal > Fix For: 4.0-beta > > > https://app.circleci.com/pipelines/github/dcapwell/cassandra/764/workflows/6d7a6adc-59d1-4f3c-baae-1f8329dca9b7/jobs/4363 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.tools.NodeToolGossipInfoTest.testTPStats(NodeToolGossipInfoTest.java:128) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-16247) Fix flaky test testTPStats - org.apache.cassandra.tools.NodeToolGossipInfoTest
[ https://issues.apache.org/jira/browse/CASSANDRA-16247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231626#comment-17231626 ] Sam Tunnicliffe commented on CASSANDRA-16247: - +1 > Fix flaky test testTPStats - org.apache.cassandra.tools.NodeToolGossipInfoTest > -- > > Key: CASSANDRA-16247 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16247 > Project: Cassandra > Issue Type: Bug > Components: Test/unit >Reporter: David Capwell >Assignee: Brandon Williams >Priority: Normal > Fix For: 4.0-beta > > > https://app.circleci.com/pipelines/github/dcapwell/cassandra/764/workflows/6d7a6adc-59d1-4f3c-baae-1f8329dca9b7/jobs/4363 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.tools.NodeToolGossipInfoTest.testTPStats(NodeToolGossipInfoTest.java:128) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-16247) Fix flaky test testTPStats - org.apache.cassandra.tools.NodeToolGossipInfoTest
[ https://issues.apache.org/jira/browse/CASSANDRA-16247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brandon Williams updated CASSANDRA-16247: - Reviewers: Sam Tunnicliffe (was: Brandon Williams, Sam Tunnicliffe) > Fix flaky test testTPStats - org.apache.cassandra.tools.NodeToolGossipInfoTest > -- > > Key: CASSANDRA-16247 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16247 > Project: Cassandra > Issue Type: Bug > Components: Test/unit >Reporter: David Capwell >Assignee: Brandon Williams >Priority: Normal > Fix For: 4.0-beta > > > https://app.circleci.com/pipelines/github/dcapwell/cassandra/764/workflows/6d7a6adc-59d1-4f3c-baae-1f8329dca9b7/jobs/4363 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.tools.NodeToolGossipInfoTest.testTPStats(NodeToolGossipInfoTest.java:128) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-16247) Fix flaky test testTPStats - org.apache.cassandra.tools.NodeToolGossipInfoTest
[ https://issues.apache.org/jira/browse/CASSANDRA-16247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brandon Williams updated CASSANDRA-16247: - Test and Documentation Plan: https://ci-cassandra.apache.org/job/Cassandra-devbranch/199/ Status: Patch Available (was: In Progress) Patch to compare <= instead of <, since that more accurately reflects how the heartbeat increments relative to query timing. > Fix flaky test testTPStats - org.apache.cassandra.tools.NodeToolGossipInfoTest > -- > > Key: CASSANDRA-16247 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16247 > Project: Cassandra > Issue Type: Bug > Components: Test/unit >Reporter: David Capwell >Assignee: Brandon Williams >Priority: Normal > Fix For: 4.0-beta > > > https://app.circleci.com/pipelines/github/dcapwell/cassandra/764/workflows/6d7a6adc-59d1-4f3c-baae-1f8329dca9b7/jobs/4363 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.tools.NodeToolGossipInfoTest.testTPStats(NodeToolGossipInfoTest.java:128) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
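The rationale for relaxing the assertion can be sketched with a toy model: the heartbeat is a monotonically non-decreasing counter that only advances when a gossip round runs, so two reads taken inside the same round are legitimately equal. The class below is an assumed simplification, not Cassandra's {{HeartBeatState}}:

```python
# Toy model (assumed behaviour, not Cassandra gossip code): the heartbeat
# version advances once per gossip round. Two nodetool invocations that land
# inside the same round read equal values, so asserting a strict '<' between
# them is flaky; '<=' is the invariant that always holds.

class Heartbeat:
    def __init__(self):
        self._version = 0

    def gossip_round(self):
        self._version += 1  # one increment per round

    def current(self):
        return self._version
```

Reading twice without an intervening round gives equal values; with a round in between, the second read is strictly greater. Both cases satisfy <=, which is why the test comparison was relaxed.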
[jira] [Updated] (CASSANDRA-16247) Fix flaky test testTPStats - org.apache.cassandra.tools.NodeToolGossipInfoTest
[ https://issues.apache.org/jira/browse/CASSANDRA-16247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brandon Williams updated CASSANDRA-16247: - Reviewers: Sam Tunnicliffe, Brandon Williams (was: Brandon Williams, Sam Tunnicliffe) Sam Tunnicliffe, Brandon Williams Status: Review In Progress (was: Patch Available) > Fix flaky test testTPStats - org.apache.cassandra.tools.NodeToolGossipInfoTest > -- > > Key: CASSANDRA-16247 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16247 > Project: Cassandra > Issue Type: Bug > Components: Test/unit >Reporter: David Capwell >Assignee: Brandon Williams >Priority: Normal > Fix For: 4.0-beta > > > https://app.circleci.com/pipelines/github/dcapwell/cassandra/764/workflows/6d7a6adc-59d1-4f3c-baae-1f8329dca9b7/jobs/4363 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.tools.NodeToolGossipInfoTest.testTPStats(NodeToolGossipInfoTest.java:128) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-16185) Add tests to cover CommitLog metrics
[ https://issues.apache.org/jira/browse/CASSANDRA-16185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yasar Arafath Baigh updated CASSANDRA-16185: Reviewers: Adam Holmberg Status: Review In Progress (was: Patch Available) > Add tests to cover CommitLog metrics > > > Key: CASSANDRA-16185 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16185 > Project: Cassandra > Issue Type: Improvement > Components: Test/unit >Reporter: Benjamin Lerer >Assignee: Yasar Arafath Baigh >Priority: Normal > Fix For: 4.0-beta > > Attachments: 0001-Unit-Test-cases-for-CommitLogMetrics.patch > > > The only metrics that seems to be covered by unit test for the CommitLog > metrics is {{oversizedMutations}}. We should add testing the other ones. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-16185) Add tests to cover CommitLog metrics
[ https://issues.apache.org/jira/browse/CASSANDRA-16185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yasar Arafath Baigh updated CASSANDRA-16185: Attachment: 0001-Unit-Test-cases-for-CommitLogMetrics.patch > Add tests to cover CommitLog metrics > > > Key: CASSANDRA-16185 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16185 > Project: Cassandra > Issue Type: Improvement > Components: Test/unit >Reporter: Benjamin Lerer >Assignee: Yasar Arafath Baigh >Priority: Normal > Fix For: 4.0-beta > > Attachments: 0001-Unit-Test-cases-for-CommitLogMetrics.patch > > > The only metrics that seems to be covered by unit test for the CommitLog > metrics is {{oversizedMutations}}. We should add testing the other ones. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-16185) Add tests to cover CommitLog metrics
[ https://issues.apache.org/jira/browse/CASSANDRA-16185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yasar Arafath Baigh updated CASSANDRA-16185: Attachment: (was: 0001-Unit-Test-cases-for-CommitLogMetrics.patch) > Add tests to cover CommitLog metrics > > > Key: CASSANDRA-16185 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16185 > Project: Cassandra > Issue Type: Improvement > Components: Test/unit >Reporter: Benjamin Lerer >Assignee: Yasar Arafath Baigh >Priority: Normal > Fix For: 4.0-beta > > Attachments: 0001-Unit-Test-cases-for-CommitLogMetrics.patch > > > The only metrics that seems to be covered by unit test for the CommitLog > metrics is {{oversizedMutations}}. We should add testing the other ones. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-16185) Add tests to cover CommitLog metrics
[ https://issues.apache.org/jira/browse/CASSANDRA-16185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yasar Arafath Baigh updated CASSANDRA-16185: Test and Documentation Plan: CommitLogMetrics Test-Cases 1. For the metrics below, test-cases are added in +*CommitLogTest.java*+. The test-cases add mutations, which internally update the metrics. * *completedTasks* * *totalCommitLogSize* * *waitingOnCommit* 2. *pendingTasks* - A testPendingTasks test-case is added for +*AbstractCommitLogService.java*+. Since the pendingTasks metric is incremented and decremented within the single method *AbstractCommitLogService::maybeWaitForSync*, a dummy method incrementPendingTaks was introduced in the *FakeCommitLogService* class to update the pendingTasks metric manually. 3. *waitingOnSegmentAllocation* - A test-case is added in +*CommitLogMetricsTest.java*+. The test changes *commitlog_segment_size_in_mb* to *1mb* and adds multiple mutations so that the waitingOnSegmentAllocation metric is updated while new segments are created. In the normal case waitingOnSegmentAllocation will be zero; if it was not updated during test execution, it is updated manually. was: CommitLogMetrics Test-Cases 1. For the metrics below, test-cases are added in +*CommitLogTest.java*+. The test-cases add mutations, which internally update the metrics. * *completedTasks* * *totalCommitLogSize* * *waitingOnCommit* 2. *pendingTasks* - A testPendingTasks test-case is added for +*AbstractCommitLogService.java*+. Since the pendingTasks metric is incremented and decremented within the single method *AbstractCommitLogService::maybeWaitForSync*, a dummy method incrementPendingTaks was introduced in the *FakeCommitLogService* class to update the pendingTasks metric manually. 3. *waitingOnSegmentAllocation* - A test-case is added in *CommitLogMetricsTest.java*. 
The test changes *commitlog_segment_size_in_mb* to *1mb* and adds multiple mutations so that the waitingOnSegmentAllocation metric is updated while new segments are created. In the normal case waitingOnSegmentAllocation will be zero; if it was not updated during test execution, it is updated manually. > Add tests to cover CommitLog metrics > > > Key: CASSANDRA-16185 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16185 > Project: Cassandra > Issue Type: Improvement > Components: Test/unit >Reporter: Benjamin Lerer >Assignee: Yasar Arafath Baigh >Priority: Normal > Fix For: 4.0-beta > > Attachments: 0001-Unit-Test-cases-for-CommitLogMetrics.patch > > > The only metrics that seems to be covered by unit test for the CommitLog > metrics is {{oversizedMutations}}. We should add testing the other ones. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-16185) Add tests to cover CommitLog metrics
[ https://issues.apache.org/jira/browse/CASSANDRA-16185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yasar Arafath Baigh updated CASSANDRA-16185: Test and Documentation Plan: CommitLogMetrics Test-Cases 1. For the metrics below, test-cases are added in +*CommitLogTest.java*+. The test-cases add mutations, which internally update the metrics. * *completedTasks* * *totalCommitLogSize* * *waitingOnCommit* 2. *pendingTasks* - A testPendingTasks test-case is added for +*AbstractCommitLogService.java*+. Since the pendingTasks metric is incremented and decremented within the single method *AbstractCommitLogService::maybeWaitForSync*, a dummy method incrementPendingTaks was introduced in the *FakeCommitLogService* class to update the pendingTasks metric manually. 3. *waitingOnSegmentAllocation* - A test-case is added in *CommitLogMetricsTest.java*. The test changes *commitlog_segment_size_in_mb* to *1mb* and adds multiple mutations so that the waitingOnSegmentAllocation metric is updated while new segments are created. In the normal case waitingOnSegmentAllocation will be zero; if it was not updated during test execution, it is updated manually. Status: Patch Available (was: In Progress) > Add tests to cover CommitLog metrics > > > Key: CASSANDRA-16185 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16185 > Project: Cassandra > Issue Type: Improvement > Components: Test/unit >Reporter: Benjamin Lerer >Assignee: Yasar Arafath Baigh >Priority: Normal > Fix For: 4.0-beta > > Attachments: 0001-Unit-Test-cases-for-CommitLogMetrics.patch > > > The only metrics that seems to be covered by unit test for the CommitLog > metrics is {{oversizedMutations}}. We should add testing the other ones. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
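The test pattern the plan describes — drive mutations through the code path, then assert that the counter moved — reduces to a simple shape. The sketch below uses made-up names and is not the actual CommitLogTest code; it only illustrates asserting on a metric's delta rather than an absolute value:

```python
# Generic sketch of the metric-testing pattern (hypothetical names): capture
# the metric before, exercise the code path that should update it, and
# assert on the delta so the test is independent of prior activity.

class CommitLogMetrics:
    def __init__(self):
        self.completed_tasks = 0

    def mark_completed(self):
        self.completed_tasks += 1

def apply_mutation(metrics):
    # stand-in for writing one mutation through the commit log
    metrics.mark_completed()

def test_completed_tasks():
    metrics = CommitLogMetrics()
    before = metrics.completed_tasks
    for _ in range(3):
        apply_mutation(metrics)
    assert metrics.completed_tasks == before + 3
```

Asserting on the delta is what makes such tests robust when the metric is also updated by setup code.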
[jira] [Commented] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta
[ https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231559#comment-17231559 ] Sam Tunnicliffe commented on CASSANDRA-15299: - [~aholmber] that's great, thanks! I'll make sure everything's passing against trunk with the latest driver and get back to you. > CASSANDRA-13304 follow-up: improve checksumming and compression in protocol > v5-beta > --- > > Key: CASSANDRA-15299 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15299 > Project: Cassandra > Issue Type: Improvement > Components: Messaging/Client >Reporter: Aleksey Yeschenko >Assignee: Sam Tunnicliffe >Priority: Normal > Labels: protocolv5 > Fix For: 4.0-alpha > > Attachments: Process CQL Frame.png, V5 Flow Chart.png > > > CASSANDRA-13304 made an important improvement to our native protocol: it > introduced checksumming/CRC32 to request and response bodies. It’s an > important step forward, but it doesn’t cover the entire stream. In > particular, the message header is not covered by a checksum or a crc, which > poses a correctness issue if, for example, {{streamId}} gets corrupted. > Additionally, we aren’t quite using CRC32 correctly, in two ways: > 1. We are calculating the CRC32 of the *decompressed* value instead of > computing the CRC32 on the bytes written on the wire - losing the properties > of the CRC32. In some cases, due to this sequencing, attempting to decompress > a corrupt stream can cause a segfault by LZ4. > 2. When using CRC32, the CRC32 value is written in the incorrect byte order, > also losing some of the protections. > See https://users.ece.cmu.edu/~koopman/pubs/KoopmanCRCWebinar9May2012.pdf for > explanation for the two points above. > Separately, there are some long-standing issues with the protocol - since > *way* before CASSANDRA-13304. Importantly, both checksumming and compression > operate on individual message bodies rather than frames of multiple complete > messages. 
In reality, this has several important additional downsides. To > name a couple: > # For compression, we are getting poor compression ratios for smaller > messages - when operating on tiny sequences of bytes. In reality, for most > small requests and responses we are discarding the compressed value as it’d > be smaller than the uncompressed one - incurring both redundant allocations > and compressions. > # For checksumming and CRC32 we pay a high overhead price for small messages. > 4 bytes extra is *a lot* for an empty write response, for example. > To address the correctness issue of {{streamId}} not being covered by the > checksum/CRC32 and the inefficiency in compression and checksumming/CRC32, we > should switch to a framing protocol with multiple messages in a single frame. > I suggest we reuse the framing protocol recently implemented for internode > messaging in CASSANDRA-15066 to the extent that its logic can be borrowed, > and that we do it before native protocol v5 graduates from beta. See > https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderCrc.java > and > https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderLZ4.java. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta
[ https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231545#comment-17231545 ] Adam Holmberg edited comment on CASSANDRA-15299 at 11/13/20, 2:57 PM: -- {quote}This is the thing I'm unsure about - the python driver is not an asf project (yet), so it's not really up to me/us whether we update that branch ... TBH, I don't recall the full reasoning for using the cassandra-test branch in the first place ...{quote} The {{cassandra-test}} branch was created explicitly to allow server-side testing of merged client-impacting features, independent of driver releases. It was born of a time when there were multiple client-impacting changes in flight so a single commit would not suffice. Normally we would coordinate a merge into that driver branch with a PR for the server, so CI could be tested with it. Updating the branch should not be an issue. I, or [~aboudreault] can facilitate. was (Author: aholmber): {quote}This is the thing I'm unsure about - the python driver is not an asf project (yet), so it's not really up to me/us whether we update that branch ... TBH, I don't recall the full reasoning for using the cassandra-test branch in the first place ...{quote} The {{cassandra-test}} branch was created explicitly to allow server-side testing of merged client-impacting features, independent of driver releases. It was born of a time when there were multiple client-impacting changes in-flight so a single commit would not suffice. Normally we would coordinate a merge into that driver branch with a PR for the server, so CI could be tested with it. Updating the branch should not be an issue. I, or [~aboudreault] can facilitate. 
> CASSANDRA-13304 follow-up: improve checksumming and compression in protocol > v5-beta > --- > > Key: CASSANDRA-15299 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15299 > Project: Cassandra > Issue Type: Improvement > Components: Messaging/Client >Reporter: Aleksey Yeschenko >Assignee: Sam Tunnicliffe >Priority: Normal > Labels: protocolv5 > Fix For: 4.0-alpha > > Attachments: Process CQL Frame.png, V5 Flow Chart.png > > > CASSANDRA-13304 made an important improvement to our native protocol: it > introduced checksumming/CRC32 to request and response bodies. It’s an > important step forward, but it doesn’t cover the entire stream. In > particular, the message header is not covered by a checksum or a crc, which > poses a correctness issue if, for example, {{streamId}} gets corrupted. > Additionally, we aren’t quite using CRC32 correctly, in two ways: > 1. We are calculating the CRC32 of the *decompressed* value instead of > computing the CRC32 on the bytes written on the wire - losing the properties > of the CRC32. In some cases, due to this sequencing, attempting to decompress > a corrupt stream can cause a segfault by LZ4. > 2. When using CRC32, the CRC32 value is written in the incorrect byte order, > also losing some of the protections. > See https://users.ece.cmu.edu/~koopman/pubs/KoopmanCRCWebinar9May2012.pdf for > explanation for the two points above. > Separately, there are some long-standing issues with the protocol - since > *way* before CASSANDRA-13304. Importantly, both checksumming and compression > operate on individual message bodies rather than frames of multiple complete > messages. In reality, this has several important additional downsides. To > name a couple: > # For compression, we are getting poor compression ratios for smaller > messages - when operating on tiny sequences of bytes. 
In reality, for most > small requests and responses we are discarding the compressed value as it’d > be smaller than the uncompressed one - incurring both redundant allocations > and compressions. > # For checksumming and CRC32 we pay a high overhead price for small messages. > 4 bytes extra is *a lot* for an empty write response, for example. > To address the correctness issue of {{streamId}} not being covered by the > checksum/CRC32 and the inefficiency in compression and checksumming/CRC32, we > should switch to a framing protocol with multiple messages in a single frame. > I suggest we reuse the framing protocol recently implemented for internode > messaging in CASSANDRA-15066 to the extent that its logic can be borrowed, > and that we do it before native protocol v5 graduates from beta. See > https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderCrc.java > and > https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderLZ4.java. -- This message was sent by Atlassian Jira (v8.3.4#803005)
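The two CRC32 problems called out in the description can be made concrete with a minimal, self-contained sketch. This is not Cassandra's actual FrameDecoderCrc; the class and method names here are invented for illustration. The point it demonstrates: compute the CRC over the exact bytes that go on the wire (the compressed body), and write the CRC in one fixed, documented byte order (ByteBuffer's default big-endian here), so corruption is detected before the body is ever handed to a decompressor.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public class FrameCrc
{
    // CRC over the wire bytes themselves, not over the decompressed payload.
    static int crcOfWireBytes(byte[] wireBytes)
    {
        CRC32 crc = new CRC32();
        crc.update(wireBytes, 0, wireBytes.length);
        return (int) crc.getValue();
    }

    // Frame layout (illustrative): [4-byte CRC, big-endian][compressed body]
    static byte[] frame(byte[] compressedBody)
    {
        ByteBuffer buf = ByteBuffer.allocate(4 + compressedBody.length);
        buf.putInt(crcOfWireBytes(compressedBody)); // one fixed byte order
        buf.put(compressedBody);
        return buf.array();
    }

    static byte[] unframe(byte[] frame)
    {
        ByteBuffer buf = ByteBuffer.wrap(frame);
        int expected = buf.getInt();
        byte[] body = new byte[buf.remaining()];
        buf.get(body);
        if (crcOfWireBytes(body) != expected)
            throw new IllegalStateException("corrupt frame: CRC mismatch");
        return body; // only now is it safe to hand to the decompressor
    }
}
```

Because the check runs on the received bytes before decompression, a corrupt stream fails the CRC check instead of being fed to LZ4, avoiding the segfault scenario described above.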
[jira] [Commented] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta
[ https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231525#comment-17231525 ] Michael Semb Wever commented on CASSANDRA-15299: bq. Absolutely, I published them myself to test, but I definitely think we should have them under /apache. Done, ref: INFRA-21103. Let's see how that goes.
[jira] [Commented] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta
[ https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231489#comment-17231489 ] Sam Tunnicliffe commented on CASSANDRA-15299: - {quote}If we're updating to use a new updated version of the driver, does that mean the cassandra-test branch is being sync'd up to master in the progress?{quote} This is the thing I'm unsure about - the python driver is not an asf project (yet), so it's not really up to me/us whether we update that branch (of course I can open a PR to the driver to do that). Binding cassandra-dtest to a specific published commit is something wholly within our wheelhouse though, so that was the route I took. TBH, I don't recall the full reasoning for using the cassandra-test branch in the first place (backports I suppose). {quote}Is it time to start deploying these images under apache/ ? If agreed, I can open an infra ticket to set up deployment of docker images.{quote} Absolutely, I published them myself to test, but I definitely think we should have them under /apache.
[jira] [Updated] (CASSANDRA-16246) Unexpected warning "Ignoring Unrecognized strategy option" for NetworkTopologyStrategy when restarting
[ https://issues.apache.org/jira/browse/CASSANDRA-16246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sam Tunnicliffe updated CASSANDRA-16246: Fix Version/s: (was: 4.0-beta) 4.0-beta4 Since Version: 3.0.0 Source Control Link: https://github.com/apache/cassandra/commit/fde640fe52704836ec21fedd62cae21290e099ec Resolution: Fixed Status: Resolved (was: Ready to Commit) Committed to trunk in {{fde640fe52704836ec21fedd62cae21290e099ec}}, thanks!
> Unexpected warning "Ignoring Unrecognized strategy option" for
> NetworkTopologyStrategy when restarting
> --
>
> Key: CASSANDRA-16246
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16246
> Project: Cassandra
> Issue Type: Bug
> Components: Observability/Logging
> Reporter: Yifan Cai
> Assignee: Yifan Cai
> Priority: Normal
> Fix For: 4.0-beta4
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> During a restart, a number of warning messages like
> "AbstractReplicationStrategy.java:364 - Ignoring Unrecognized strategy option
> {datacenter2} passed to NetworkTopologyStrategy for keyspace
> distributed_test_keyspace" are logged.
> The warnings are unexpected since the mentioned DC exists.
> It seems to be caused by improper ordering during startup: when the keyspaces
> are opened, the node is not yet aware of the DCs.
> The warning can be reproduced using the test below.
> {code:java}
> @Test
> public void testEmitsWarningsForNetworkTopologyStategyConfigOnRestart() throws Exception
> {
>     int nodesPerDc = 2;
>     try (Cluster cluster = builder().withConfig(c -> c.with(GOSSIP, NETWORK))
>                                     .withRacks(2, 1, nodesPerDc)
>                                     .start())
>     {
>         cluster.schemaChange("CREATE KEYSPACE " + KEYSPACE +
>                              " WITH replication = {'class': 'NetworkTopologyStrategy', " +
>                              "'datacenter1' : " + nodesPerDc + ", 'datacenter2' : " + nodesPerDc + " };");
>         cluster.get(2).nodetool("flush");
>         System.out.println("Stop node 2 in datacenter 1");
>         cluster.get(2).shutdown().get();
>         System.out.println("Start node 2 in datacenter 1");
>         cluster.get(2).startup();
>         List<String> result = cluster.get(2).logs().grep("Ignoring Unrecognized strategy option \\{datacenter2\\}").getResult();
>         Assert.assertFalse(result.isEmpty());
>     }
> }
> {code}
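For reference, the warning this ticket is about boils down to a membership check along the lines of the sketch below (hypothetical names; the real logic lives in the server's replication-strategy validation). A DC named in the replication options that is absent from the node's currently-known set of DCs triggers the message, which is why an incomplete view of the ring at startup produces false positives even though the DC actually exists:

```java
import java.util.Map;
import java.util.Set;

// Minimal sketch of the check that emits the warning: every replication
// option that names a DC must match a DC the node currently knows about.
class NtsOptionCheck
{
    // Returns the warning text for the first unrecognized DC, or null if all
    // named DCs are known to this node.
    static String check(Map<String, String> replicationOptions, Set<String> knownDcs)
    {
        for (String option : replicationOptions.keySet())
        {
            if (option.equals("class"))
                continue; // not a DC name
            if (!knownDcs.contains(option))
                return "Ignoring Unrecognized strategy option {" + option + "}";
        }
        return null;
    }
}
```

With the fix, peers (and hence their DCs) are loaded into TokenMetadata before keyspaces are opened, so `knownDcs` is complete and the check passes.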
[jira] [Updated] (CASSANDRA-16246) Unexpected warning "Ignoring Unrecognized strategy option" for NetworkTopologyStrategy when restarting
[ https://issues.apache.org/jira/browse/CASSANDRA-16246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sam Tunnicliffe updated CASSANDRA-16246: Status: Ready to Commit (was: Changes Suggested) Thanks, LGTM
[cassandra] branch trunk updated: Add saved Host IDs to TokenMetadata during startup
This is an automated email from the ASF dual-hosted git repository.

samt pushed a commit to branch trunk
in repository https://gitbox.apache.org/repos/asf/cassandra.git

The following commit(s) were added to refs/heads/trunk by this push:
     new fde640f  Add saved Host IDs to TokenMetadata during startup
fde640f is described below

commit fde640fe52704836ec21fedd62cae21290e099ec
Author: yifan-c
AuthorDate: Thu Nov 5 17:54:11 2020 -0800

    Add saved Host IDs to TokenMetadata during startup

    Patch by Yifan Cai; reviewed by Sam Tunnicliffe for CASSANDRA-16246
---
 CHANGES.txt                                        |  1 +
 .../apache/cassandra/service/StorageService.java   | 73 ++
 .../cassandra/distributed/impl/Instance.java       |  9 +++
 .../distributed/test/NetworkTopologyTest.java      | 26 +++-
 4 files changed, 68 insertions(+), 41 deletions(-)

diff --git a/CHANGES.txt b/CHANGES.txt
index cb4d5bc..cbcc091 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,4 +1,5 @@
 4.0-beta4
+ * Add saved Host IDs to TokenMetadata at startup (CASSANDRA-16246)
  * Ensure that CacheMetrics.requests is picked up by the metric reporter (CASSANDRA-16228)
  * Add a ratelimiter to snapshot creation and deletion (CASSANDRA-13019)
  * Produce consistent tombstone for reads to avoid digest mistmatch (CASSANDRA-15369)
diff --git a/src/java/org/apache/cassandra/service/StorageService.java b/src/java/org/apache/cassandra/service/StorageService.java
index 4a3477c..3201d80 100644
--- a/src/java/org/apache/cassandra/service/StorageService.java
+++ b/src/java/org/apache/cassandra/service/StorageService.java
@@ -637,21 +637,6 @@ public class StorageService extends NotificationBroadcasterSupport implements IE
         MessagingService.instance().listen();
     }

-    public void populateTokenMetadata()
-    {
-        if (Boolean.parseBoolean(System.getProperty("cassandra.load_ring_state", "true")))
-        {
-            logger.info("Populating token metadata from system tables");
-            Multimap<InetAddressAndPort, Token> loadedTokens = SystemKeyspace.loadTokens();
-            if (!shouldBootstrap()) // if we have not completed bootstrapping, we should not add ourselves as a normal token
-                loadedTokens.putAll(FBUtilities.getBroadcastAddressAndPort(), SystemKeyspace.getSavedTokens());
-            for (InetAddressAndPort ep : loadedTokens.keySet())
-                tokenMetadata.updateNormalTokens(loadedTokens.get(ep), ep);
-
-            logger.info("Token metadata: {}", tokenMetadata);
-        }
-    }
-
     public synchronized void initServer() throws ConfigurationException
     {
         initServer(RING_DELAY);
@@ -676,6 +661,14 @@ public class StorageService extends NotificationBroadcasterSupport implements IE
             throw new AssertionError(e);
         }

+        if (Boolean.parseBoolean(System.getProperty("cassandra.load_ring_state", "true")))
+        {
+            logger.info("Loading persisted ring state");
+            populatePeerTokenMetadata();
+            for (InetAddressAndPort endpoint : tokenMetadata.getAllEndpoints())
+                Gossiper.runInGossipStageBlocking(() -> Gossiper.instance.addSavedEndpoint(endpoint));
+        }
+
         // daemon threads, like our executors', continue to run while shutdown hooks are invoked
         drainOnShutdown = NamedThreadFactory.createThread(new WrappedRunnable()
         {
@@ -697,8 +690,6 @@ public class StorageService extends NotificationBroadcasterSupport implements IE
         if (!Boolean.parseBoolean(System.getProperty("cassandra.start_gossip", "true")))
         {
             logger.info("Not starting gossip as requested.");
-            // load ring state in preparation for starting gossip later
-            loadRingState();
             initialized = true;
             return;
         }
@@ -740,27 +731,34 @@ public class StorageService extends NotificationBroadcasterSupport implements IE
         initialized = true;
     }

-    private void loadRingState()
+    public void populateTokenMetadata()
     {
         if (Boolean.parseBoolean(System.getProperty("cassandra.load_ring_state", "true")))
         {
-            logger.info("Loading persisted ring state");
-            Multimap<InetAddressAndPort, Token> loadedTokens = SystemKeyspace.loadTokens();
-            Map<InetAddressAndPort, UUID> loadedHostIds = SystemKeyspace.loadHostIds();
-            for (InetAddressAndPort ep : loadedTokens.keySet())
-            {
-                if (ep.equals(FBUtilities.getBroadcastAddressAndPort()))
-                {
-                    // entry has been mistakenly added, delete it
-                    SystemKeyspace.removeEndpoint(ep);
-                }
-                else
-                {
-                    if (loadedHostIds.containsKey(ep))
-                        tokenMetadata.updateHostId(loadedHostIds.get(ep), ep);
-                    Gossiper.runInGossipStageBlocking(() -> Gossiper.instance.addSavedEndpoint(ep));
-                }
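The shape of the fix in the diff above can be sketched as follows. This is a deliberately simplified model with String endpoints and invented names, not the actual StorageService code: peer host IDs and endpoints are loaded from the persisted system tables into the in-memory ring view first, and only then handed to gossip as saved endpoints, so keyspace opening already sees every peer; a persisted row for the local node itself is treated as a mistaken entry and skipped.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

// Simplified model of loading persisted ring state before gossip starts.
class RingStateLoader
{
    final Map<String, UUID> hostIds = new HashMap<>();   // endpoint -> host ID
    final Set<String> savedEndpoints = new HashSet<>();  // later fed to gossip

    void populatePeerTokenMetadata(Map<String, UUID> persistedHostIds, String localEndpoint)
    {
        for (Map.Entry<String, UUID> e : persistedHostIds.entrySet())
        {
            // A row for ourselves was added by mistake; skip (the real code
            // also deletes it from the system table).
            if (e.getKey().equals(localEndpoint))
                continue;
            hostIds.put(e.getKey(), e.getValue());
            // In the real patch this corresponds to Gossiper.addSavedEndpoint
            savedEndpoints.add(e.getKey());
        }
    }
}
```

The key ordering change is that this population now happens in initServer() before gossip (or anything that opens keyspaces) runs, rather than lazily just before gossip start.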
[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta
[ https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231462#comment-17231462 ] Michael Semb Wever edited comment on CASSANDRA-15299 at 11/13/20, 1:48 PM: --- bq. One thing that's a bit concerning is that the cassandra-test branch of the driver, which is what dtests are currently using, is currently 693 commits behind the master branch. If we're updating to use a new updated version of the driver, does that mean the {{cassandra-test}} branch is being sync'd up to master in the progress? {quote}Docker images: - beobal/cassandra-testing-ubuntu1910-java11:2020 - beobal/cassandra-testing-ubuntu1910-java11-w-dependencies:2020{quote} Is it time to start deploying these images under [{{apache/}}|https://hub.docker.com/u/apache] ? If agreed, I can open an infra ticket to set up deployment of docker images. bq. I'll open PRs to cassandra-builds and cassandra-dtest before going any further here. Go for it! :-)
[jira] [Updated] (CASSANDRA-14477) The check of num_tokens against the length of inital_token in the yaml triggers unexpectedly
[ https://issues.apache.org/jira/browse/CASSANDRA-14477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Semb Wever updated CASSANDRA-14477: --- Status: Changes Suggested (was: Review In Progress) > The check of num_tokens against the length of inital_token in the yaml > triggers unexpectedly > > > Key: CASSANDRA-14477 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14477 > Project: Cassandra > Issue Type: Bug > Components: Local/Config >Reporter: Vincent White >Assignee: Stefan Miklosovic >Priority: Low > Time Spent: 40m > Remaining Estimate: 0h > > In CASSANDRA-10120 we added a check that compares num_tokens against the > number of tokens supplied in the yaml via initial_token. From my reading of > CASSANDRA-10120 it was to prevent cassandra starting if the yaml contained > contradictory values for num_tokens and initial_tokens which should help > prevent misconfiguration via human error. The current behaviour appears to > differ slightly in that it performs this comparison regardless of whether > num_tokens is included in the yaml or not. Below are proposed patches to only > perform the check if both options are present in the yaml. > ||Branch|| > |[3.0.x|https://github.com/apache/cassandra/compare/cassandra-3.0...vincewhite:num_tokens_30]| > |[3.x|https://github.com/apache/cassandra/compare/cassandra-3.11...vincewhite:num_tokens_test_1_311]|
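The behaviour proposed in CASSANDRA-14477 amounts to a guard like the sketch below (hypothetical names and String-based config values; the real check lives in the server's yaml validation): the cross-check of num_tokens against the number of supplied initial tokens only runs when both options are actually present in the yaml.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the proposed fix: only compare num_tokens against the
// initial_token list when both are configured.
class TokenConfigCheck
{
    static void validate(Integer numTokens, String initialToken)
    {
        // If either option is absent, there is nothing contradictory to catch.
        if (numTokens == null || initialToken == null || initialToken.isEmpty())
            return;
        List<String> tokens = Arrays.asList(initialToken.split(","));
        if (tokens.size() != numTokens)
            throw new IllegalArgumentException(
                "The number of initial tokens (" + tokens.size() +
                ") must match num_tokens (" + numTokens + ")");
    }
}
```

Under the pre-patch behaviour the comparison effectively ran even when num_tokens was omitted, which is the unexpected trigger the ticket describes.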
[jira] [Commented] (CASSANDRA-16201) Reduce amount of allocations during batch statement execution
[ https://issues.apache.org/jira/browse/CASSANDRA-16201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231443#comment-17231443 ] Michael Semb Wever commented on CASSANDRA-16201: Out of the ticket scope… why are the microbench classes all in the {{org.apache.cassandra.test.microbench}} package? They are already separate under {{src/testmicrobench/}}, and by re-packaging them like this, accessed methods, e.g. {{bs.getMutations(..)}}, have to be made public instead of package-protected. It would be nice to keep methods package-protected where possible. AFAIK we also don't run the microbench classes in CI anywhere, so there's no guarantee they remain runnable over time. I could add them to the ci-cassandra pipeline, though ideally a dedicated bare-metal server would be needed to make [use|https://plugins.jenkins.io/jmh-report/] of the runtime [reports|https://www.jenkins.io/blog/2019/06/21/performance-testing-jenkins/]. +1 on all branch patches (including [~yifanc] review comments above). > Reduce amount of allocations during batch statement execution > - > > Key: CASSANDRA-16201 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16201 > Project: Cassandra > Issue Type: Bug > Components: Local/Other >Reporter: Thomas Steinmaurer >Assignee: Marcus Eriksson >Priority: Normal > Fix For: 3.0.x, 3.11.x, 4.0-beta > > Attachments: 16201_jfr_3023_alloc.png, 16201_jfr_3023_obj.png, > 16201_jfr_3118_alloc.png, 16201_jfr_3118_obj.png, 16201_jfr_40b3_alloc.png, > 16201_jfr_40b3_obj.png, screenshot-1.png, screenshot-2.png, screenshot-3.png, > screenshot-4.png > > > In a Cassandra 2.1 / 3.0 / 3.11 / 4.0b2 comparison test with the same load profile, > we see 4.0b2 going OOM from time to time. According to a heap dump, we have > multiple NTR threads in a 3-digit MB range. > This is likely related to object array pre-allocations at the size of > {{BatchUpdatesCollector.updatedRows}} per {{BTree}} although there is always > only 1 {{BTreeRow}} in the {{BTree}}. 
> !screenshot-1.png|width=100%! > So it seems we have many, many 20K-element pre-allocated object arrays > resulting in a shallow heap of 80K each, although there is only one element > in the array. > This sort of pre-allocation is causing a lot of memory pressure.
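The allocation pattern reported above can be illustrated with a small sketch. This is not the BatchUpdatesCollector code itself, just the general contrast: sizing a buffer to a global upper bound wastes a large, mostly-null object array when only one row ever arrives, while growing a small list on demand keeps the footprint proportional to actual content.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of pre-allocation vs. lazy growth: a 20K-slot Object[] has
// ~80KB of shallow heap (8 bytes/reference on a typical 64-bit JVM) even
// when it holds a single row, whereas a lazily grown list stays tiny.
public class LazyVsPreallocated {
    static Object[] preallocated(int upperBound, Object row) {
        Object[] rows = new Object[upperBound]; // mostly-null slots, paid up front
        rows[0] = row;
        return rows;
    }

    static List<Object> lazy(Object row) {
        List<Object> rows = new ArrayList<>(1); // grows only if more rows arrive
        rows.add(row);
        return rows;
    }

    public static void main(String[] args) {
        Object row = "one-row";
        System.out.println(preallocated(20_000, row).length); // 20000 slots for 1 row
        System.out.println(lazy(row).size());                 // 1
    }
}
```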
[jira] [Commented] (CASSANDRA-16271) Writes timeout instead of failing on cluster with CL-1 replicas available during replace
[ https://issues.apache.org/jira/browse/CASSANDRA-16271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231424#comment-17231424 ] Krishna Vadali commented on CASSANDRA-16271: My bad, I forgot to attach the diff; added it now. > Writes timeout instead of failing on cluster with CL-1 replicas available > during replace > > > Key: CASSANDRA-16271 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16271 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Coordination >Reporter: Krishna Vadali >Assignee: Sam Tunnicliffe >Priority: Normal > Attachments: sleep_before_replace.diff > > > Writes timeout instead of failing on cluster with CL-1 replicas available > during replace node operation. > With Consistency Level ALL, we are observing Timeout exceptions during writes > when (RF - 1) nodes are available in the cluster with one replace-node > operation running. The coordinator is expecting RF + 1 responses, while only > RF nodes (RF-1 nodes in UN and 1 node in UJ) are available in the > cluster, hence the timeout. > The same problem happens on a keyspace with RF=1, CL=ONE and one replica > being replaced, and with RF=3, CL=QUORUM, one replica down and another being > replaced. > I believe the expected behavior is that the write should fail with > UnavailableException since there are not enough NORMAL replicas to fulfill > the request. > h4. 
*Steps to reproduce:* > Run a 3-node test cluster (call the nodes node1 (127.0.0.1), node2 > (127.0.0.2), node3 (127.0.0.3)): > {code:java} > ccm create test -v 3.11.3 -n 3 -s > {code} > Create test keyspaces with RF = 3 and RF = 1 respectively: > {code:java} > create keyspace rf3 with replication = \{'class': 'SimpleStrategy', > 'replication_factor': 3}; > create keyspace rf1 with replication = \{'class': 'SimpleStrategy', > 'replication_factor': 1}; > {code} > Create a table test in both keyspaces: > {code:java} > create table rf3.test ( pk int primary KEY, value int); > create table rf1.test ( pk int primary KEY, value int); > {code} > Stop node node2: > {code:java} > ccm node2 stop > {code} > Create node node4: > {code:java} > ccm add node4 -i 127.0.0.4 > {code} > Enable auto_bootstrap: > {code:java} > ccm node4 updateconf 'auto_bootstrap: true' > {code} > Ensure node4 does not have itself in its seeds list. > Run a replace-node operation to replace node2 (address 127.0.0.2 corresponds to node > node2): > {code:java} > ccm node4 start --jvm_arg="-Dcassandra.replace_address=127.0.0.2" > {code} > While the replace operation is running, perform writes/reads with CONSISTENCY ALL; > we observed TimeoutException. > {code:java} > SET CONSISTENCY ALL: > cqlsh> insert into rf3.test (pk, value) values (16, 7); > WriteTimeout: Error from server: code=1100 [Coordinator node timed out > waiting for replica nodes' responses] message="Operation timed out - received > only 3 responses." info=\{'received_responses': 3, 'required_responses': 4, > 'consistency': 'ALL'}{code} > {code:java} > cqlsh> CONSISTENCY ONE; > cqlsh> insert into rf1.test (pk, value) VALUES(5, 1); > WriteTimeout: Error from server: code=1100 [Coordinator node timed out > waiting for replica nodes' responses] message="Operation timed out - received > only 1 responses." 
info=\{'received_responses': 1, 'required_responses': 2, > 'consistency': 'ONE'} > {code} > Cluster State: > {code:java} > Datacenter: datacenter1 > === > Status=Up/Down > |/ State=Normal/Leaving/Joining/Moving > -- Address Load Tokens Owns (effective) Host ID Rack > UN 127.0.0.1 70.45 KiB 1 100.0% > 4f652b22-045b-493b-8722-fb5f7e1723ce rack1 > UN 127.0.0.3 70.43 KiB 1 100.0% > a0dcd677-bdb3-4947-b9a7-14f3686a709f rack1 > UJ 127.0.0.4 137.47 KiB 1 ? > e3d794f1-081e-4aba-94f2-31950c713846 rack1 > {code} > Note: > We introduced a sleep during the replace operation in order to slow it down for our > experiments. We attached a code diff that does it.
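The expected fail-fast behaviour argued for in the ticket can be sketched as below. The names are illustrative (not Cassandra's actual ReplicaPlan/AbstractWriteResponseHandler API); the idea is that with a pending (joining) replacement replica the coordinator blocks for RF + pending responses, so when fewer live replicas exist it should raise an unavailable error up front rather than time out.

```java
// Sketch of a fail-fast availability check for writes during a host
// replacement: blockFor = RF + pending replicas; if fewer replicas are
// alive, throw immediately instead of waiting for a timeout.
public class AvailabilitySketch {
    static class UnavailableException extends RuntimeException {
        UnavailableException(int required, int alive) {
            super("Cannot achieve consistency: required " + required +
                  " responses but only " + alive + " replicas alive");
        }
    }

    /** For CL.ALL: block for all full replicas plus any pending (joining) replicas. */
    static void assureSufficientLiveNodes(int rf, int pending, int alive) {
        int blockFor = rf + pending;
        if (alive < blockFor)
            throw new UnavailableException(blockFor, alive);
    }

    public static void main(String[] args) {
        // The reported scenario: RF=3, one node being replaced (pending=1),
        // only 3 endpoints alive (2 normal + 1 joining) -> 4 required, 3 alive.
        try {
            assureSufficientLiveNodes(3, 1, 3);
        } catch (UnavailableException e) {
            System.out.println(e.getMessage());
        }
    }
}
```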
[jira] [Updated] (CASSANDRA-16271) Writes timeout instead of failing on cluster with CL-1 replicas available during replace
[ https://issues.apache.org/jira/browse/CASSANDRA-16271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krishna Vadali updated CASSANDRA-16271: --- Attachment: sleep_before_replace.diff
[jira] [Commented] (CASSANDRA-16266) Stress testing a mixed cluster with C* 2.1.0 (seed) and 2.0.0 causes NPE
[ https://issues.apache.org/jira/browse/CASSANDRA-16266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231420#comment-17231420 ] Brandon Williams commented on CASSANDRA-16266: -- Thank you for your detailed analysis, it will be very helpful in future versions. > Stress testing a mixed cluster with C* 2.1.0 (seed) and 2.0.0 causes NPE > > > Key: CASSANDRA-16266 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16266 > Project: Cassandra > Issue Type: Bug >Reporter: Yongle Zhang >Priority: Normal > > Steps to reproduce: > # setup a mixed cluster with C* 2.1.0 (seed node) and C* 2.0.0 > # run the stress testing tool, e.g., > {code:java} > /cassandra/tools/bin/cassandra-stress write n=1000 -rate threads=50 -node > 250.16.238.1,250.16.238.2{code} > NPE: > {code:java} > ERROR [InternalResponseStage:2] 2020-07-22 08:29:36,170 CassandraDaemon.java > (line 186) Exception in thread Thread[InternalResponseStage:2,5,main] > java.lang.NullPointerException > at > org.apache.cassandra.serializers.BooleanSerializer.deserialize(BooleanSerializer.java:33) > at > org.apache.cassandra.serializers.BooleanSerializer.deserialize(BooleanSerializer.java:24) > at > org.apache.cassandra.db.marshal.AbstractType.compose(AbstractType.java:142) > at > org.apache.cassandra.cql3.UntypedResultSet$Row.getBoolean(UntypedResultSet.java:106) > at > org.apache.cassandra.config.CFMetaData.fromSchemaNoColumnsNoTriggers(CFMetaData.java:1555) > at org.apache.cassandra.config.CFMetaData.fromSchema(CFMetaData.java:1642) > at > org.apache.cassandra.config.KSMetaData.deserializeColumnFamilies(KSMetaData.java:305) > at > org.apache.cassandra.db.DefsTables.mergeColumnFamilies(DefsTables.java:270) > at org.apache.cassandra.db.DefsTables.mergeSchema(DefsTables.java:183) > at > org.apache.cassandra.service.MigrationTask$1.response(MigrationTask.java:66) > at > org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:46) > at > 
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} > Root cause: incompatible data > In the `CFMetaData` class of version 2.0.0, there is a boolean field named > `replicate_on_write`. In the same class of version 2.1.0, however, this field > no longer exists. When serializing this class in function > `toSchemaNoColumnsNoTriggers`, it will first write all of its fields into a > `RowMutation` (in 2.0.0) / `Mutation` (in 2.1.0) class, and then serialize > this “Mutation” like class in the same way. In 2.0.0 the `replicate_on_write` > field gets serialized at > [https://github.com/apache/cassandra/blob/03045ca22b11b0e5fc85c4fabd83ce6121b5709b/src/java/org/apache/cassandra/config/CFMetaData.java#L1514] > . > When deserializing this class in function `fromSchemaNoColumnsNoTriggers`, it > reads all its fields from a map-like class `UntypedResultSet.Row`. In 2.0.0 > the `replicate_on_write` field gets deserialized at > [https://github.com/apache/cassandra/blob/03045ca22b11b0e5fc85c4fabd83ce6121b5709b/src/java/org/apache/cassandra/config/CFMetaData.java#L1555] > . > The problem is that the existence of the key is not checked, and the map > returns a `null` value because the message from 2.1.0 doesn’t contain the > `replicate_on_write` key, which leads to the NullPointerException.
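A defensive fix for the NPE described above would check for the column's presence before deserialising it. This sketch mimics `UntypedResultSet.Row` with a plain map; in the real code the equivalent would be guarding `row.getBoolean(...)` with a presence check and a fallback default.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the defensive pattern for the replicate_on_write NPE: a schema
// row sent by a 2.1.0 node omits the column, so the 2.0.0 deserializer's
// unconditional getBoolean call hits a null. Checking presence and falling
// back to a default avoids the NullPointerException.
public class SchemaRowSketch {
    static boolean replicateOnWrite(Map<String, Object> row, boolean defaultValue) {
        Object v = row.get("replicate_on_write"); // null when the peer dropped the column
        return v == null ? defaultValue : (Boolean) v;
    }

    public static void main(String[] args) {
        Map<String, Object> from210 = new HashMap<>(); // 2.1.0: column absent
        Map<String, Object> from200 = new HashMap<>(); // 2.0.0: column present
        from200.put("replicate_on_write", false);
        System.out.println(replicateOnWrite(from210, true));  // falls back to the default
        System.out.println(replicateOnWrite(from200, true));  // uses the serialised value
    }
}
```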
[jira] [Commented] (CASSANDRA-16271) Writes timeout instead of failing on cluster with CL-1 replicas available during replace
[ https://issues.apache.org/jira/browse/CASSANDRA-16271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231411#comment-17231411 ] Krishna Vadali commented on CASSANDRA-16271: Thanks [~samt], looking forward to your patch.
[jira] [Updated] (CASSANDRA-16271) Writes timeout instead of failing on cluster with CL-1 replicas available during replace
[ https://issues.apache.org/jira/browse/CASSANDRA-16271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krishna Vadali updated CASSANDRA-16271: --- Reviewers: Krishna Vadali, Paulo Motta (was: Paulo Motta)
[jira] [Updated] (CASSANDRA-16271) Writes timeout instead of failing on cluster with CL-1 replicas available during replace
[ https://issues.apache.org/jira/browse/CASSANDRA-16271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paulo Motta updated CASSANDRA-16271: Reviewers: Paulo Motta
[jira] [Commented] (CASSANDRA-16271) Writes timeout instead of failing on cluster with CL-1 replicas available during replace
[ https://issues.apache.org/jira/browse/CASSANDRA-16271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231406#comment-17231406 ] Paulo Motta commented on CASSANDRA-16271: Thanks for taking this Sam. I'd be happy to review it as I'm also familiar with this issue.
[jira] [Assigned] (CASSANDRA-16271) Writes timeout instead of failing on cluster with CL-1 replicas available during replace
[ https://issues.apache.org/jira/browse/CASSANDRA-16271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sam Tunnicliffe reassigned CASSANDRA-16271: --- Assignee: Sam Tunnicliffe > Writes timeout instead of failing on cluster with CL-1 replicas available > during replace > > > Key: CASSANDRA-16271 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16271 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Coordination >Reporter: Krishna Vadali >Assignee: Sam Tunnicliffe >Priority: Normal > > Writes timeout instead of failing on cluster with CL-1 replicas available > during replace node operation. > With Consistency Level ALL, we are observing Timeout exceptions during writes > when (RF - 1) nodes are available in the cluster with one replace-node > operation running. The coordinator is expecting RF + 1 responses, while there > are only RF nodes (RF-1 nodes in UN and 1 node in UJ) are available in the > cluster, hence timing out. > The same problem happens on a keyspace with RF=1, CL=ONE and one replica > being replaced. Also RF=3, CL=QUORUM, one replica down and another being > replaced. > I believe the expected behavior is that the write should fail with > UnavailableException since there are not enough NORMAL replicas to fulfill > the request. > h4. 
*Steps to reproduce:* > Run a 3 node test cluster (call the nodes node1 (127.0.0.1), node2 > (127.0.0.2), node3 (127.0.0.3)): > {code:java} > ccm create test -v 3.11.3 -n 3 -s > {code} > Create test keyspaces with RF = 3 and RF = 1 respectively: > {code:java} > create keyspace rf3 with replication = \{'class': 'SimpleStrategy', > 'replication_factor': 3}; > create keyspace rf1 with replication = \{'class': 'SimpleStrategy', > 'replication_factor': 1}; > {code} > Create a table test in both the keyspaces: > {code:java} > create table rf3.test ( pk int primary KEY, value int); > create table rf1.test ( pk int primary KEY, value int); > {code} > Stop node node2: > {code:java} > ccm node2 stop > {code} > Create node node4: > {code:java} > ccm add node4 -i 127.0.0.4 > {code} > Enable auto_bootstrap: > {code:java} > ccm node4 updateconf 'auto_bootstrap: true' > {code} > Ensure node4 does not have itself in its seeds list. > Run a host replacement for node2 (address 127.0.0.2 corresponds to node > node2): > {code:java} > ccm node4 start --jvm_arg="-Dcassandra.replace_address=127.0.0.2" > {code} > While the replace is running, perform writes/reads with CONSISTENCY ALL; > we observed TimeoutException: > {code:java} > cqlsh> CONSISTENCY ALL; > cqlsh> insert into rf3.test (pk, value) values (16, 7); > WriteTimeout: Error from server: code=1100 [Coordinator node timed out > waiting for replica nodes' responses] message="Operation timed out - received > only 3 responses." info=\{'received_responses': 3, 'required_responses': 4, > 'consistency': 'ALL'}{code} > {code:java} > cqlsh> CONSISTENCY ONE; > cqlsh> insert into rf1.test (pk, value) VALUES(5, 1); > WriteTimeout: Error from server: code=1100 [Coordinator node timed out > waiting for replica nodes' responses] message="Operation timed out - received > only 1 responses." 
info=\{'received_responses': 1, 'required_responses': 2, > 'consistency': 'ONE'} > {code} > Cluster State: > {code:java} > Datacenter: datacenter1 > === > Status=Up/Down > |/ State=Normal/Leaving/Joining/Moving > -- AddressLoad Tokens Owns (effective) Host ID > Rack > UN 127.0.0.1 70.45 KiB 1100.0% > 4f652b22-045b-493b-8722-fb5f7e1723ce rack1 > UN 127.0.0.3 70.43 KiB 1100.0% > a0dcd677-bdb3-4947-b9a7-14f3686a709f rack1 > UJ 127.0.0.4 137.47 KiB 1? > e3d794f1-081e-4aba-94f2-31950c713846 rack1 > {code} > Note: > We introduced a sleep during the replace operation in order to carry out our > experiments. We attached a code diff that does this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
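The response arithmetic reported above can be sketched as follows. This is a hypothetical simplification, not Cassandra's actual code: the joining replacement node counts as a pending replica, so the coordinator waits for blockFor(CL) plus pending responses, and with one replica dead it can never collect enough, hence the timeout rather than an immediate UnavailableException.

```python
# Hypothetical sketch (not Cassandra's actual implementation) of why a
# replace inflates the required response count reported in the errors above.

def block_for(consistency, rf):
    """Base number of acks required, before pending replicas are added."""
    if consistency == "ONE":
        return 1
    if consistency == "QUORUM":
        return rf // 2 + 1
    if consistency == "ALL":
        return rf
    raise ValueError(consistency)

def required_responses(consistency, rf, pending_replicas):
    # Pending (joining) replicas must also ack so they do not miss writes.
    return block_for(consistency, rf) + pending_replicas

# RF=3, CL=ALL, one node being replaced: 3 natural replicas plus 1 pending
# -> 4 required, but only 3 nodes can answer -> timeout.
assert required_responses("ALL", 3, 1) == 4
# RF=1, CL=ONE, the single replica being replaced: 1 + 1 = 2 required,
# matching 'required_responses': 2 in the error above.
assert required_responses("ONE", 1, 1) == 2
# RF=3, CL=QUORUM, one replica down and another being replaced:
# 2 + 1 = 3 required, but only 2 nodes can answer.
assert required_responses("QUORUM", 3, 1) == 3
```

Under this model the coordinator's target is simply unreachable with the available nodes, which is why the reporter argues the write should fail fast instead of timing out.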
[jira] [Updated] (CASSANDRA-16271) Writes timeout instead of failing on cluster with CL-1 replicas available during replace
[ https://issues.apache.org/jira/browse/CASSANDRA-16271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sam Tunnicliffe updated CASSANDRA-16271: Status: Open (was: Triage Needed)
[jira] [Commented] (CASSANDRA-16271) Writes timeout instead of failing on cluster with CL-1 replicas available during replace
[ https://issues.apache.org/jira/browse/CASSANDRA-16271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231405#comment-17231405 ] Sam Tunnicliffe commented on CASSANDRA-16271: - I'm pretty sure this is fixed in trunk as a side effect of the rework to replication for Transient Replication (CASSANDRA-14404). I'm familiar with this issue, so I'll try and post a patch shortly.
[jira] [Updated] (CASSANDRA-16271) Writes timeout instead of failing on cluster with CL-1 replicas available during replace
[ https://issues.apache.org/jira/browse/CASSANDRA-16271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paulo Motta updated CASSANDRA-16271: Bug Category: Parent values: Correctness(12982)Level 1 values: API / Semantic Implementation(12988) Complexity: Normal Discovered By: User Report Severity: Normal Since Version: 2.2.8
[jira] [Updated] (CASSANDRA-16013) sstablescrub unit test hardening and docs improvements
[ https://issues.apache.org/jira/browse/CASSANDRA-16013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Berenguer Blasi updated CASSANDRA-16013: Summary: sstablescrub unit test hardening and docs improvements (was: sstablescrub unit test hardening an docs improvements) > sstablescrub unit test hardening and docs improvements > -- > > Key: CASSANDRA-16013 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16013 > Project: Cassandra > Issue Type: Bug > Components: Tool/sstable >Reporter: Berenguer Blasi >Assignee: Berenguer Blasi >Priority: Normal > Fix For: 4.0 > > Time Spent: 20m > Remaining Estimate: 0h > > During CASSANDRA-15883 / CASSANDRA-15991 it was detected that unit test coverage > for this tool is minimal. There is a unit test to enhance upon under > {{test/unit/org/apache/cassandra/tools}}. Also, the docs need updating to reflect > the latest options available.
[jira] [Commented] (CASSANDRA-15580) 4.0 quality testing: Repair
[ https://issues.apache.org/jira/browse/CASSANDRA-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231396#comment-17231396 ] Marcus Eriksson commented on CASSANDRA-15580: - should probably wait for CASSANDRA-16274 before testing with {{-os}} > 4.0 quality testing: Repair > --- > > Key: CASSANDRA-15580 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15580 > Project: Cassandra > Issue Type: Task > Components: Test/dtest/python >Reporter: Josh McKenzie >Assignee: Alexander Dejanovski >Priority: Normal > Fix For: 4.0-rc > > > Reference [doc from > NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#] > for context. > *Shepherd: Alexander Dejanovski* > We aim for 4.0 to have the first fully functioning incremental repair > solution (CASSANDRA-9143)! Furthermore we aim to verify that all types of > repair: (full range, sub range, incremental) function as expected as well as > ensuring community tools such as Reaper work. CASSANDRA-3200 adds an > experimental option to reduce the amount of data streamed during repair, we > should write more tests and see how it works with big nodes.
[jira] [Commented] (CASSANDRA-16185) Add tests to cover CommitLog metrics
[ https://issues.apache.org/jira/browse/CASSANDRA-16185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231382#comment-17231382 ] Yasar Arafath Baigh commented on CASSANDRA-16185: - CommitLogMetrics test-case patch is attached. > Add tests to cover CommitLog metrics > > > Key: CASSANDRA-16185 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16185 > Project: Cassandra > Issue Type: Improvement > Components: Test/unit >Reporter: Benjamin Lerer >Assignee: Yasar Arafath Baigh >Priority: Normal > Fix For: 4.0-beta > > Attachments: 0001-Unit-Test-cases-for-CommitLogMetrics.patch > > > The only metric that seems to be covered by unit tests for the CommitLog > metrics is {{oversizedMutations}}. We should add tests for the other ones.
[jira] [Updated] (CASSANDRA-15957) org.apache.cassandra.repair.RepairJobTest testOptimizedCreateStandardSyncTasks
[ https://issues.apache.org/jira/browse/CASSANDRA-15957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcus Eriksson updated CASSANDRA-15957: Resolution: Fixed Status: Resolved (was: Open) CASSANDRA-16274 contains a fix for this > org.apache.cassandra.repair.RepairJobTest testOptimizedCreateStandardSyncTasks > -- > > Key: CASSANDRA-15957 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15957 > Project: Cassandra > Issue Type: Bug > Components: Test/unit >Reporter: David Capwell >Assignee: Marcus Eriksson >Priority: Normal > Fix For: 4.0-beta > > > Build: > https://ci-cassandra.apache.org/job/Cassandra-trunk-test/lastCompletedBuild/testReport/junit/org.apache.cassandra.repair/RepairJobTest/testOptimizedCreateStandardSyncTasks/ > Expecting: > <[#, >#]> > to contain only: > <[(,0001]]> > but the following elements were unexpected: > <[#]> > This failed 3 times in a row on Jenkins
[jira] [Comment Edited] (CASSANDRA-16274) Improve performance when calculating StreamTasks with optimised streaming
[ https://issues.apache.org/jira/browse/CASSANDRA-16274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231378#comment-17231378 ] Marcus Eriksson edited comment on CASSANDRA-16274 at 11/13/20, 11:02 AM: - patch: https://github.com/krummas/cassandra/commits/marcuse/3200opt cci: https://app.circleci.com/pipelines/github/krummas/cassandra?branch=marcuse%2F3200opt A few commits in this; the basic idea for the optimisations is to iterate only over the ranges that can overlap instead of over all diffing ranges. When picking endpoints to stream from, we always pick the next node sorted by ip address - it does not matter which node we pick as long as we all pick the same one. This branch also contains a fix for CASSANDRA-15957 was (Author: krummas): patch: https://github.com/krummas/cassandra/commits/marcuse/3200opt cci: https://app.circleci.com/pipelines/github/krummas/cassandra?branch=marcuse%2F3200opt > Improve performance when calculating StreamTasks with optimised streaming > - > > Key: CASSANDRA-16274 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16274 > Project: Cassandra > Issue Type: Improvement > Components: Consistency/Repair >Reporter: Marcus Eriksson >Assignee: Marcus Eriksson >Priority: Normal > Fix For: 4.0-beta4 > > > The way stream tasks are calculated currently is quite inefficient; improve > that. > Also, we currently try to distribute the streaming nodes evenly, which creates > many more sstables than necessary - instead we should try to stream > everything from a single peer, which should reduce the number of sstables > created on the out-of-sync node.
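The peer-selection idea in the comment above can be illustrated with a hypothetical sketch (not the actual patch): when several remote nodes could stream an out-of-sync range, every node must independently pick the same source, so sorting the candidates and taking the first gives a deterministic choice without coordination, and funnelling a range to one peer keeps the sstable count down on the out-of-sync node.

```python
# Hypothetical illustration of "always pick the next node sorted by ip
# address" - any deterministic tie-break works, as long as every node
# computes the same answer from the same candidate set. Real code would
# compare addresses numerically rather than lexically.

def pick_stream_source(candidates):
    """Deterministically choose one streaming peer from equivalent candidates."""
    return min(candidates)

nodes = ["127.0.0.3", "127.0.0.1", "127.0.0.2"]
# The same answer falls out regardless of the order candidates arrive in.
assert pick_stream_source(nodes) == "127.0.0.1"
assert pick_stream_source(sorted(nodes, reverse=True)) == "127.0.0.1"
```

The design point is that no messaging is needed to agree on a source: determinism substitutes for coordination.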
[jira] [Comment Edited] (CASSANDRA-16259) tablehistograms cause ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/CASSANDRA-16259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231379#comment-17231379 ] Benjamin Lerer edited comment on CASSANDRA-16259 at 11/13/20, 11:02 AM: {quote}If I understand the change within CASSANDRA-15164 right, then the storage format of SSTable statistics has changed, which should also bump the SSTable version, shouldn't it?{quote} The storage format did not change. The encoding was already {{\*}}. The old code is able to read the statistics produced by the new one. The problem is only at the metric level, where C* tries to merge SSTable histograms that have different numbers of buckets with some buggy code. When you scrub the old SSTable, it is recreated with the new number of buckets, ensuring that you will not hit the TableMetric bug again. was (Author: blerer): {quote}If I understand the change within CASSANDRA-15164 right, then the storage format of SSTable statistics has changed, which should also bump the SSTable version, shouldn't it?{quote} The storage format did not change. The encoding was already {{*}}. The old code is able to read the statistics produced by the new one. The problem is only at the metric level, where C* tries to merge SSTable histograms that have different numbers of buckets with some buggy code. When you scrub the old SSTable, it is recreated with the new number of buckets, ensuring that you will not hit the TableMetric bug again. > tablehistograms cause ArrayIndexOutOfBoundsException > > > Key: CASSANDRA-16259 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16259 > Project: Cassandra > Issue Type: Bug > Components: Observability/Metrics >Reporter: Justin Montgomery >Assignee: Benjamin Lerer >Priority: Normal > Fix For: 2.2.x, 3.0.x, 3.11.x, 4.0-beta > > > After upgrading some nodes in our cluster from 3.11.8 to 3.11.9 an error > appeared on the upgraded nodes when trying to access *tablehistograms*. 
The > same command run on our .8 nodes return as expected, only the upgraded .9 > nodes fail. Not all tables fail when queried, but about 90% of them do. > We use Datastax MCAC which appears to query histograms every 30 seconds, this > outputs to the system.log: > {noformat} > WARN [insights-3-1] 2020-11-09 01:11:22,331 UnixSocketClient.java:830 - > Error reporting: > java.lang.ArrayIndexOutOfBoundsException: 115 > at > org.apache.cassandra.metrics.TableMetrics.combineHistograms(TableMetrics.java:261) > ~[apache-cassandra-3.11.9.jar:3.11.9] > at > org.apache.cassandra.metrics.TableMetrics.access$000(TableMetrics.java:48) > ~[apache-cassandra-3.11.9.jar:3.11.9] > at > org.apache.cassandra.metrics.TableMetrics$11.getValue(TableMetrics.java:376) > ~[apache-cassandra-3.11.9.jar:3.11.9] > at > org.apache.cassandra.metrics.TableMetrics$11.getValue(TableMetrics.java:373) > ~[apache-cassandra-3.11.9.jar:3.11.9] > at > com.datastax.mcac.UnixSocketClient.writeMetric(UnixSocketClient.java:839) > [datastax-mcac-agent.jar:na] > at > com.datastax.mcac.UnixSocketClient.access$700(UnixSocketClient.java:78) > [datastax-mcac-agent.jar:na] > at > com.datastax.mcac.UnixSocketClient$2.lambda$onGaugeAdded$0(UnixSocketClient.java:626) > ~[datastax-mcac-agent.jar:na] > at > com.datastax.mcac.UnixSocketClient.writeGroup(UnixSocketClient.java:819) > [datastax-mcac-agent.jar:na] > at > com.datastax.mcac.UnixSocketClient.lambda$restartMetricReporting$2(UnixSocketClient.java:798) > [datastax-mcac-agent.jar:na] > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > ~[na:1.8.0_272] > at > io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:126) > ~[netty-all-4.0.44.Final.jar:4.0.44.Final] > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399) > ~[netty-all-4.0.44.Final.jar:4.0.44.Final] > at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:307) > 
~[netty-all-4.0.44.Final.jar:4.0.44.Final] > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) > ~[netty-all-4.0.44.Final.jar:4.0.44.Final] > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) > ~[netty-all-4.0.44.Final.jar:4.0.44.Final] > at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_272]{noformat} > Manually trying a histogram from the CLI: > {noformat} > $ nodetool tablehistograms logdata log_height_index > error: 115 > -- StackTrace -- > java.lang.ArrayIndexOutOfBoundsException: 115 > at > org.apache.cassandra.metrics.TableMetrics.combineHistograms(TableMetrics.java:261) > at >
[jira] [Commented] (CASSANDRA-16259) tablehistograms cause ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/CASSANDRA-16259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231379#comment-17231379 ] Benjamin Lerer commented on CASSANDRA-16259
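The histogram-merging failure discussed in CASSANDRA-16259 above can be sketched with a hypothetical simplification (not the actual TableMetrics.combineHistograms code): sizing the merged array from the first histogram throws when a later histogram has more buckets, which is exactly the situation after an upgrade changes the bucket count.

```python
# Hypothetical sketch of the bug class behind the
# ArrayIndexOutOfBoundsException above, plus an obvious fix.

def combine_histograms_buggy(histograms):
    sizes = list(histograms[0])          # sized from the first histogram only
    for hist in histograms[1:]:
        for i, count in enumerate(hist):
            sizes[i] += count            # IndexError if hist has more buckets
    return sizes

def combine_histograms_fixed(histograms):
    # Size the result to the largest bucket count instead.
    width = max(len(h) for h in histograms)
    sizes = [0] * width
    for hist in histograms:
        for i, count in enumerate(hist):
            sizes[i] += count
    return sizes

old_sstable = [1, 2, 3]                  # pre-upgrade: 3 buckets
new_sstable = [4, 5, 6, 7]               # post-upgrade: 4 buckets
try:
    combine_histograms_buggy([old_sstable, new_sstable])
    raise AssertionError("expected an IndexError")
except IndexError:
    pass                                 # the AIOOBE analogue
assert combine_histograms_fixed([old_sstable, new_sstable]) == [5, 7, 9, 7]
```

This also matches the workaround in the comment: scrubbing rewrites the old sstable with the new bucket count, so the mismatched-width case never arises.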
[jira] [Updated] (CASSANDRA-16274) Improve performance when calculating StreamTasks with optimised streaming
[ https://issues.apache.org/jira/browse/CASSANDRA-16274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcus Eriksson updated CASSANDRA-16274: Test and Documentation Plan: new unit tests, jvm dtests Status: Patch Available (was: Open) patch: https://github.com/krummas/cassandra/commits/marcuse/3200opt cci: https://app.circleci.com/pipelines/github/krummas/cassandra?branch=marcuse%2F3200opt
[jira] [Updated] (CASSANDRA-16274) Improve performance when calculating StreamTasks with optimised streaming
[ https://issues.apache.org/jira/browse/CASSANDRA-16274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcus Eriksson updated CASSANDRA-16274:
----------------------------------------
    Change Category: Performance
         Complexity: Normal
        Component/s: Consistency/Repair
      Fix Version/s: 4.0-beta4
             Status: Open  (was: Triage Needed)

> Improve performance when calculating StreamTasks with optimised streaming
> -
>
> Key: CASSANDRA-16274
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16274
> Project: Cassandra
> Issue Type: Improvement
> Components: Consistency/Repair
> Reporter: Marcus Eriksson
> Assignee: Marcus Eriksson
> Priority: Normal
> Fix For: 4.0-beta4
>
> The way stream tasks are calculated is currently quite inefficient; improve that.
> Also, we currently try to distribute streaming evenly across nodes, which creates many more sstables than necessary - instead we should try to stream everything from a single peer, which should reduce the number of sstables created on the out-of-sync node.
[jira] [Created] (CASSANDRA-16274) Improve performance when calculating StreamTasks with optimised streaming
Marcus Eriksson created CASSANDRA-16274:
----------------------------------------

    Summary: Improve performance when calculating StreamTasks with optimised streaming
        Key: CASSANDRA-16274
        URL: https://issues.apache.org/jira/browse/CASSANDRA-16274
    Project: Cassandra
 Issue Type: Improvement
   Reporter: Marcus Eriksson
   Assignee: Marcus Eriksson

The way stream tasks are calculated is currently quite inefficient; improve that.
Also, we currently try to distribute streaming evenly across nodes, which creates many more sstables than necessary - instead we should try to stream everything from a single peer, which should reduce the number of sstables created on the out-of-sync node.
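The single-peer idea from the ticket can be sketched as a greedy source selection: for each out-of-sync range, prefer a peer that is already serving other ranges, so the receiving node opens as few distinct stream sessions (and therefore creates as few sstables) as possible. This is only an illustration of the idea, not Cassandra's actual StreamPlan code; the class name `StreamSourcePicker`, the range strings, and the node names are all hypothetical.

```java
import java.util.*;

// Hypothetical sketch: choose stream sources so that an out-of-sync node
// pulls its ranges from as few distinct peers as possible, instead of
// spreading the ranges evenly across all in-sync replicas.
public class StreamSourcePicker
{
    // rangeToCandidates: for each out-of-sync range, the peers holding a consistent copy
    public static Map<String, String> pickSources(Map<String, List<String>> rangeToCandidates)
    {
        Map<String, String> chosen = new LinkedHashMap<>();
        Map<String, Integer> useCount = new HashMap<>();
        for (Map.Entry<String, List<String>> e : rangeToCandidates.entrySet())
        {
            // Greedily reuse the peer we have already picked most often;
            // fewer distinct sources means fewer sstables on the receiving node.
            String best = null;
            for (String peer : e.getValue())
                if (best == null || useCount.getOrDefault(peer, 0) > useCount.getOrDefault(best, 0))
                    best = peer;
            chosen.put(e.getKey(), best);
            useCount.merge(best, 1, Integer::sum);
        }
        return chosen;
    }

    public static void main(String[] args)
    {
        Map<String, List<String>> candidates = new LinkedHashMap<>();
        candidates.put("(0,100]", Arrays.asList("nodeB", "nodeA"));
        candidates.put("(100,200]", Arrays.asList("nodeB", "nodeC"));
        candidates.put("(200,300]", Arrays.asList("nodeB", "nodeA"));
        Map<String, String> sources = pickSources(candidates);
        // All three ranges can be served by nodeB, i.e. a single stream session
        System.out.println(new HashSet<>(sources.values()).size()); // prints 1
    }
}
```

A round-robin assignment over the same input would pick two or three distinct sources here; the greedy reuse collapses them to one, which is the sstable-count reduction the ticket describes.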
[jira] [Commented] (CASSANDRA-16189) Add tests for the Hint service metrics
[ https://issues.apache.org/jira/browse/CASSANDRA-16189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231372#comment-17231372 ]

Benjamin Lerer commented on CASSANDRA-16189:
--------------------------------------------
I should have time next week for the review. Thanks.

> Add tests for the Hint service metrics
> --
>
> Key: CASSANDRA-16189
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16189
> Project: Cassandra
> Issue Type: Improvement
> Components: Test/dtest/python
> Reporter: Benjamin Lerer
> Assignee: Mohamed Zafraan
> Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: 0001-added-hints-metrics-test.patch
>
> There are currently no tests for the hint metrics.
[jira] [Updated] (CASSANDRA-16185) Add tests to cover CommitLog metrics
[ https://issues.apache.org/jira/browse/CASSANDRA-16185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yasar Arafath Baigh updated CASSANDRA-16185:
--------------------------------------------
    Attachment: 0001-Unit-Test-cases-for-CommitLogMetrics.patch

> Add tests to cover CommitLog metrics
>
> Key: CASSANDRA-16185
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16185
> Project: Cassandra
> Issue Type: Improvement
> Components: Test/unit
> Reporter: Benjamin Lerer
> Assignee: Yasar Arafath Baigh
> Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: 0001-Unit-Test-cases-for-CommitLogMetrics.patch
>
> The only CommitLog metric that seems to be covered by a unit test is {{oversizedMutations}}. We should add tests for the other ones.
[jira] [Commented] (CASSANDRA-16189) Add tests for the Hint service metrics
[ https://issues.apache.org/jira/browse/CASSANDRA-16189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231279#comment-17231279 ]

Mohamed Zafraan commented on CASSANDRA-16189:
---------------------------------------------
That's fine. Do let me know if there's anything to do on my side.

> Add tests for the Hint service metrics
> --
>
> Key: CASSANDRA-16189
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16189
> Project: Cassandra
> Issue Type: Improvement
> Components: Test/dtest/python
> Reporter: Benjamin Lerer
> Assignee: Mohamed Zafraan
> Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: 0001-added-hints-metrics-test.patch
>
> There are currently no tests for the hint metrics.
[jira] [Commented] (CASSANDRA-16189) Add tests for the Hint service metrics
[ https://issues.apache.org/jira/browse/CASSANDRA-16189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231240#comment-17231240 ]

Benjamin Lerer commented on CASSANDRA-16189:
--------------------------------------------
Sorry for the noise around the reviewer and status. I had to do some testing for INFRA-21091 and used that ticket for it.

> Add tests for the Hint service metrics
> --
>
> Key: CASSANDRA-16189
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16189
> Project: Cassandra
> Issue Type: Improvement
> Components: Test/dtest/python
> Reporter: Benjamin Lerer
> Assignee: Mohamed Zafraan
> Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: 0001-added-hints-metrics-test.patch
>
> There are currently no tests for the hint metrics.
[jira] [Updated] (CASSANDRA-16189) Add tests for the Hint service metrics
[ https://issues.apache.org/jira/browse/CASSANDRA-16189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Lerer updated CASSANDRA-16189:
---------------------------------------
    Reviewers: Benjamin Lerer  (was: Adam Holmberg)

> Add tests for the Hint service metrics
> --
>
> Key: CASSANDRA-16189
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16189
> Project: Cassandra
> Issue Type: Improvement
> Components: Test/dtest/python
> Reporter: Benjamin Lerer
> Assignee: Mohamed Zafraan
> Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: 0001-added-hints-metrics-test.patch
>
> There are currently no tests for the hint metrics.
[jira] [Updated] (CASSANDRA-16189) Add tests for the Hint service metrics
[ https://issues.apache.org/jira/browse/CASSANDRA-16189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Lerer updated CASSANDRA-16189:
---------------------------------------
    Reviewers: Adam Holmberg
       Status: Review In Progress  (was: Patch Available)

> Add tests for the Hint service metrics
> --
>
> Key: CASSANDRA-16189
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16189
> Project: Cassandra
> Issue Type: Improvement
> Components: Test/dtest/python
> Reporter: Benjamin Lerer
> Assignee: Mohamed Zafraan
> Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: 0001-added-hints-metrics-test.patch
>
> There are currently no tests for the hint metrics.
[jira] [Updated] (CASSANDRA-16189) Add tests for the Hint service metrics
[ https://issues.apache.org/jira/browse/CASSANDRA-16189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Lerer updated CASSANDRA-16189:
---------------------------------------
    Status: Patch Available  (was: Review In Progress)

> Add tests for the Hint service metrics
> --
>
> Key: CASSANDRA-16189
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16189
> Project: Cassandra
> Issue Type: Improvement
> Components: Test/dtest/python
> Reporter: Benjamin Lerer
> Assignee: Mohamed Zafraan
> Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: 0001-added-hints-metrics-test.patch
>
> There are currently no tests for the hint metrics.
[jira] [Commented] (CASSANDRA-16259) tablehistograms cause ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/CASSANDRA-16259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231238#comment-17231238 ]

Tibor Repasi commented on CASSANDRA-16259:
------------------------------------------
Well, that would explain why scrubbing the table fixed it. If I understand the change within CASSANDRA-15164 right, then the storage format of SSTable statistics has changed, which should also bump the SSTable version, shouldn't it?

> tablehistograms cause ArrayIndexOutOfBoundsException
>
> Key: CASSANDRA-16259
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16259
> Project: Cassandra
> Issue Type: Bug
> Components: Observability/Metrics
> Reporter: Justin Montgomery
> Assignee: Benjamin Lerer
> Priority: Normal
> Fix For: 2.2.x, 3.0.x, 3.11.x, 4.0-beta
>
> After upgrading some nodes in our cluster from 3.11.8 to 3.11.9, an error appeared on the upgraded nodes when trying to access *tablehistograms*. The same command run on our .8 nodes returns as expected; only the upgraded .9 nodes fail. Not all tables fail when queried, but about 90% of them do.
> We use Datastax MCAC which appears to query histograms every 30 seconds; this outputs to the system.log:
> {noformat}
> WARN [insights-3-1] 2020-11-09 01:11:22,331 UnixSocketClient.java:830 - Error reporting:
> java.lang.ArrayIndexOutOfBoundsException: 115
> at org.apache.cassandra.metrics.TableMetrics.combineHistograms(TableMetrics.java:261) ~[apache-cassandra-3.11.9.jar:3.11.9]
> at org.apache.cassandra.metrics.TableMetrics.access$000(TableMetrics.java:48) ~[apache-cassandra-3.11.9.jar:3.11.9]
> at org.apache.cassandra.metrics.TableMetrics$11.getValue(TableMetrics.java:376) ~[apache-cassandra-3.11.9.jar:3.11.9]
> at org.apache.cassandra.metrics.TableMetrics$11.getValue(TableMetrics.java:373) ~[apache-cassandra-3.11.9.jar:3.11.9]
> at com.datastax.mcac.UnixSocketClient.writeMetric(UnixSocketClient.java:839) [datastax-mcac-agent.jar:na]
> at com.datastax.mcac.UnixSocketClient.access$700(UnixSocketClient.java:78) [datastax-mcac-agent.jar:na]
> at com.datastax.mcac.UnixSocketClient$2.lambda$onGaugeAdded$0(UnixSocketClient.java:626) ~[datastax-mcac-agent.jar:na]
> at com.datastax.mcac.UnixSocketClient.writeGroup(UnixSocketClient.java:819) [datastax-mcac-agent.jar:na]
> at com.datastax.mcac.UnixSocketClient.lambda$restartMetricReporting$2(UnixSocketClient.java:798) [datastax-mcac-agent.jar:na]
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_272]
> at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:126) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
> at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
> at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:307) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
> at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
> at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
> at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_272]{noformat}
> Manually trying a histogram from the CLI:
> {noformat}
> $ nodetool tablehistograms logdata log_height_index
> error: 115
> -- StackTrace --
> java.lang.ArrayIndexOutOfBoundsException: 115
> at org.apache.cassandra.metrics.TableMetrics.combineHistograms(TableMetrics.java:261)
> at org.apache.cassandra.metrics.TableMetrics.access$000(TableMetrics.java:48)
> at org.apache.cassandra.metrics.TableMetrics$11.getValue(TableMetrics.java:376)
> at org.apache.cassandra.metrics.TableMetrics$11.getValue(TableMetrics.java:373)
> at org.apache.cassandra.metrics.CassandraMetricsRegistry$JmxGauge.getValue(CassandraMetricsRegistry.java:250)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:72)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at
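The `ArrayIndexOutOfBoundsException` in `TableMetrics.combineHistograms` is consistent with combining per-sstable histograms whose bucket arrays have different lengths, which is plausible if CASSANDRA-15164 changed how SSTable statistics are stored, as the comment above suggests. Below is a minimal sketch of a length-tolerant combine; this is illustrative only, not the actual Cassandra fix, and `HistogramCombiner` is a hypothetical name.

```java
// Hypothetical reconstruction of the failure mode: iterating over one
// histogram using another histogram's (longer) bucket count reads past the
// end of the shorter array. Sizing the result to the longest input and
// iterating each input by its own length avoids that.
public class HistogramCombiner
{
    public static long[] combine(long[][] histograms)
    {
        int maxLen = 0;
        for (long[] h : histograms)
            maxLen = Math.max(maxLen, h.length); // inputs may disagree on bucket count
        long[] combined = new long[maxLen];
        for (long[] h : histograms)
            for (int i = 0; i < h.length; i++)   // only iterate each input's own length
                combined[i] += h[i];
        return combined;
    }

    public static void main(String[] args)
    {
        long[][] perSstable = { {1, 2, 3}, {4, 5} }; // mismatched bucket counts
        long[] result = combine(perSstable);
        System.out.println(java.util.Arrays.toString(result)); // prints "[5, 7, 3]"
    }
}
```

With the mismatched inputs above, a combine that assumed every histogram had the first one's length would throw exactly this kind of out-of-bounds error on the shorter array.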
[jira] [Updated] (CASSANDRA-16189) Add tests for the Hint service metrics
[ https://issues.apache.org/jira/browse/CASSANDRA-16189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Lerer updated CASSANDRA-16189:
---------------------------------------
    Reviewers:   (was: Benjamin Lerer)

> Add tests for the Hint service metrics
> --
>
> Key: CASSANDRA-16189
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16189
> Project: Cassandra
> Issue Type: Improvement
> Components: Test/dtest/python
> Reporter: Benjamin Lerer
> Assignee: Mohamed Zafraan
> Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: 0001-added-hints-metrics-test.patch
>
> There are currently no tests for the hint metrics.
[jira] [Updated] (CASSANDRA-15582) 4.0 quality testing: metrics
[ https://issues.apache.org/jira/browse/CASSANDRA-15582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Lerer updated CASSANDRA-15582:
---------------------------------------
    Description:
The goal of this ticket is to have proper testing of the different metrics exposed via JMX, and to ensure that metrics that are not used in 4.0 have been properly deprecated.

The following table shows the current status of the metric tests and can be used to track the progress of this ticket:

|| Metrics || Status || test types || JIRA tickets ||
| Batch | {color:#00875A}*COVERED*{color} | unit tests | CASSANDRA-15718 |
| BufferPool | {color:#00875A}*COVERED*{color} | unit tests | CASSANDRA-15773 |
| Cache | {color:#00875A}*COVERED*{color} | unit tests | CASSANDRA-15788 |
| Client | {color:#DE350B}*TESTS MISSING*{color} | unit tests | CASSANDRA-16216 |
| ClientRequest | {color:#00875A}*COVERED*{color} | in-jvm tests | CASSANDRA-16183 |
| ClientRequestSize | {color:#00875A}*COVERED*{color} | unit tests | CASSANDRA-16184 |
| Cache | {color:#00875A}*COVERED*{color} | unit tests | CASSANDRA-15788 |
| CommitLog | {color:#DE350B}*TESTS MISSING*{color} | unit tests | CASSANDRA-16185 |
| Compaction | {color:#DE350B}*TESTS MISSING*{color} | unit tests | CASSANDRA-16192 |
| CQL | {color:#00875A}*COVERED*{color} | unit tests | |
| HintService | {color:#DE350B}*NO TESTS*{color} | dtests | CASSANDRA-16189 |
| Messaging/Internode | {color:#DE350B}*NO TESTS*{color} | in-jvm dtests | CASSANDRA-16193 |
| ReadRepair | {color:#DE350B}*TESTS MISSING*{color} | dtests, in-jvm dtests | CASSANDRA-16187 |
| Repair | {color:#DE350B}*NO TESTS*{color} | in-jvm dtests | CASSANDRA-16191 |
| Storage | {color:#00875A}*COVERED*{color} | unit tests | |
| Streaming | {color:#DE350B}*NO TESTS*{color} | dtests | CASSANDRA-16190 |
| Keyspace | {color:#DE350B}*TESTS MISSING*{color} | unit tests/in-jvm dtests | CASSANDRA-16188 |
| Table | {color:#DE350B}*TESTS MISSING*{color} | unit tests/in-jvm dtests | CASSANDRA-16188 |
| ThreadPoolMetrics | {color:#00875A}*COVERED*{color} | unit tests | CASSANDRA-16186 |

    was:
The goal of this ticket is to have proper testing of the different metrics exposed via JMX, and to ensure that metrics that are not used in 4.0 have been properly deprecated.

The following table shows the current status of the metric tests and can be used to track the progress of this ticket:

|| Metrics || Status || test types || JIRA tickets ||
| Batch | {color:#00875A}*COVERED*{color} | unit tests | CASSANDRA-15718 |
| BufferPool | {color:#00875A}*COVERED*{color} | unit tests | CASSANDRA-15773 |
| Cache | {color:#00875A}*COVERED*{color} | unit tests | CASSANDRA-15788 |
| Client | {color:#DE350B}*TESTS MISSING*{color} | unit tests | CASSANDRA-16216 |
| ClientRequest | {color:#DE350B}*NO TESTS*{color} | in-jvm tests | CASSANDRA-16183 |
| ClientRequestSize | {color:#00875A}*COVERED*{color} | unit tests | CASSANDRA-16184 |
| Cache | {color:#00875A}*COVERED*{color} | unit tests | CASSANDRA-15788 |
| CommitLog | {color:#DE350B}*TESTS MISSING*{color} | unit tests | CASSANDRA-16185 |
| Compaction | {color:#DE350B}*TESTS MISSING*{color} | unit tests | CASSANDRA-16192 |
| CQL | {color:#00875A}*COVERED*{color} | unit tests | |
| HintService | {color:#DE350B}*NO TESTS*{color} | dtests | CASSANDRA-16189 |
| Messaging/Internode | {color:#DE350B}*NO TESTS*{color} | in-jvm dtests | CASSANDRA-16193 |
| ReadRepair | {color:#DE350B}*TESTS MISSING*{color} | dtests, in-jvm dtests | CASSANDRA-16187 |
| Repair | {color:#DE350B}*NO TESTS*{color} | in-jvm dtests | CASSANDRA-16191 |
| Storage | {color:#00875A}*COVERED*{color} | unit tests | |
| Streaming | {color:#DE350B}*NO TESTS*{color} | dtests | CASSANDRA-16190 |
| Keyspace | {color:#DE350B}*TESTS MISSING*{color} | unit tests/in-jvm dtests | CASSANDRA-16188 |
| Table | {color:#DE350B}*TESTS MISSING*{color} | unit tests/in-jvm dtests | CASSANDRA-16188 |
| ThreadPoolMetrics | {color:#00875A}*COVERED*{color} | unit tests | CASSANDRA-16186 |

> 4.0 quality testing: metrics
>
> Key: CASSANDRA-15582
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15582
> Project: Cassandra
> Issue Type: Task
> Components: Test/dtest/python
> Reporter: Josh McKenzie
> Assignee: Benjamin Lerer
> Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: Screen Shot 2020-04-07 at 5.47.17 PM.png
>
> The goal of this ticket is to have proper testing of the different metrics exposed via JMX, and to ensure that metrics that are not used in 4.0 have been properly deprecated.
> The following table shows the current status of the metric tests and can be used to track the progress of this ticket:
> || Metrics || Status || test types || JIRA tickets ||
> | Batch | {color:#00875A}*COVERED*{color} | unit tests |