[jira] [Commented] (CASSANDRA-16213) Cannot replace_address /X because it doesn't exist in gossip

2020-11-13 Thread David Capwell (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231934#comment-17231934
 ] 

David Capwell commented on CASSANDRA-16213:
---

Finished validating assassinate and made sure to flesh out the different cases I could 
see.  org.apache.cassandra.gms.EndpointState#isEmpty does need to check for 
status in order for assassinate to work with this patch.

If you stop all nodes and bring up all but the host to remove, then assassinate 
the node to remove, it will still be "empty" based on version, but it will have a 
status.  If we do not check the status when we check for empty, we would then 
treat this endpoint as normal and move on, which isn't correct as it's in the 
LEFT state.

 

[~paulo] I added 
org.apache.cassandra.distributed.test.hostreplacement.AssassinatedEmptyNodeTest 
to flesh this case out if you want to take a closer look. EndpointState.isEmpty 
is only used in one spot now since we removed the filter, so I feel it's still best 
to check the state to make sure it is this specific case.
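For reference, a minimal toy model of the case above (hypothetical types and 
names, not Cassandra's actual EndpointState API): an endpoint that looks empty 
by version but still carries a LEFT status must not be treated as empty/normal.

{code:java}
// Toy model only; the real check lives in org.apache.cassandra.gms.EndpointState.
public class EmptyCheckSketch
{
    enum Status { NORMAL, LEFT }

    static final class ToyEndpointState
    {
        final long version;  // 0 == no versioned updates seen for this endpoint
        final Status status; // null == no status ever published

        ToyEndpointState(long version, Status status)
        {
            this.version = version;
            this.status = status;
        }

        // Version-only check: an assassinated node (no versions, status LEFT)
        // wrongly looks "empty" and would be handled as a normal endpoint.
        boolean isEmptyByVersionOnly()
        {
            return version == 0;
        }

        // The check described above: only empty if there is also no status,
        // so a LEFT node keeps being handled as LEFT.
        boolean isEmpty()
        {
            return version == 0 && status == null;
        }
    }

    public static void main(String[] args)
    {
        ToyEndpointState assassinated = new ToyEndpointState(0, Status.LEFT);
        System.out.println(assassinated.isEmptyByVersionOnly()); // true  (wrong conclusion)
        System.out.println(assassinated.isEmpty());              // false (status is LEFT)
    }
}
{code}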

> Cannot replace_address /X because it doesn't exist in gossip
> 
>
> Key: CASSANDRA-16213
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16213
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Gossip, Cluster/Membership
>Reporter: David Capwell
>Assignee: David Capwell
>Priority: Normal
> Fix For: 4.0-beta
>
>
> We see this exception around nodes crashing and trying to do a host 
> replacement; this error appears to be correlated with multiple node 
> failures.
> A simplified case to trigger this is the following:
> *) Have an N node cluster
> *) Shut down all N nodes
> *) Bring up N-1 nodes (at least 1 seed, else replace seed)
> *) Host replace the N-1th node -> this will fail with the above
> The reason this happens is that the N-1th node isn’t gossiping anymore, and 
> the existing nodes do not have its details in gossip (but have the details in 
> the peers table), so the host replacement fails as the node isn’t known in 
> gossip.
> This affects all versions (tested 3.0 and trunk, assume 2.2 as well)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16181) 4.0 Quality: Replication Test Audit

2020-11-13 Thread Caleb Rackliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231885#comment-17231885
 ] 

Caleb Rackliffe commented on CASSANDRA-16181:
-

[~adelapena] I think I've gotten to the point of diminishing returns on the 
summary doc, so take a look when you have a chance. Aside from some minor 
filling out of some of the unit tests, I think the biggest thing I'd want to do 
outside of CASSANDRA-16262 is to create a more comprehensive upgrade test along 
the lines of what you did for the mixed mode read repair tests, with a scenario 
modeled after {{TestReplication}} in {{replication_test.py}}. It probably 
won't be too hard to dynamically execute local reads across a fairly small 
dataset to verify that things are being replicated to the right places.

WDYT?

> 4.0 Quality: Replication Test Audit
> ---
>
> Key: CASSANDRA-16181
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16181
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/unit
>Reporter: Andres de la Peña
>Assignee: Caleb Rackliffe
>Priority: Normal
> Fix For: 4.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a subtask of CASSANDRA-15579 focusing on replication.
> I think that the main reference dtest for this is 
> [replication_test.py|https://github.com/apache/cassandra-dtest/blob/master/replication_test.py].
>  We should identify which other tests cover this and identify what should be 
> extended, similarly to what has been done with CASSANDRA-15977.
> The doc 
> [here|https://docs.google.com/document/d/1yPbquhAALIkkTRMmyOv5cceD5N5sPFMB1O4iOd3O7FM/edit?usp=sharing]
>  describes the existing state of testing around replication.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16181) 4.0 Quality: Replication Test Audit

2020-11-13 Thread Caleb Rackliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe updated CASSANDRA-16181:

Description: 
This is a subtask of CASSANDRA-15579 focusing on replication.

I think that the main reference dtest for this is 
[replication_test.py|https://github.com/apache/cassandra-dtest/blob/master/replication_test.py].
 We should identify which other tests cover this and identify what should be 
extended, similarly to what has been done with CASSANDRA-15977.

The doc 
[here|https://docs.google.com/document/d/1yPbquhAALIkkTRMmyOv5cceD5N5sPFMB1O4iOd3O7FM/edit?usp=sharing]
 describes the existing state of testing around replication.

  was:
This is a subtask of CASSANDRA-15579 focusing on replication.

I think that the main reference dtest for this is 
[replication_test.py|https://github.com/apache/cassandra-dtest/blob/master/replication_test.py].
 We should identify which other tests cover this and identify what should be 
extended, similarly to what has been done with CASSANDRA-15977.

The (WIP) doc 
[here|https://docs.google.com/document/d/1yPbquhAALIkkTRMmyOv5cceD5N5sPFMB1O4iOd3O7FM/edit?usp=sharing]
 describes the existing state of testing around replication.


> 4.0 Quality: Replication Test Audit
> ---
>
> Key: CASSANDRA-16181
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16181
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/unit
>Reporter: Andres de la Peña
>Assignee: Caleb Rackliffe
>Priority: Normal
> Fix For: 4.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a subtask of CASSANDRA-15579 focusing on replication.
> I think that the main reference dtest for this is 
> [replication_test.py|https://github.com/apache/cassandra-dtest/blob/master/replication_test.py].
>  We should identify which other tests cover this and identify what should be 
> extended, similarly to what has been done with CASSANDRA-15977.
> The doc 
> [here|https://docs.google.com/document/d/1yPbquhAALIkkTRMmyOv5cceD5N5sPFMB1O4iOd3O7FM/edit?usp=sharing]
>  describes the existing state of testing around replication.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16261) Prevent unbounded number of flushing tasks

2020-11-13 Thread Ekaterina Dimitrova (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ekaterina Dimitrova updated CASSANDRA-16261:

Test and Documentation Plan: 
https://issues.apache.org/jira/browse/CASSANDRA-16261?focusedCommentId=17231874&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17231874
 Status: Patch Available  (was: In Progress)

> Prevent unbounded number of flushing tasks
> --
>
> Key: CASSANDRA-16261
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16261
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Ekaterina Dimitrova
>Assignee: Ekaterina Dimitrova
>Priority: Normal
> Fix For: 3.11.x, 4.0-beta4
>
>
> The cleaner thread is not prevented from queueing an unbounded number of 
> flushing tasks for memtables that are almost empty.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16261) Prevent unbounded number of flushing tasks

2020-11-13 Thread Ekaterina Dimitrova (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231874#comment-17231874
 ] 

Ekaterina Dimitrova commented on CASSANDRA-16261:
-

This patch puts a cap on the maximum number of flushing tasks that can be 
enqueued by the memtable cleaner thread.

This will have the consequence of creating larger sstables if flushing cannot 
keep up, so we must choose the maximum number of pending tasks carefully. At 
the moment it is 
[configurable|https://github.com/ekaterinadimitrova2/cassandra/blob/CASSANDRA-16261-trunk/src/java/org/apache/cassandra/db/Memtable.java#L76]
 and set to twice the number of flush writers.
 When a memtable gets into the discarding state, all pending updates update both 
the used and reclaiming memory counters.
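As a rough sketch of the mechanism (assumed names only, not the actual 
Memtable/cleaner code), the cap behaves like a semaphore sized at twice the 
flush-writer count that must be acquired before a flush task is enqueued:

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

// Illustrative only: bounds how many flush tasks can be pending at once.
final class BoundedFlushScheduler
{
    private final ExecutorService flushExecutor;
    private final Semaphore pending;

    BoundedFlushScheduler(int flushWriters)
    {
        this.flushExecutor = Executors.newFixedThreadPool(flushWriters);
        this.pending = new Semaphore(flushWriters * 2); // cap: 2x flush writers
    }

    // Returns false instead of queueing when the cap is reached, so the
    // cleaner retries later rather than growing the queue without bound.
    boolean tryScheduleFlush(Runnable flushTask)
    {
        if (!pending.tryAcquire())
            return false;
        flushExecutor.execute(() -> {
            try { flushTask.run(); }
            finally { pending.release(); }
        });
        return true;
    }
}
{code}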

[trunk|https://github.com/ekaterinadimitrova2/cassandra/pull/76/commits/6ffa1802a1cfb8420db8a253ae5312fcffddfd6a]
 | [JAVA8 CI 
|https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/506/workflows/4411817d-bd4d-449a-b28e-8f9616eaf1f4]
 | [JAVA 11 CI 
|https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/506/workflows/9646169b-26bd-4d5b-aa5b-ba7d6522786d]

[3.11|https://github.com/ekaterinadimitrova2/cassandra/commit/1f0d02d0a8d04524a99574dc60c0f4b215520591]
 | [CI 
|https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/508/workflows/78295ba3-6a08-4e04-a39e-cf46a7913d02]
 

No new test failures, one more test class added - [MemtableCleanerThreadTest 
|https://github.com/ekaterinadimitrova2/cassandra/pull/76/commits/6ffa1802a1cfb8420db8a253ae5312fcffddfd6a#diff-ef05bf02f6f0b3ab7db707faeb3a6c0e69f67d330de07ca0876cccaf1ea9395fR44].

[~adelapena] would you mind reviewing it?
  

> Prevent unbounded number of flushing tasks
> --
>
> Key: CASSANDRA-16261
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16261
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Ekaterina Dimitrova
>Assignee: Ekaterina Dimitrova
>Priority: Normal
> Fix For: 3.11.x, 4.0-beta4
>
>
> The cleaner thread is not prevented from queueing an unbounded number of 
> flushing tasks for memtables that are almost empty.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Assigned] (CASSANDRA-4938) CREATE INDEX can block for creation now that schema changes may be concurrent

2020-11-13 Thread Kirk True (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-4938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirk True reassigned CASSANDRA-4938:


Assignee: Kirk True

> CREATE INDEX can block for creation now that schema changes may be concurrent
> -
>
> Key: CASSANDRA-4938
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4938
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Feature/2i Index
>Reporter: Krzysztof Cieslinski Cognitum
>Assignee: Kirk True
>Priority: Low
>  Labels: lhf
> Fix For: 4.x
>
>
> The response from the CREATE INDEX command returns before the secondary 
> index has actually been created. So the code below:
> {code:xml}
> CREATE INDEX ON tab(name);
> SELECT * FROM tab WHERE name = 'Chris';
> {code}
> doesn't return any rows (of course, in column family "tab", there are some 
> records with "name" value = 'Chris') or any errors (I would expect 
> something like ??"Bad Request: No indexed columns present in by-columns 
> clause with Equal operator"??).
> Inserting a delay between those two commands resolves the problem, so:
> {code:xml}
> CREATE INDEX ON tab(name);
> Sleep(timeout); // for column family with 2000 rows the timeout had to be set 
> for ~1 second 
> SELECT * FROM tab WHERE name = 'Chris';
> {code}
> will return all rows with values as specified.
> I'm using a single node cluster.
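As an aside, a less brittle workaround than a fixed sleep is to poll until the 
indexed query returns rows or a deadline passes (a sketch with a hypothetical 
query runner; this only works when, as here, matching rows are known to exist):

{code:java}
import java.util.List;
import java.util.function.Supplier;

public final class IndexReadyWait
{
    // Per the report, the premature SELECT returns zero rows without error,
    // so an empty result is treated as "index not ready yet" and retried.
    public static <T> List<T> selectWhenIndexReady(Supplier<List<T>> runIndexedQuery,
                                                   long timeoutMillis) throws InterruptedException
    {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true)
        {
            List<T> rows = runIndexedQuery.get();
            if (!rows.isEmpty() || System.currentTimeMillis() >= deadline)
                return rows;
            Thread.sleep(100); // back off briefly before retrying
        }
    }
}
{code}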



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16181) 4.0 Quality: Replication Test Audit

2020-11-13 Thread Caleb Rackliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231864#comment-17231864
 ] 

Caleb Rackliffe commented on CASSANDRA-16181:
-

[~adelapena] Created a PR [here|https://github.com/apache/cassandra/pull/821] 
to track the little things I've been tinkering w/ here and there. (No urgent 
need to review...)

> 4.0 Quality: Replication Test Audit
> ---
>
> Key: CASSANDRA-16181
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16181
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/unit
>Reporter: Andres de la Peña
>Assignee: Caleb Rackliffe
>Priority: Normal
> Fix For: 4.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a subtask of CASSANDRA-15579 focusing on replication.
> I think that the main reference dtest for this is 
> [replication_test.py|https://github.com/apache/cassandra-dtest/blob/master/replication_test.py].
>  We should identify which other tests cover this and identify what should be 
> extended, similarly to what has been done with CASSANDRA-15977.
> The (WIP) doc 
> [here|https://docs.google.com/document/d/1yPbquhAALIkkTRMmyOv5cceD5N5sPFMB1O4iOd3O7FM/edit?usp=sharing]
>  describes the existing state of testing around replication.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16213) Cannot replace_address /X because it doesn't exist in gossip

2020-11-13 Thread David Capwell (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231832#comment-17231832
 ] 

David Capwell commented on CASSANDRA-16213:
---

[~paulo] added the test and rebased to latest trunk as 2 recent commits impact 
this logic.

 

I am going to run the tests in a loop to make sure they are not flaky; if they 
are, I will split the class files or change the bootstrap schema properties.

 

The last thing on my plate is to validate assassinate; forgot to do this.

> Cannot replace_address /X because it doesn't exist in gossip
> 
>
> Key: CASSANDRA-16213
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16213
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Gossip, Cluster/Membership
>Reporter: David Capwell
>Assignee: David Capwell
>Priority: Normal
> Fix For: 4.0-beta
>
>
> We see this exception around nodes crashing and trying to do a host 
> replacement; this error appears to be correlated with multiple node 
> failures.
> A simplified case to trigger this is the following:
> *) Have an N node cluster
> *) Shut down all N nodes
> *) Bring up N-1 nodes (at least 1 seed, else replace seed)
> *) Host replace the N-1th node -> this will fail with the above
> The reason this happens is that the N-1th node isn’t gossiping anymore, and 
> the existing nodes do not have its details in gossip (but have the details in 
> the peers table), so the host replacement fails as the node isn’t known in 
> gossip.
> This affects all versions (tested 3.0 and trunk, assume 2.2 as well)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15158) Wait for schema agreement rather than in flight schema requests when bootstrapping

2020-11-13 Thread David Capwell (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231814#comment-17231814
 ] 

David Capwell commented on CASSANDRA-15158:
---

Committed 
https://github.com/apache/cassandra/commit/7d6f9b94dd0d00bfd29374d7a645e650f451023d

> Wait for schema agreement rather than in flight schema requests when 
> bootstrapping
> --
>
> Key: CASSANDRA-15158
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15158
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Gossip, Cluster/Schema
>Reporter: Vincent White
>Assignee: Blake Eggleston
>Priority: Normal
> Fix For: 3.0.x, 3.11.x, 4.0-beta
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently when a node is bootstrapping we use a set of latches 
> (org.apache.cassandra.service.MigrationTask#inflightTasks) to keep track of 
> in-flight schema pull requests, and we don't proceed with 
> bootstrapping/streaming until all the latches are released (or we timeout 
> waiting for each one). One issue with this is that if we have a large schema, 
> or the retrieval of the schema from the other nodes was unexpectedly slow 
> then we have no explicit check in place to ensure we have actually received a 
> schema before we proceed.
> While it's possible to increase "migration_task_wait_in_seconds" to force the 
> node to wait on each latch longer, there are cases where this doesn't help 
> because the callbacks for the schema pull requests have expired off the 
> messaging service's callback map 
> (org.apache.cassandra.net.MessagingService#callbacks) after 
> request_timeout_in_ms (default 10 seconds) before the other nodes were able 
> to respond to the new node.
> This patch checks for schema agreement between the bootstrapping node and the 
> rest of the live nodes before proceeding with bootstrapping. It also adds a 
> check to prevent the new node from flooding existing nodes with simultaneous 
> schema pull requests as can happen in large clusters.
> Removing the latch system should also prevent new nodes in large clusters 
> from getting stuck for extended amounts of time as they wait 
> `migration_task_wait_in_seconds` on each of the latches left orphaned by the 
> timed out callbacks.
>  
> ||3.11||
> |[PoC|https://github.com/apache/cassandra/compare/cassandra-3.11...vincewhite:check_for_schema]|
> |[dtest|https://github.com/apache/cassandra-dtest/compare/master...vincewhite:wait_for_schema_agreement]|
>  
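A minimal sketch of the agreement check the description names (assumed method 
shapes; in the patch this lives behind MigrationCoordinator): poll until every 
live endpoint reports the same schema version we hold, up to a deadline.

{code:java}
import java.util.Map;
import java.util.UUID;
import java.util.function.Supplier;

public final class SchemaAgreementSketch
{
    // Both suppliers are stand-ins for gossip-backed lookups of our own schema
    // version and the versions reported by each live endpoint.
    public static boolean awaitSchemaAgreement(Supplier<UUID> ourVersion,
                                               Supplier<Map<String, UUID>> liveEndpointVersions,
                                               long timeoutMillis) throws InterruptedException
    {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline)
        {
            UUID ours = ourVersion.get();
            boolean agreed = liveEndpointVersions.get().values().stream().allMatch(ours::equals);
            if (agreed)
                return true;
            Thread.sleep(1000); // re-check once per second
        }
        return false;
    }
}
{code}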



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-15158) Wait for schema agreement rather than in flight schema requests when bootstrapping

2020-11-13 Thread David Capwell (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231782#comment-17231782
 ] 

David Capwell edited comment on CASSANDRA-15158 at 11/13/20, 8:38 PM:
--

Starting commit

CI Results: Yellow.  3.11's org.apache.cassandra.service.MigrationCoordinatorTest 
fails but passes locally; -trunk's 
org.apache.cassandra.distributed.test.ring.BootstrapTest fails frequently due 
to schemas not being present, so I added a commit which increases the timeout 
from 30s to 90s-; and other expected issues.
||Branch||Source||Circle CI||Jenkins||
|cassandra-3.0|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-15158-cassandra-3.0-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-15158-cassandra-3.0-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/200/]|
|cassandra-3.11|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-15158-cassandra-3.11-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-15158-cassandra-3.11-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/201/]|
|trunk|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-15158-trunk-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-15158-trunk-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/202/]|



was (Author: dcapwell):
Starting commit

CI Results: Yellow.  3.11's org.apache.cassandra.service.MigrationCoordinatorTest 
fails but passes locally; trunk's 
org.apache.cassandra.distributed.test.ring.BootstrapTest fails frequently due 
to schemas not being present, so I added a commit which increases the timeout 
from 30s to 90s; and other expected issues.
||Branch||Source||Circle CI||Jenkins||
|cassandra-3.0|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-15158-cassandra-3.0-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-15158-cassandra-3.0-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/200/]|
|cassandra-3.11|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-15158-cassandra-3.11-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-15158-cassandra-3.11-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/201/]|
|trunk|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-15158-trunk-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-15158-trunk-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/202/]|


> Wait for schema agreement rather than in flight schema requests when 
> bootstrapping
> --
>
> Key: CASSANDRA-15158
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15158
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Gossip, Cluster/Schema
>Reporter: Vincent White
>Assignee: Blake Eggleston
>Priority: Normal
> Fix For: 3.0.x, 3.11.x, 4.0-beta
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently when a node is bootstrapping we use a set of latches 
> (org.apache.cassandra.service.MigrationTask#inflightTasks) to keep track of 
> in-flight schema pull requests, and we don't proceed with 
> bootstrapping/streaming until all the latches are released (or we timeout 
> waiting for each one). One issue with this is that if we have a large schema, 
> or the retrieval of the schema from the other nodes was unexpectedly slow 
> then we have no explicit check in place to ensure we have actually received a 
> schema before we proceed.
> While it's possible to increase "migration_task_wait_in_seconds" to force the 
> node to wait on each latch longer, there are cases where this doesn't help 
> because the callbacks for the schema pull requests have expired off the 
> messaging service's callback map 
> (org.apache.cassandra.net.MessagingService#callbacks) after 
> request_timeout_in_ms (default 10 seconds) before the other nodes were able 
> to respond to the new node.
> This patch checks for schema agreement between the 

[cassandra] branch cassandra-3.11 updated (8ffa79f -> 50d8245)

2020-11-13 Thread dcapwell
This is an automated email from the ASF dual-hosted git repository.

dcapwell pushed a change to branch cassandra-3.11
in repository https://gitbox.apache.org/repos/asf/cassandra.git.


from 8ffa79f  Merge branch 'cassandra-3.0' into cassandra-3.11
 new 17ebee3  CASSANDRA-15158 fixed SCHEMA_DELAY to use getSchemaDelay and 
no longer convert it from seconds to millis (since it's already millis)
 new 50d8245  Merge branch 'cassandra-3.0' into cassandra-3.11

The 2 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 src/java/org/apache/cassandra/service/StorageService.java | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[cassandra] branch trunk updated (94663c3 -> e4fac35)

2020-11-13 Thread dcapwell
This is an automated email from the ASF dual-hosted git repository.

dcapwell pushed a change to branch trunk
in repository https://gitbox.apache.org/repos/asf/cassandra.git.


from 94663c3  Relax < check to <= for NodeToolGossipInfoTest
 new 17ebee3  CASSANDRA-15158 fixed SCHEMA_DELAY to use getSchemaDelay and 
no longer convert it from seconds to millis (since it's already millis)
 new 50d8245  Merge branch 'cassandra-3.0' into cassandra-3.11
 new e4fac35  Merge branch 'cassandra-3.11' into trunk

The 3 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .../apache/cassandra/config/CassandraRelevantProperties.java | 11 +++
 src/java/org/apache/cassandra/service/StorageService.java| 12 +++-
 .../apache/cassandra/distributed/action/GossipHelper.java|  7 ++-
 .../cassandra/distributed/test/ring/BootstrapTest.java   |  6 --
 4 files changed, 28 insertions(+), 8 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[cassandra] 01/01: Merge branch 'cassandra-3.0' into cassandra-3.11

2020-11-13 Thread dcapwell
This is an automated email from the ASF dual-hosted git repository.

dcapwell pushed a commit to branch cassandra-3.11
in repository https://gitbox.apache.org/repos/asf/cassandra.git

commit 50d8245d76aa76747f8bd6ae3947d22e5a02d290
Merge: 8ffa79f 17ebee3
Author: David Capwell 
AuthorDate: Fri Nov 13 12:36:31 2020 -0800

Merge branch 'cassandra-3.0' into cassandra-3.11

 src/java/org/apache/cassandra/service/StorageService.java | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --cc src/java/org/apache/cassandra/service/StorageService.java
index 734f176,a72b530..eb13df1
--- a/src/java/org/apache/cassandra/service/StorageService.java
+++ b/src/java/org/apache/cassandra/service/StorageService.java
@@@ -149,7 -143,7 +149,7 @@@ public class StorageService extends Not
  String newdelay = System.getProperty("cassandra.schema_delay_ms");
  if (newdelay != null)
  {
--logger.info("Overriding SCHEMA_DELAY to {}ms", newdelay);
++logger.info("Overriding SCHEMA_DELAY_MILLIS to {}ms", newdelay);
  return Integer.parseInt(newdelay);
  }
  else


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[cassandra] 01/01: Merge branch 'cassandra-3.11' into trunk

2020-11-13 Thread dcapwell
This is an automated email from the ASF dual-hosted git repository.

dcapwell pushed a commit to branch trunk
in repository https://gitbox.apache.org/repos/asf/cassandra.git

commit e4fac3582e0a9dda182313a3aa784be35d965f4e
Merge: 94663c3 50d8245
Author: David Capwell 
AuthorDate: Fri Nov 13 12:37:46 2020 -0800

Merge branch 'cassandra-3.11' into trunk

 .../apache/cassandra/config/CassandraRelevantProperties.java | 11 +++
 src/java/org/apache/cassandra/service/StorageService.java| 12 +++-
 .../apache/cassandra/distributed/action/GossipHelper.java|  7 ++-
 .../cassandra/distributed/test/ring/BootstrapTest.java   |  6 --
 4 files changed, 28 insertions(+), 8 deletions(-)

diff --cc src/java/org/apache/cassandra/config/CassandraRelevantProperties.java
index 881b7d9,000..7402aa1
mode 100644,00..100644
--- a/src/java/org/apache/cassandra/config/CassandraRelevantProperties.java
+++ b/src/java/org/apache/cassandra/config/CassandraRelevantProperties.java
@@@ -1,240 -1,0 +1,251 @@@
 +/*
 + * Licensed to the Apache Software Foundation (ASF) under one
 + * or more contributor license agreements.  See the NOTICE file
 + * distributed with this work for additional information
 + * regarding copyright ownership.  The ASF licenses this file
 + * to you under the Apache License, Version 2.0 (the
 + * "License"); you may not use this file except in compliance
 + * with the License.  You may obtain a copy of the License at
 + *
 + * http://www.apache.org/licenses/LICENSE-2.0
 + *
 + * Unless required by applicable law or agreed to in writing, software
 + * distributed under the License is distributed on an "AS IS" BASIS,
 + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 + * See the License for the specific language governing permissions and
 + * limitations under the License.
 + */
 +
 +package org.apache.cassandra.config;
 +
 +import org.apache.cassandra.exceptions.ConfigurationException;
 +
 +/** A class that extracts system properties for the cassandra node it runs within. */
 +public enum CassandraRelevantProperties
 +{
 +    //base JVM properties
 +    JAVA_HOME("java.home"),
 +    CASSANDRA_PID_FILE ("cassandra-pidfile"),
 +
 +    /**
 +     * Indicates the temporary directory used by the Java Virtual Machine (JVM)
 +     * to create and store temporary files.
 +     */
 +    JAVA_IO_TMPDIR ("java.io.tmpdir"),
 +
 +    /**
 +     * Path from which to load native libraries.
 +     * Default is absolute path to lib directory.
 +     */
 +    JAVA_LIBRARY_PATH ("java.library.path"),
 +
 +    JAVA_SECURITY_EGD ("java.security.egd"),
 +
 +    /** Java Runtime Environment version */
 +    JAVA_VERSION ("java.version"),
 +
 +    /** Java Virtual Machine implementation name */
 +    JAVA_VM_NAME ("java.vm.name"),
 +
 +    /** Line separator ("\n" on UNIX). */
 +    LINE_SEPARATOR ("line.separator"),
 +
 +    /** Java class path. */
 +    JAVA_CLASS_PATH ("java.class.path"),
 +
 +    /** Operating system architecture. */
 +    OS_ARCH ("os.arch"),
 +
 +    /** Operating system name. */
 +    OS_NAME ("os.name"),
 +
 +    /** User's home directory. */
 +    USER_HOME ("user.home"),
 +
 +    /** Platform word size sun.arch.data.model. Examples: "32", "64", "unknown"*/
 +    SUN_ARCH_DATA_MODEL ("sun.arch.data.model"),
 +
 +    //JMX properties
 +    /**
 +     * The value of this property represents the host name string
 +     * that should be associated with remote stubs for locally created remote objects,
 +     * in order to allow clients to invoke methods on the remote object.
 +     */
 +    JAVA_RMI_SERVER_HOSTNAME ("java.rmi.server.hostname"),
 +
 +    /**
 +     * If this value is true, object identifiers for remote objects exported by this VM will be generated by using
 +     * a cryptographically secure random number generator. The default value is false.
 +     */
 +    JAVA_RMI_SERVER_RANDOM_ID ("java.rmi.server.randomIDs"),
 +
 +    /**
 +     * This property indicates whether password authentication for remote monitoring is
 +     * enabled. By default it is disabled - com.sun.management.jmxremote.authenticate
 +     */
 +    COM_SUN_MANAGEMENT_JMXREMOTE_AUTHENTICATE ("com.sun.management.jmxremote.authenticate"),
 +
 +    /**
 +     * The port number to which the RMI connector will be bound - com.sun.management.jmxremote.rmi.port.
 +     * An Integer object that represents the value of the second argument is returned
 +     * if there is no port specified, if the port does not have the correct numeric format,
 +     * or if the specified name is empty or null.
 +     */
 +    COM_SUN_MANAGEMENT_JMXREMOTE_RMI_PORT ("com.sun.management.jmxremote.rmi.port", "0"),
 +
 +    /** Cassandra jmx remote port */
 +    CASSANDRA_JMX_REMOTE_PORT("cassandra.jmx.remote.port"),
 +
 +    /** This property indicates whether SSL is enabled for monitoring remotely. Default is set to false. */
 +    COM_SUN_MANAGEMENT_JMXREMOTE_SSL 

[cassandra] branch cassandra-3.0 updated: CASSANDRA-15158 fixed SCHEMA_DELAY to use getSchemaDelay and no longer convert it from secones to millis (since its already millis)

2020-11-13 Thread dcapwell
This is an automated email from the ASF dual-hosted git repository.

dcapwell pushed a commit to branch cassandra-3.0
in repository https://gitbox.apache.org/repos/asf/cassandra.git


The following commit(s) were added to refs/heads/cassandra-3.0 by this push:
 new 17ebee3  CASSANDRA-15158 fixed SCHEMA_DELAY to use getSchemaDelay and 
no longer convert it from seconds to millis (since it's already millis)
17ebee3 is described below

commit 17ebee3186d1bfdee9a2b355cb8f139492d144e8
Author: David Capwell 
AuthorDate: Fri Nov 13 11:18:55 2020 -0800

CASSANDRA-15158 fixed SCHEMA_DELAY to use getSchemaDelay and no longer 
convert it from seconds to millis (since it's already millis)

patch by David Capwell; reviewed by Blake Eggleston for CASSANDRA-15158
---
 src/java/org/apache/cassandra/service/StorageService.java | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/java/org/apache/cassandra/service/StorageService.java b/src/java/org/apache/cassandra/service/StorageService.java
index 3718e8c..a72b530 100644
--- a/src/java/org/apache/cassandra/service/StorageService.java
+++ b/src/java/org/apache/cassandra/service/StorageService.java
@@ -113,7 +113,7 @@ public class StorageService extends NotificationBroadcasterSupport implements IE
     private static final Logger logger = LoggerFactory.getLogger(StorageService.class);
 
     public static final int RING_DELAY = getRingDelay(); // delay after which we assume ring has stablized
-    public static final int SCHEMA_DELAY = getRingDelay(); // delay after which we assume ring has stablized
+    public static final int SCHEMA_DELAY_MILLIS = getSchemaDelay();
 
     private static final boolean REQUIRE_SCHEMAS = !Boolean.getBoolean("cassandra.skip_schema_check");
 
@@ -873,7 +873,7 @@ public class StorageService extends NotificationBroadcasterSupport implements IE
             Uninterruptibles.sleepUninterruptibly(1, TimeUnit.SECONDS);
         }
 
-        boolean schemasReceived = MigrationCoordinator.instance.awaitSchemaRequests(TimeUnit.SECONDS.toMillis(SCHEMA_DELAY));
+        boolean schemasReceived = MigrationCoordinator.instance.awaitSchemaRequests(SCHEMA_DELAY_MILLIS);
 
         if (schemasReceived)
             return;


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15158) Wait for schema agreement rather than in flight schema requests when bootstrapping

2020-11-13 Thread Blake Eggleston (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231810#comment-17231810
 ] 

Blake Eggleston commented on CASSANDRA-15158:
-

Thanks David, +1

> Wait for schema agreement rather than in flight schema requests when 
> bootstrapping
> --
>
> Key: CASSANDRA-15158
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15158
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Gossip, Cluster/Schema
>Reporter: Vincent White
>Assignee: Blake Eggleston
>Priority: Normal
> Fix For: 3.0.x, 3.11.x, 4.0-beta
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently when a node is bootstrapping we use a set of latches 
> (org.apache.cassandra.service.MigrationTask#inflightTasks) to keep track of 
> in-flight schema pull requests, and we don't proceed with 
> bootstrapping/streaming until all the latches are released (or we timeout 
> waiting for each one). One issue with this is that if we have a large schema, 
> or the retrieval of the schema from the other nodes was unexpectedly slow 
> then we have no explicit check in place to ensure we have actually received a 
> schema before we proceed.
> While it's possible to increase "migration_task_wait_in_seconds" to force the 
> node to wait on each latch longer, there are cases where this doesn't help 
> because the callbacks for the schema pull requests have expired off the 
> messaging service's callback map 
> (org.apache.cassandra.net.MessagingService#callbacks) after 
> request_timeout_in_ms (default 10 seconds) before the other nodes were able 
> to respond to the new node.
> This patch checks for schema agreement between the bootstrapping node and the 
> rest of the live nodes before proceeding with bootstrapping. It also adds a 
> check to prevent the new node from flooding existing nodes with simultaneous 
> schema pull requests as can happen in large clusters.
> Removing the latch system should also prevent new nodes in large clusters 
> from getting stuck for extended amounts of time as they wait 
> `migration_task_wait_in_seconds` on each of the latches left orphaned by the 
> timed out callbacks.
>  
> ||3.11||
> |[PoC|https://github.com/apache/cassandra/compare/cassandra-3.11...vincewhite:check_for_schema]|
> |[dtest|https://github.com/apache/cassandra-dtest/compare/master...vincewhite:wait_for_schema_agreement]|
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-15158) Wait for schema agreement rather than in flight schema requests when bootstrapping

2020-11-13 Thread David Capwell (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231782#comment-17231782
 ] 

David Capwell edited comment on CASSANDRA-15158 at 11/13/20, 8:20 PM:
--

Starting commit

CI Results: Yellow.  3.11's org.apache.cassandra.service.MigrationCoordinatorTest 
fails but passes locally; trunk's 
org.apache.cassandra.distributed.test.ring.BootstrapTest fails frequently due 
to schemas not being present, so I added a commit which increases the timeout 
from 30s to 90s; and other expected issues.
||Branch||Source||Circle CI||Jenkins||
|cassandra-3.0|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-15158-cassandra-3.0-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-15158-cassandra-3.0-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/200/]|
|cassandra-3.11|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-15158-cassandra-3.11-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-15158-cassandra-3.11-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/201/]|
|trunk|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-15158-trunk-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-15158-trunk-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/202/]|



was (Author: dcapwell):
Starting commit

CI Results (pending):
||Branch||Source||Circle CI||Jenkins||
|cassandra-3.0|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-15158-cassandra-3.0-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-15158-cassandra-3.0-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/200/]|
|cassandra-3.11|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-15158-cassandra-3.11-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-15158-cassandra-3.11-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/201/]|
|trunk|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-15158-trunk-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-15158-trunk-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/202/]|


> Wait for schema agreement rather than in flight schema requests when 
> bootstrapping
> --
>
> Key: CASSANDRA-15158
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15158
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Gossip, Cluster/Schema
>Reporter: Vincent White
>Assignee: Blake Eggleston
>Priority: Normal
> Fix For: 3.0.x, 3.11.x, 4.0-beta
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently when a node is bootstrapping we use a set of latches 
> (org.apache.cassandra.service.MigrationTask#inflightTasks) to keep track of 
> in-flight schema pull requests, and we don't proceed with 
> bootstrapping/streaming until all the latches are released (or we timeout 
> waiting for each one). One issue with this is that if we have a large schema, 
> or the retrieval of the schema from the other nodes was unexpectedly slow 
> then we have no explicit check in place to ensure we have actually received a 
> schema before we proceed.
> While it's possible to increase "migration_task_wait_in_seconds" to force the 
> node to wait on each latch longer, there are cases where this doesn't help 
> because the callbacks for the schema pull requests have expired off the 
> messaging service's callback map 
> (org.apache.cassandra.net.MessagingService#callbacks) after 
> request_timeout_in_ms (default 10 seconds) before the other nodes were able 
> to respond to the new node.
> This patch checks for schema agreement between the bootstrapping node and the 
> rest of the live nodes before proceeding with bootstrapping. It also adds a 
> check to prevent the new node from flooding existing nodes with simultaneous 
> schema pull requests as can happen in large clusters.
> Removing the latch system 

[jira] [Updated] (CASSANDRA-16262) 4.0 Quality: Coordination & Replication Fuzz Testing

2020-11-13 Thread Caleb Rackliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe updated CASSANDRA-16262:

Description: 
CASSANDRA-16180, CASSANDRA-16181, and CASSANDRA-15977 have largely focused on 
auditing the existing tests around coordination, replication, and read-repair, 
respectively. We've expanded existing test cases, added coverage around 
components that we've refactored along the way, and added in-JVM dtest upgrade 
tests where possible.

What remains is verifying the distributed read and write paths in the face of 
common operational events, namely node restarts, bootstrapping, decommission, 
and cleanup. If we can find a way to simulate these events, 
[Harry|https://github.com/apache/cassandra-harry] seems like a good candidate 
to host the verification logic itself.

To keep things simple initially, I would propose that we start by testing 
simple read-only and write-only workloads (the former without read repair).

  was:
CASSANDRA-16180, CASSANDRA-16181, and CASSANDRA-15977 have largely focused on 
auditing the existing tests around coordination, replication, and read-repair, 
respectively. We've expanded existing test cases, added coverage around 
components that we've refactored along the way, and added in-JVM dtest upgrade 
tests where possible.

What remains is verifying the distributed read and write paths in the face of 
common operational events, namely node restarts, bootstrapping, and 
decommission. If we can find a way to simulate these events, 
[Harry|https://github.com/apache/cassandra-harry] seems like a good candidate 
to host the verification logic itself.

To keep things simple initially, I would propose that we start by testing 
simple read-only and write-only workloads (the former without read repair).


> 4.0 Quality: Coordination & Replication Fuzz Testing
> 
>
> Key: CASSANDRA-16262
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16262
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/fuzz
>Reporter: Caleb Rackliffe
>Priority: Normal
> Fix For: 4.0-rc
>
>
> CASSANDRA-16180, CASSANDRA-16181, and CASSANDRA-15977 have largely focused on 
> auditing the existing tests around coordination, replication, and 
> read-repair, respectively. We've expanded existing test cases, added coverage 
> around components that we've refactored along the way, and added in-JVM dtest 
> upgrade tests where possible.
> What remains is verifying the distributed read and write paths in the face of 
> common operational events, namely node restarts, bootstrapping, decommission, 
> and cleanup. If we can find a way to simulate these events, 
> [Harry|https://github.com/apache/cassandra-harry] seems like a good candidate 
> to host the verification logic itself.
> To keep things simple initially, I would propose that we start by testing 
> simple read-only and write-only workloads (the former without read repair).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16185) Add tests to cover CommitLog metrics

2020-11-13 Thread Adam Holmberg (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Holmberg updated CASSANDRA-16185:
--
Status: Patch Available  (was: Review In Progress)

> Add tests to cover CommitLog metrics
> 
>
> Key: CASSANDRA-16185
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16185
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Test/unit
>Reporter: Benjamin Lerer
>Assignee: Yasar Arafath Baigh
>Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: 0001-Unit-Test-cases-for-CommitLogMetrics.patch
>
>
> The only CommitLog metric that seems to be covered by unit tests is 
> {{oversizedMutations}}. We should add tests for the other ones.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15158) Wait for schema agreement rather than in flight schema requests when bootstrapping

2020-11-13 Thread David Capwell (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231782#comment-17231782
 ] 

David Capwell commented on CASSANDRA-15158:
---

Starting commit

CI Results (pending):
||Branch||Source||Circle CI||Jenkins||
|cassandra-3.0|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-15158-cassandra-3.0-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-15158-cassandra-3.0-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/200/]|
|cassandra-3.11|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-15158-cassandra-3.11-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-15158-cassandra-3.11-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/201/]|
|trunk|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-15158-trunk-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-15158-trunk-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/202/]|


> Wait for schema agreement rather than in flight schema requests when 
> bootstrapping
> --
>
> Key: CASSANDRA-15158
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15158
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Gossip, Cluster/Schema
>Reporter: Vincent White
>Assignee: Blake Eggleston
>Priority: Normal
> Fix For: 3.0.x, 3.11.x, 4.0-beta
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently when a node is bootstrapping we use a set of latches 
> (org.apache.cassandra.service.MigrationTask#inflightTasks) to keep track of 
> in-flight schema pull requests, and we don't proceed with 
> bootstrapping/streaming until all the latches are released (or we timeout 
> waiting for each one). One issue with this is that if we have a large schema, 
> or the retrieval of the schema from the other nodes was unexpectedly slow 
> then we have no explicit check in place to ensure we have actually received a 
> schema before we proceed.
> While it's possible to increase "migration_task_wait_in_seconds" to force the 
> node to wait on each latch longer, there are cases where this doesn't help 
> because the callbacks for the schema pull requests have expired off the 
> messaging service's callback map 
> (org.apache.cassandra.net.MessagingService#callbacks) after 
> request_timeout_in_ms (default 10 seconds) before the other nodes were able 
> to respond to the new node.
> This patch checks for schema agreement between the bootstrapping node and the 
> rest of the live nodes before proceeding with bootstrapping. It also adds a 
> check to prevent the new node from flooding existing nodes with simultaneous 
> schema pull requests as can happen in large clusters.
> Removing the latch system should also prevent new nodes in large clusters 
> from getting stuck for extended amounts of time as they wait 
> `migration_task_wait_in_seconds` on each of the latches left orphaned by the 
> timed out callbacks.
>  
> ||3.11||
> |[PoC|https://github.com/apache/cassandra/compare/cassandra-3.11...vincewhite:check_for_schema]|
> |[dtest|https://github.com/apache/cassandra-dtest/compare/master...vincewhite:wait_for_schema_agreement]|
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15158) Wait for schema agreement rather than in flight schema requests when bootstrapping

2020-11-13 Thread David Capwell (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231777#comment-17231777
 ] 

David Capwell commented on CASSANDRA-15158:
---

Patches:

3.0: https://github.com/dcapwell/cassandra/tree/patchfix/CASSANDRA-15158-3.0
3.11: https://github.com/dcapwell/cassandra/tree/patchfix/CASSANDRA-15158-3.11
trunk: https://github.com/dcapwell/cassandra/tree/patchfix/CASSANDRA-15158-trunk

A test was added in CASSANDRA-16213 which shows this issue; it's only stable 
once this patch is applied (without it, the test fails).

> Wait for schema agreement rather than in flight schema requests when 
> bootstrapping
> --
>
> Key: CASSANDRA-15158
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15158
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Gossip, Cluster/Schema
>Reporter: Vincent White
>Assignee: Blake Eggleston
>Priority: Normal
> Fix For: 3.0.x, 3.11.x, 4.0-beta
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently when a node is bootstrapping we use a set of latches 
> (org.apache.cassandra.service.MigrationTask#inflightTasks) to keep track of 
> in-flight schema pull requests, and we don't proceed with 
> bootstrapping/streaming until all the latches are released (or we timeout 
> waiting for each one). One issue with this is that if we have a large schema, 
> or the retrieval of the schema from the other nodes was unexpectedly slow 
> then we have no explicit check in place to ensure we have actually received a 
> schema before we proceed.
> While it's possible to increase "migration_task_wait_in_seconds" to force the 
> node to wait on each latch longer, there are cases where this doesn't help 
> because the callbacks for the schema pull requests have expired off the 
> messaging service's callback map 
> (org.apache.cassandra.net.MessagingService#callbacks) after 
> request_timeout_in_ms (default 10 seconds) before the other nodes were able 
> to respond to the new node.
> This patch checks for schema agreement between the bootstrapping node and the 
> rest of the live nodes before proceeding with bootstrapping. It also adds a 
> check to prevent the new node from flooding existing nodes with simultaneous 
> schema pull requests as can happen in large clusters.
> Removing the latch system should also prevent new nodes in large clusters 
> from getting stuck for extended amounts of time as they wait 
> `migration_task_wait_in_seconds` on each of the latches left orphaned by the 
> timed out callbacks.
>  
> ||3.11||
> |[PoC|https://github.com/apache/cassandra/compare/cassandra-3.11...vincewhite:check_for_schema]|
> |[dtest|https://github.com/apache/cassandra-dtest/compare/master...vincewhite:wait_for_schema_agreement]|
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16213) Cannot replace_address /X because it doesn't exist in gossip

2020-11-13 Thread David Capwell (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231767#comment-17231767
 ] 

David Capwell commented on CASSANDRA-16213:
---

I plan to fix the schema wait logic in 
https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231757#comment-17231757
 to keep this patch clean of it, but for the moment the logic is in this branch 
to get a stable test.

> Cannot replace_address /X because it doesn't exist in gossip
> 
>
> Key: CASSANDRA-16213
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16213
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Gossip, Cluster/Membership
>Reporter: David Capwell
>Assignee: David Capwell
>Priority: Normal
> Fix For: 4.0-beta
>
>
> We see this exception around nodes crashing and trying to do a host 
> replacement; this error appears to be correlated with multiple node 
> failures.
> A simplified case to trigger this is the following:
> *) Have an N node cluster
> *) Shut down all N nodes
> *) Bring up N-1 nodes (at least 1 seed, else replace seed)
> *) Host replace the N-1th node -> this will fail with the above
> The reason this happens is that the N-1th node isn’t gossiping anymore, and 
> the existing nodes do not have its details in gossip (but have the details in 
> the peers table), so the host replacement fails as the node isn’t known in 
> gossip.
> This affects all versions (tested 3.0 and trunk, assume 2.2 as well)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15158) Wait for schema agreement rather than in flight schema requests when bootstrapping

2020-11-13 Thread David Capwell (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231757#comment-17231757
 ] 

David Capwell commented on CASSANDRA-15158:
---

Found a small set of typos which cause us to wait for schemas for 8h20m rather 
than 30s; going to submit a patch here and fix it in all 3 branches...
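For scale, a worked illustration of the unit mixup described above (numbers per 
this comment, matching the SCHEMA_DELAY fix committed on this ticket): a value 
already in milliseconds converted again as if it were seconds inflates the wait 
by a factor of 1000.

{code:java}
import java.util.concurrent.TimeUnit;

public class SchemaDelayUnits
{
    public static void main(String[] args)
    {
        long schemaDelayMillis = 30_000; // intended wait: 30s, already in millis

        // The typo: converting the millisecond value as if it were seconds.
        long wrong = TimeUnit.SECONDS.toMillis(schemaDelayMillis);
        System.out.println(TimeUnit.MILLISECONDS.toMinutes(wrong)); // 500 minutes == 8h20m

        // The fix: pass the millisecond value through unchanged.
        long right = schemaDelayMillis;
        System.out.println(TimeUnit.MILLISECONDS.toSeconds(right)); // 30 seconds
    }
}
{code}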

> Wait for schema agreement rather than in flight schema requests when 
> bootstrapping
> --
>
> Key: CASSANDRA-15158
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15158
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Gossip, Cluster/Schema
>Reporter: Vincent White
>Assignee: Blake Eggleston
>Priority: Normal
> Fix For: 3.0.x, 3.11.x, 4.0-beta
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently when a node is bootstrapping we use a set of latches 
> (org.apache.cassandra.service.MigrationTask#inflightTasks) to keep track of 
> in-flight schema pull requests, and we don't proceed with 
> bootstrapping/streaming until all the latches are released (or we timeout 
> waiting for each one). One issue with this is that if we have a large schema, 
> or the retrieval of the schema from the other nodes was unexpectedly slow 
> then we have no explicit check in place to ensure we have actually received a 
> schema before we proceed.
> While it's possible to increase "migration_task_wait_in_seconds" to force the 
> node to wait on each latch longer, there are cases where this doesn't help 
> because the callbacks for the schema pull requests have expired off the 
> messaging service's callback map 
> (org.apache.cassandra.net.MessagingService#callbacks) after 
> request_timeout_in_ms (default 10 seconds) before the other nodes were able 
> to respond to the new node.
> This patch checks for schema agreement between the bootstrapping node and the 
> rest of the live nodes before proceeding with bootstrapping. It also adds a 
> check to prevent the new node from flooding existing nodes with simultaneous 
> schema pull requests as can happen in large clusters.
> Removing the latch system should also prevent new nodes in large clusters 
> getting stuck for extended amounts of time as they wait 
> `migration_task_wait_in_seconds` on each of the latches left orphaned by the 
> timed out callbacks.
>  
> ||3.11||
> |[PoC|https://github.com/apache/cassandra/compare/cassandra-3.11...vincewhite:check_for_schema]|
> |[dtest|https://github.com/apache/cassandra-dtest/compare/master...vincewhite:wait_for_schema_agreement]|
>  
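
A rough, self-contained sketch of the agreement check described above (all names are illustrative; the real patch reads versions from gossip rather than a plain map):

{code:java}
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.TimeUnit;

public class SchemaAgreementSketch
{
    /**
     * Poll until every live peer reports our schema version, or time out.
     * In reality liveVersions would be a live view refreshed from gossip
     * (e.g. a ConcurrentHashMap updated by another thread).
     */
    static boolean awaitSchemaAgreement(Map<String, UUID> liveVersions, UUID local, long timeoutMillis)
            throws InterruptedException
    {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (System.nanoTime() < deadline)
        {
            Set<UUID> versions = Set.copyOf(liveVersions.values());
            if (versions.size() == 1 && versions.contains(local))
                return true;      // every live node agrees with us
            Thread.sleep(200);    // re-poll instead of tracking individual in-flight pulls
        }
        return false;
    }
}
{code}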



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-16213) Cannot replace_address /X because it doesn't exist in gossip

2020-11-13 Thread David Capwell (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231736#comment-17231736
 ] 

David Capwell edited comment on CASSANDRA-16213 at 11/13/20, 6:30 PM:
--

Found the issue: it was caused by CASSANDRA-15158, which creates a config value 
in milliseconds and calls a delay that takes milliseconds, but converts the 
millis as if they were seconds, causing a much longer delay than expected.

Once I fixed that I hit the next issue: we now block waiting on schema, which 
will fail since there is a downed node.

{code}
case SCHEMA:
    SystemKeyspace.updatePeerInfo(endpoint, "schema_version", UUID.fromString(value.value));
    MigrationCoordinator.instance.reportEndpointVersion(endpoint, UUID.fromString(value.value));
    break;
{code}

{code}
boolean schemasReceived = MigrationCoordinator.instance.awaitSchemaRequests(SCHEMA_DELAY_MILLIS);

if (schemasReceived)
    return;

logger.warn(String.format("There are nodes in the cluster with a different schema version than us we did not merged schemas from, " +
                          "our version : (%s), outstanding versions -> endpoints : %s",
                          Schema.instance.getVersion(),
                          MigrationCoordinator.instance.outstandingVersions()));

if (REQUIRE_SCHEMAS)
    throw new RuntimeException("Didn't receive schemas for all known versions within the timeout");
{code}

When we get the gossip info from the peers, it will include node2 (the node 
that crashed abruptly), and we will wait until we get its schema; but this 
won't happen since node2 is down and we are replacing it.

This looks unrelated to this patch, but it is also a bad condition, as any 
schema change with a downed node present will cause new nodes to fail to start up...
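
A tiny self-contained analogy of that blocking behaviour, with a latch standing in for the outstanding schema pull (not the actual Cassandra code):

{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class SchemaWaitSketch
{
    public static void main(String[] args) throws InterruptedException
    {
        // One outstanding schema version, known only from a peer that is down.
        CountDownLatch schemaReceived = new CountDownLatch(1);

        // The replacing node blocks here; the dead peer never responds,
        // so we always fall through to the failure branch after the timeout.
        if (!schemaReceived.await(30, TimeUnit.SECONDS))
            throw new RuntimeException("Didn't receive schemas for all known versions within the timeout");
    }
}
{code}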


was (Author: dcapwell):
Found the issue: it was caused by CASSANDRA-15158, which creates a config value 
in milliseconds and calls a delay that takes milliseconds, but converts the 
millis as if they were seconds, causing a much longer delay than expected.

Once I fixed that I hit the next issue: we now block waiting on schema, which 
will fail since there is a downed node.

{code}
case SCHEMA:
    SystemKeyspace.updatePeerInfo(endpoint, "schema_version", UUID.fromString(value.value));
    MigrationCoordinator.instance.reportEndpointVersion(endpoint, UUID.fromString(value.value));
    break;
{code}

When we get the gossip info from the peers, it will include node2 (the node 
that crashed abruptly), and we will wait until we get its schema; but this 
won't happen since node2 is down and we are replacing it.

This looks unrelated to this patch, but it is also a bad condition, as any 
schema change with a downed node present will cause new nodes to fail to start up...

> Cannot replace_address /X because it doesn't exist in gossip
> 
>
> Key: CASSANDRA-16213
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16213
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Gossip, Cluster/Membership
>Reporter: David Capwell
>Assignee: David Capwell
>Priority: Normal
> Fix For: 4.0-beta
>
>
> We see this exception around nodes crashing and trying to do a host 
> replacement; this error appears to be correlated around multiple node 
> failures.
> A simplified case to trigger this is the following
> *) Have a N node cluster
> *) Shutdown all N nodes
> *) Bring up N-1 nodes (at least 1 seed, else replace seed)
> *) Host replace the N-1th node -> this will fail with the above
> The reason this happens is that the N-1th node isn’t gossiping anymore, and 
> the existing nodes do not have its details in gossip (but have the details in 
> the peers table), so the host replacement fails as the node isn’t known in 
> gossip.
> This affects all versions (tested 3.0 and trunk, assume 2.2 as well)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16213) Cannot replace_address /X because it doesn't exist in gossip

2020-11-13 Thread David Capwell (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231736#comment-17231736
 ] 

David Capwell commented on CASSANDRA-16213:
---

Found the issue: it was caused by CASSANDRA-15158, which creates a config value 
in milliseconds and calls a delay that takes milliseconds, but converts the 
millis as if they were seconds, causing a much longer delay than expected.

Once I fixed that I hit the next issue: we now block waiting on schema, which 
will fail since there is a downed node.

{code}
case SCHEMA:
    SystemKeyspace.updatePeerInfo(endpoint, "schema_version", UUID.fromString(value.value));
    MigrationCoordinator.instance.reportEndpointVersion(endpoint, UUID.fromString(value.value));
    break;
{code}

When we get the gossip info from the peers, it will include node2 (the node 
that crashed abruptly), and we will wait until we get its schema; but this 
won't happen since node2 is down and we are replacing it.

This looks unrelated to this patch, but it is also a bad condition, as any 
schema change with a downed node present will cause new nodes to fail to start up...

> Cannot replace_address /X because it doesn't exist in gossip
> 
>
> Key: CASSANDRA-16213
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16213
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Gossip, Cluster/Membership
>Reporter: David Capwell
>Assignee: David Capwell
>Priority: Normal
> Fix For: 4.0-beta
>
>
> We see this exception around nodes crashing and trying to do a host 
> replacement; this error appears to be correlated around multiple node 
> failures.
> A simplified case to trigger this is the following
> *) Have a N node cluster
> *) Shutdown all N nodes
> *) Bring up N-1 nodes (at least 1 seed, else replace seed)
> *) Host replace the N-1th node -> this will fail with the above
> The reason this happens is that the N-1th node isn’t gossiping anymore, and 
> the existing nodes do not have its details in gossip (but have the details in 
> the peers table), so the host replacement fails as the node isn’t known in 
> gossip.
> This affects all versions (tested 3.0 and trunk, assume 2.2 as well)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta

2020-11-13 Thread Sam Tunnicliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231691#comment-17231691
 ] 

Sam Tunnicliffe commented on CASSANDRA-15299:
-

I found one dtest that broke when updating the driver to current master 
(e1fc528a0d), and opened CASSANDRA-16275 to fix it. 

> CASSANDRA-13304 follow-up: improve checksumming and compression in protocol 
> v5-beta
> ---
>
> Key: CASSANDRA-15299
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15299
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Messaging/Client
>Reporter: Aleksey Yeschenko
>Assignee: Sam Tunnicliffe
>Priority: Normal
>  Labels: protocolv5
> Fix For: 4.0-alpha
>
> Attachments: Process CQL Frame.png, V5 Flow Chart.png
>
>
> CASSANDRA-13304 made an important improvement to our native protocol: it 
> introduced checksumming/CRC32 to request and response bodies. It’s an 
> important step forward, but it doesn’t cover the entire stream. In 
> particular, the message header is not covered by a checksum or a crc, which 
> poses a correctness issue if, for example, {{streamId}} gets corrupted.
> Additionally, we aren’t quite using CRC32 correctly, in two ways:
> 1. We are calculating the CRC32 of the *decompressed* value instead of 
> computing the CRC32 on the bytes written on the wire - losing the properties 
> of the CRC32. In some cases, due to this sequencing, attempting to decompress 
> a corrupt stream can cause a segfault by LZ4.
> 2. When using CRC32, the CRC32 value is written in the incorrect byte order, 
> also losing some of the protections.
> See https://users.ece.cmu.edu/~koopman/pubs/KoopmanCRCWebinar9May2012.pdf for 
> explanation for the two points above.
> Separately, there are some long-standing issues with the protocol - since 
> *way* before CASSANDRA-13304. Importantly, both checksumming and compression 
> operate on individual message bodies rather than frames of multiple complete 
> messages. In reality, this has several important additional downsides. To 
> name a couple:
> # For compression, we are getting poor compression ratios for smaller 
> messages - when operating on tiny sequences of bytes. In reality, for most 
> small requests and responses we are discarding the compressed value as it’d 
> be smaller than the uncompressed one - incurring both redundant allocations 
> and compressions.
> # For checksumming and CRC32 we pay a high overhead price for small messages. 
> 4 bytes extra is *a lot* for an empty write response, for example.
> To address the correctness issue of {{streamId}} not being covered by the 
> checksum/CRC32 and the inefficiency in compression and checksumming/CRC32, we 
> should switch to a framing protocol with multiple messages in a single frame.
> I suggest we reuse the framing protocol recently implemented for internode 
> messaging in CASSANDRA-15066 to the extent that its logic can be borrowed, 
> and that we do it before native protocol v5 graduates from beta. See 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderCrc.java
>  and 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderLZ4.java.
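
To make the compute-the-CRC-on-the-wire-bytes point concrete, a small self-contained sketch; java.util.zip.Deflater stands in for LZ4 here so the example has no external dependencies:

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.Deflater;

public class CrcOrderSketch
{
    public static void main(String[] args)
    {
        byte[] payload = "hello hello hello hello hello".getBytes(StandardCharsets.UTF_8);

        // Compress first; these are the bytes that actually travel on the wire.
        Deflater deflater = new Deflater();
        deflater.setInput(payload);
        deflater.finish();
        byte[] wire = new byte[payload.length * 2];
        int wireLen = deflater.deflate(wire);

        // Correct: checksum the wire bytes, so a corrupt frame can be rejected
        // *before* it is ever handed to the decompressor.
        CRC32 wireCrc = new CRC32();
        wireCrc.update(wire, 0, wireLen);

        // Problematic: a checksum over the decompressed value can only be
        // verified *after* decompressing possibly-corrupt input.
        CRC32 payloadCrc = new CRC32();
        payloadCrc.update(payload, 0, payload.length);

        System.out.printf("wire crc=%08x, payload crc=%08x%n", wireCrc.getValue(), payloadCrc.getValue());
    }
}
{code}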



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16275) Update python driver used by cassandra-dtest

2020-11-13 Thread Sam Tunnicliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sam Tunnicliffe updated CASSANDRA-16275:

Change Category: Quality Assurance
 Complexity: Low Hanging Fruit
   Assignee: Sam Tunnicliffe
 Status: Open  (was: Triage Needed)

> Update python driver used by cassandra-dtest
> 
>
> Key: CASSANDRA-16275
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16275
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/dtest/python
>Reporter: Sam Tunnicliffe
>Assignee: Sam Tunnicliffe
>Priority: Normal
>
> In order to commit CASSANDRA-15299, the python driver used by the dtests 
> needs to include PYTHON-1258, support for V5 framing. 
> Updating the python driver's cassandra-test branch to latest trunk causes 1 
> additional dtest failure in 
> {{auth_test.py::TestAuth::test_handle_corrupt_role_data}} because the 
> {{ServerError}} response is now subject to the configured {{retry_policy}}. 
> This means the error ultimately returned from the driver is 
> {{NoHostAvailable}}, rather than {{ServerError}}. 
> I'll open a dtest pr to change the expectation in the test and we can commit 
> that when the cassandra-test branch is updated.
> cc [~aholmber] [~aboudreault]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16275) Update python driver used by cassandra-dtest

2020-11-13 Thread Sam Tunnicliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231689#comment-17231689
 ] 

Sam Tunnicliffe commented on CASSANDRA-16275:
-

https://github.com/apache/cassandra-dtest/pull/103

> Update python driver used by cassandra-dtest
> 
>
> Key: CASSANDRA-16275
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16275
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/dtest/python
>Reporter: Sam Tunnicliffe
>Priority: Normal
>
> In order to commit CASSANDRA-15299, the python driver used by the dtests 
> needs to include PYTHON-1258, support for V5 framing. 
> Updating the python driver's cassandra-test branch to latest trunk causes 1 
> additional dtest failure in 
> {{auth_test.py::TestAuth::test_handle_corrupt_role_data}} because the 
> {{ServerError}} response is now subject to the configured {{retry_policy}}. 
> This means the error ultimately returned from the driver is 
> {{NoHostAvailable}}, rather than {{ServerError}}. 
> I'll open a dtest pr to change the expectation in the test and we can commit 
> that when the cassandra-test branch is updated.
> cc [~aholmber] [~aboudreault]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[cassandra-dtest] branch 16275 deleted (was e3c4f69)

2020-11-13 Thread samt
This is an automated email from the ASF dual-hosted git repository.

samt pushed a change to branch 16275
in repository https://gitbox.apache.org/repos/asf/cassandra-dtest.git.


 was e3c4f69  Change expected response when role data is corrupt

This change permanently discards the following revisions:

 discard e3c4f69  Change expected response when role data is corrupt


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[cassandra-dtest] branch 16275 created (now e3c4f69)

2020-11-13 Thread samt
This is an automated email from the ASF dual-hosted git repository.

samt pushed a change to branch 16275
in repository https://gitbox.apache.org/repos/asf/cassandra-dtest.git.


  at e3c4f69  Change expected response when role data is corrupt

This branch includes the following new commits:

 new e3c4f69  Change expected response when role data is corrupt

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.



-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[cassandra-dtest] 01/01: Change expected response when role data is corrupt

2020-11-13 Thread samt
This is an automated email from the ASF dual-hosted git repository.

samt pushed a commit to branch 16275
in repository https://gitbox.apache.org/repos/asf/cassandra-dtest.git

commit e3c4f695abccb3ac37bf8c08d2fa304029372b50
Author: Sam Tunnicliffe 
AuthorDate: Fri Nov 13 17:36:22 2020 +

Change expected response when role data is corrupt
---
 auth_test.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/auth_test.py b/auth_test.py
index 1061b17..6d92ff5 100644
--- a/auth_test.py
+++ b/auth_test.py
@@ -216,7 +216,7 @@ class TestAuth(Tester):
 
         self.fixture_dtest_setup.ignore_log_patterns = list(self.fixture_dtest_setup.ignore_log_patterns) + [
             r'Invalid metadata has been detected for role bob']
-        assert_exception(session, "LIST USERS", "Invalid metadata has been detected for role", expected=(ServerError))
+        assert_exception(session, "LIST USERS", "Invalid metadata has been detected for role", expected=(NoHostAvailable))
 try:
 self.get_session(user='bob', password='12345')
 except NoHostAvailable as e:


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (CASSANDRA-16275) Update python driver used by cassandra-dtest

2020-11-13 Thread Sam Tunnicliffe (Jira)
Sam Tunnicliffe created CASSANDRA-16275:
---

 Summary: Update python driver used by cassandra-dtest
 Key: CASSANDRA-16275
 URL: https://issues.apache.org/jira/browse/CASSANDRA-16275
 Project: Cassandra
  Issue Type: Task
  Components: Test/dtest/python
Reporter: Sam Tunnicliffe


In order to commit CASSANDRA-15299, the python driver used by the dtests needs 
to include PYTHON-1258, support for V5 framing. 

Updating the python driver's cassandra-test branch to latest trunk causes 1 
additional dtest failure in 
{{auth_test.py::TestAuth::test_handle_corrupt_role_data}} because the 
{{ServerError}} response is now subject to the configured {{retry_policy}}. 
This means the error ultimately returned from the driver is 
{{NoHostAvailable}}, rather than {{ServerError}}. 

I'll open a dtest pr to change the expectation in the test and we can commit 
that when the cassandra-test branch is updated.

cc [~aholmber] [~aboudreault]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16191) Add tests for Repair metrics

2020-11-13 Thread Adam Holmberg (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Holmberg updated CASSANDRA-16191:
--
Component/s: (was: Test/dtest/java)
 Test/dtest/python

> Add tests for Repair metrics
> 
>
> Key: CASSANDRA-16191
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16191
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Test/dtest/python
>Reporter: Benjamin Lerer
>Assignee: Yasar Arafath Baigh
>Priority: Normal
> Fix For: 4.0-beta
>
>
> We do not seem to have any tests for the {{RepairMetrics.previewFailures}} 
> counter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16247) Fix flaky test testTPStats - org.apache.cassandra.tools.NodeToolGossipInfoTest

2020-11-13 Thread Brandon Williams (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-16247:
-
  Fix Version/s: 4.0-beta4
  Since Version: 4.0-beta2
Source Control Link: 
https://github.com/apache/cassandra/commit/94663c314a8a2c69a90cc64ac7e60344ba1c60ce
 Resolution: Fixed
 Status: Resolved  (was: Ready to Commit)

Committed w/rename nit.

> Fix flaky test testTPStats - org.apache.cassandra.tools.NodeToolGossipInfoTest
> --
>
> Key: CASSANDRA-16247
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16247
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/unit
>Reporter: David Capwell
>Assignee: Brandon Williams
>Priority: Normal
> Fix For: 4.0-beta, 4.0-beta4
>
>
> https://app.circleci.com/pipelines/github/dcapwell/cassandra/764/workflows/6d7a6adc-59d1-4f3c-baae-1f8329dca9b7/jobs/4363
> {code}
> junit.framework.AssertionFailedError
>   at 
> org.apache.cassandra.tools.NodeToolGossipInfoTest.testTPStats(NodeToolGossipInfoTest.java:128)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[cassandra] branch trunk updated: Relax < check to <= for NodeToolGossipInfoTest

2020-11-13 Thread brandonwilliams
This is an automated email from the ASF dual-hosted git repository.

brandonwilliams pushed a commit to branch trunk
in repository https://gitbox.apache.org/repos/asf/cassandra.git


The following commit(s) were added to refs/heads/trunk by this push:
 new 94663c3  Relax < check to <= for NodeToolGossipInfoTest
94663c3 is described below

commit 94663c314a8a2c69a90cc64ac7e60344ba1c60ce
Author: Brandon Williams 
AuthorDate: Thu Nov 12 13:45:21 2020 -0600

Relax < check to <= for NodeToolGossipInfoTest

Patch by brandonwilliams, reviewed by samt for CASSANDRA-16247
---
 test/unit/org/apache/cassandra/tools/NodeToolGossipInfoTest.java | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/test/unit/org/apache/cassandra/tools/NodeToolGossipInfoTest.java 
b/test/unit/org/apache/cassandra/tools/NodeToolGossipInfoTest.java
index e69f860..caca5ae 100644
--- a/test/unit/org/apache/cassandra/tools/NodeToolGossipInfoTest.java
+++ b/test/unit/org/apache/cassandra/tools/NodeToolGossipInfoTest.java
@@ -97,7 +97,7 @@ public class NodeToolGossipInfoTest extends CQLTester
 }
 
 @Test
-public void testTPStats() throws Throwable
+public void testGossipInfo() throws Throwable
 {
 ToolResult tool = ToolRunner.invokeNodetool("gossipinfo");
 Assertions.assertThat(tool.getStdout()).contains("/127.0.0.1");
@@ -125,6 +125,6 @@ public class NodeToolGossipInfoTest extends CQLTester
 assertTrue(tool.getCleanedStderr().isEmpty());
 assertEquals(0, tool.getExitCode());
         String newHeartbeatCount = StringUtils.substringBetween(tool.getStdout(), "heartbeat:", "\n");
-        assertTrue(Integer.parseInt(origHeartbeatCount) < Integer.parseInt(newHeartbeatCount));
+        assertTrue(Integer.parseInt(origHeartbeatCount) <= Integer.parseInt(newHeartbeatCount));
 }
 }


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16247) Fix flaky test testTPStats - org.apache.cassandra.tools.NodeToolGossipInfoTest

2020-11-13 Thread Brandon Williams (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-16247:
-
Status: Ready to Commit  (was: Review In Progress)

> Fix flaky test testTPStats - org.apache.cassandra.tools.NodeToolGossipInfoTest
> --
>
> Key: CASSANDRA-16247
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16247
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/unit
>Reporter: David Capwell
>Assignee: Brandon Williams
>Priority: Normal
> Fix For: 4.0-beta
>
>
> https://app.circleci.com/pipelines/github/dcapwell/cassandra/764/workflows/6d7a6adc-59d1-4f3c-baae-1f8329dca9b7/jobs/4363
> {code}
> junit.framework.AssertionFailedError
>   at 
> org.apache.cassandra.tools.NodeToolGossipInfoTest.testTPStats(NodeToolGossipInfoTest.java:128)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16247) Fix flaky test testTPStats - org.apache.cassandra.tools.NodeToolGossipInfoTest

2020-11-13 Thread Sam Tunnicliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231626#comment-17231626
 ] 

Sam Tunnicliffe commented on CASSANDRA-16247:
-

+1

> Fix flaky test testTPStats - org.apache.cassandra.tools.NodeToolGossipInfoTest
> --
>
> Key: CASSANDRA-16247
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16247
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/unit
>Reporter: David Capwell
>Assignee: Brandon Williams
>Priority: Normal
> Fix For: 4.0-beta
>
>
> https://app.circleci.com/pipelines/github/dcapwell/cassandra/764/workflows/6d7a6adc-59d1-4f3c-baae-1f8329dca9b7/jobs/4363
> {code}
> junit.framework.AssertionFailedError
>   at 
> org.apache.cassandra.tools.NodeToolGossipInfoTest.testTPStats(NodeToolGossipInfoTest.java:128)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16247) Fix flaky test testTPStats - org.apache.cassandra.tools.NodeToolGossipInfoTest

2020-11-13 Thread Brandon Williams (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-16247:
-
Reviewers: Sam Tunnicliffe  (was: Brandon Williams, Sam Tunnicliffe)

> Fix flaky test testTPStats - org.apache.cassandra.tools.NodeToolGossipInfoTest
> --
>
> Key: CASSANDRA-16247
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16247
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/unit
>Reporter: David Capwell
>Assignee: Brandon Williams
>Priority: Normal
> Fix For: 4.0-beta
>
>
> https://app.circleci.com/pipelines/github/dcapwell/cassandra/764/workflows/6d7a6adc-59d1-4f3c-baae-1f8329dca9b7/jobs/4363
> {code}
> junit.framework.AssertionFailedError
>   at 
> org.apache.cassandra.tools.NodeToolGossipInfoTest.testTPStats(NodeToolGossipInfoTest.java:128)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16247) Fix flaky test testTPStats - org.apache.cassandra.tools.NodeToolGossipInfoTest

2020-11-13 Thread Brandon Williams (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-16247:
-
Test and Documentation Plan: 
https://ci-cassandra.apache.org/job/Cassandra-devbranch/199/
 Status: Patch Available  (was: In Progress)

Patch to compare <= instead of <, since that more accurately reflects how the 
heartbeat increments relative to query timing.
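
A minimal stand-alone illustration of why the strict < comparison flakes; heartbeat() here is a once-per-second stand-in, not the gossip code:

{code:java}
public class HeartbeatCheckSketch
{
    // Stand-in for the gossip heartbeat: a counter that ticks once per second.
    static long heartbeat()
    {
        return System.currentTimeMillis() / 1000;
    }

    public static void main(String[] args)
    {
        long orig = heartbeat();
        long newer = heartbeat();

        // Strict '<' fails whenever both reads land inside the same tick;
        // '<=' only asserts that the counter never goes backwards.
        if (!(newer >= orig))
            throw new AssertionError(orig + " > " + newer);
    }
}
{code}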

> Fix flaky test testTPStats - org.apache.cassandra.tools.NodeToolGossipInfoTest
> --
>
> Key: CASSANDRA-16247
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16247
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/unit
>Reporter: David Capwell
>Assignee: Brandon Williams
>Priority: Normal
> Fix For: 4.0-beta
>
>
> https://app.circleci.com/pipelines/github/dcapwell/cassandra/764/workflows/6d7a6adc-59d1-4f3c-baae-1f8329dca9b7/jobs/4363
> {code}
> junit.framework.AssertionFailedError
>   at 
> org.apache.cassandra.tools.NodeToolGossipInfoTest.testTPStats(NodeToolGossipInfoTest.java:128)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16247) Fix flaky test testTPStats - org.apache.cassandra.tools.NodeToolGossipInfoTest

2020-11-13 Thread Brandon Williams (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-16247:
-
Reviewers: Sam Tunnicliffe, Brandon Williams  (was: Brandon Williams, Sam Tunnicliffe)
   Status: Review In Progress  (was: Patch Available)

> Fix flaky test testTPStats - org.apache.cassandra.tools.NodeToolGossipInfoTest
> --
>
> Key: CASSANDRA-16247
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16247
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/unit
>Reporter: David Capwell
>Assignee: Brandon Williams
>Priority: Normal
> Fix For: 4.0-beta
>
>
> https://app.circleci.com/pipelines/github/dcapwell/cassandra/764/workflows/6d7a6adc-59d1-4f3c-baae-1f8329dca9b7/jobs/4363
> {code}
> junit.framework.AssertionFailedError
>   at 
> org.apache.cassandra.tools.NodeToolGossipInfoTest.testTPStats(NodeToolGossipInfoTest.java:128)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16185) Add tests to cover CommitLog metrics

2020-11-13 Thread Yasar Arafath Baigh (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yasar Arafath Baigh updated CASSANDRA-16185:

Reviewers: Adam Holmberg
   Status: Review In Progress  (was: Patch Available)

> Add tests to cover CommitLog metrics
> 
>
> Key: CASSANDRA-16185
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16185
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Test/unit
>Reporter: Benjamin Lerer
>Assignee: Yasar Arafath Baigh
>Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: 0001-Unit-Test-cases-for-CommitLogMetrics.patch
>
>
> The only CommitLog metric that seems to be covered by a unit test is 
> {{oversizedMutations}}. We should add tests for the other ones.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16185) Add tests to cover CommitLog metrics

2020-11-13 Thread Yasar Arafath Baigh (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yasar Arafath Baigh updated CASSANDRA-16185:

Attachment: 0001-Unit-Test-cases-for-CommitLogMetrics.patch

> Add tests to cover CommitLog metrics
> 
>
> Key: CASSANDRA-16185
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16185
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Test/unit
>Reporter: Benjamin Lerer
>Assignee: Yasar Arafath Baigh
>Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: 0001-Unit-Test-cases-for-CommitLogMetrics.patch
>
>
> The only CommitLog metric that seems to be covered by a unit test is 
> {{oversizedMutations}}. We should add tests for the other ones.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16185) Add tests to cover CommitLog metrics

2020-11-13 Thread Yasar Arafath Baigh (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yasar Arafath Baigh updated CASSANDRA-16185:

Attachment: (was: 0001-Unit-Test-cases-for-CommitLogMetrics.patch)

> Add tests to cover CommitLog metrics
> 
>
> Key: CASSANDRA-16185
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16185
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Test/unit
>Reporter: Benjamin Lerer
>Assignee: Yasar Arafath Baigh
>Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: 0001-Unit-Test-cases-for-CommitLogMetrics.patch
>
>
> The only CommitLog metric that seems to be covered by a unit test is 
> {{oversizedMutations}}. We should add tests for the other ones.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16185) Add tests to cover CommitLog metrics

2020-11-13 Thread Yasar Arafath Baigh (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yasar Arafath Baigh updated CASSANDRA-16185:

Test and Documentation Plan: 
CommitLogMetrics Test-Cases

1. For the metrics below, test cases are added in +*CommitLogTest.java*+. In these test cases, mutations are applied, which internally update the metrics.
 * *completedTasks*
 * *totalCommitLogSize*
 * *waitingOnCommit*

2. *pendingTasks* - A testPendingTasks test case is added for +*AbstractCommitLogService.java*+, since the pendingTasks metric is incremented and decremented within the single method *AbstractCommitLogService::maybeWaitForSync*. A dummy method incrementPendingTasks is therefore introduced in the *FakeCommitLogService* class to update the pendingTasks metric manually.

3. *waitingOnSegmentAllocation* - A test case is added in +*CommitLogMetricsTest.java*+. The test changes *commitlog_segment_size_in_mb* to *1mb* and applies multiple mutations so that the waitingOnSegmentAllocation metric is updated while new segments are created. In the normal case waitingOnSegmentAllocation will be zero; if it was not updated during test execution, it is updated manually.

  was:
CommitLogMetrics Test-Cases

1. For the metrics below, test cases are added in +*CommitLogTest.java*+. In these test cases, mutations are applied, which internally update the metrics.
 * *completedTasks*
 * *totalCommitLogSize*
 * *waitingOnCommit*

2. *pendingTasks* - A testPendingTasks test case is added for +*AbstractCommitLogService.java*+, since the pendingTasks metric is incremented and decremented within the single method *AbstractCommitLogService::maybeWaitForSync*. A dummy method incrementPendingTasks is therefore introduced in the *FakeCommitLogService* class to update the pendingTasks metric manually.

3. *waitingOnSegmentAllocation* - A test case is added in *CommitLogMetricsTest.java*. The test changes *commitlog_segment_size_in_mb* to *1mb* and applies multiple mutations so that the waitingOnSegmentAllocation metric is updated while new segments are created. In the normal case waitingOnSegmentAllocation will be zero; if it was not updated during test execution, it is updated manually.
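
A minimal sketch of the shape these metric assertions take; the CQLTester-style helpers (createTable, execute) and the exact metric accesses are assumptions for illustration, not the attached patch:

{code:java}
// Hypothetical sketch, not the attached patch: assert a commit log metric
// moves after a mutation (CQLTester-style helpers assumed available).
@Test
public void testTotalCommitLogSizeGrows() throws Throwable
{
    long before = CommitLog.instance.metrics.totalCommitLogSize.getValue();

    createTable("CREATE TABLE %s (k int PRIMARY KEY, v int)");
    execute("INSERT INTO %s (k, v) VALUES (0, 0)");  // the mutation lands in the commit log

    assertTrue(CommitLog.instance.metrics.totalCommitLogSize.getValue() >= before);
}
{code}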


> Add tests to cover CommitLog metrics
> 
>
> Key: CASSANDRA-16185
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16185
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Test/unit
>Reporter: Benjamin Lerer
>Assignee: Yasar Arafath Baigh
>Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: 0001-Unit-Test-cases-for-CommitLogMetrics.patch
>
>
> The only CommitLog metric that seems to be covered by a unit test is 
> {{oversizedMutations}}. We should add tests for the other ones.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16185) Add tests to cover CommitLog metrics

2020-11-13 Thread Yasar Arafath Baigh (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yasar Arafath Baigh updated CASSANDRA-16185:

Test and Documentation Plan: 
CommitLogMetrics Test-Cases

1. For the metrics below, test cases are added in +*CommitLogTest.java*+. In these test cases, mutations are applied, which internally update the metrics.
 * *completedTasks*
 * *totalCommitLogSize*
 * *waitingOnCommit*

2. *pendingTasks* - A testPendingTasks test case is added for +*AbstractCommitLogService.java*+, since the pendingTasks metric is incremented and decremented within the single method *AbstractCommitLogService::maybeWaitForSync*. A dummy method incrementPendingTasks is therefore introduced in the *FakeCommitLogService* class to update the pendingTasks metric manually.

3. *waitingOnSegmentAllocation* - A test case is added in *CommitLogMetricsTest.java*. The test changes *commitlog_segment_size_in_mb* to *1mb* and applies multiple mutations so that the waitingOnSegmentAllocation metric is updated while new segments are created. In the normal case waitingOnSegmentAllocation will be zero; if it was not updated during test execution, it is updated manually.
 Status: Patch Available  (was: In Progress)

> Add tests to cover CommitLog metrics
> 
>
> Key: CASSANDRA-16185
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16185
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Test/unit
>Reporter: Benjamin Lerer
>Assignee: Yasar Arafath Baigh
>Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: 0001-Unit-Test-cases-for-CommitLogMetrics.patch
>
>
> The only CommitLog metric that seems to be covered by a unit test is 
> {{oversizedMutations}}. We should add tests for the other ones.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta

2020-11-13 Thread Sam Tunnicliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231559#comment-17231559
 ] 

Sam Tunnicliffe commented on CASSANDRA-15299:
-

[~aholmber] that's great, thanks! I'll make sure everything's passing against 
trunk with the latest driver and get back to you. 

> CASSANDRA-13304 follow-up: improve checksumming and compression in protocol 
> v5-beta
> ---
>
> Key: CASSANDRA-15299
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15299
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Messaging/Client
>Reporter: Aleksey Yeschenko
>Assignee: Sam Tunnicliffe
>Priority: Normal
>  Labels: protocolv5
> Fix For: 4.0-alpha
>
> Attachments: Process CQL Frame.png, V5 Flow Chart.png
>
>
> CASSANDRA-13304 made an important improvement to our native protocol: it 
> introduced checksumming/CRC32 to request and response bodies. It’s an 
> important step forward, but it doesn’t cover the entire stream. In 
> particular, the message header is not covered by a checksum or a crc, which 
> poses a correctness issue if, for example, {{streamId}} gets corrupted.
> Additionally, we aren’t quite using CRC32 correctly, in two ways:
> 1. We are calculating the CRC32 of the *decompressed* value instead of 
> computing the CRC32 on the bytes written on the wire - losing the properties 
> of the CRC32. In some cases, due to this sequencing, attempting to decompress 
> a corrupt stream can cause a segfault by LZ4.
> 2. When using CRC32, the CRC32 value is written in the incorrect byte order, 
> also losing some of the protections.
> See https://users.ece.cmu.edu/~koopman/pubs/KoopmanCRCWebinar9May2012.pdf for 
> explanation for the two points above.
> Separately, there are some long-standing issues with the protocol - since 
> *way* before CASSANDRA-13304. Importantly, both checksumming and compression 
> operate on individual message bodies rather than frames of multiple complete 
> messages. In reality, this has several important additional downsides. To 
> name a couple:
> # For compression, we are getting poor compression ratios for smaller 
> messages - when operating on tiny sequences of bytes. In reality, for most 
> small requests and responses we are discarding the compressed value as it’d 
> be smaller than the uncompressed one - incurring both redundant allocations 
> and compressions.
> # For checksumming and CRC32 we pay a high overhead price for small messages. 
> 4 bytes extra is *a lot* for an empty write response, for example.
> To address the correctness issue of {{streamId}} not being covered by the 
> checksum/CRC32 and the inefficiency in compression and checksumming/CRC32, we 
> should switch to a framing protocol with multiple messages in a single frame.
> I suggest we reuse the framing protocol recently implemented for internode 
> messaging in CASSANDRA-15066 to the extent that its logic can be borrowed, 
> and that we do it before native protocol v5 graduates from beta. See 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderCrc.java
>  and 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderLZ4.java.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta

2020-11-13 Thread Adam Holmberg (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231545#comment-17231545
 ] 

Adam Holmberg edited comment on CASSANDRA-15299 at 11/13/20, 2:57 PM:
--

{quote}This is the thing I'm unsure about - the python driver is not an asf 
project (yet), so it's not really up to me/us whether we update that branch ... 
TBH, I don't recall the full reasoning for using the cassandra-test branch in 
the first place ...{quote}

The {{cassandra-test}} branch was created explicitly to allow server-side 
testing of merged client-impacting features, independent of driver releases. It 
was born of a time when there were multiple client-impacting changes in flight 
so a single commit would not suffice. Normally we would coordinate a merge into 
that driver branch with a PR for the server, so CI could be tested with it.

Updating the branch should not be an issue. I, or [~aboudreault] can facilitate.


was (Author: aholmber):
{quote}This is the thing I'm unsure about - the python driver is not an asf 
project (yet), so it's not really up to me/us whether we update that branch ... 
TBH, I don't recall the full reasoning for using the cassandra-test branch in 
the first place ...{quote}

The {{cassandra-test}} branch was created explicitly to allow server-side 
testing of merged client-impacting features, independent of driver releases. It 
was born of a time when there were multiple client-impacting changes in-flight 
so a single commit would not suffice. Normally we would coordinate a merge into 
that driver branch with a PR for the server, so CI could be tested with it.

Updating the branch should not be an issue. I, or [~aboudreault] can facilitate.

> CASSANDRA-13304 follow-up: improve checksumming and compression in protocol 
> v5-beta
> ---
>
> Key: CASSANDRA-15299
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15299
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Messaging/Client
>Reporter: Aleksey Yeschenko
>Assignee: Sam Tunnicliffe
>Priority: Normal
>  Labels: protocolv5
> Fix For: 4.0-alpha
>
> Attachments: Process CQL Frame.png, V5 Flow Chart.png
>
>
> CASSANDRA-13304 made an important improvement to our native protocol: it 
> introduced checksumming/CRC32 to request and response bodies. It’s an 
> important step forward, but it doesn’t cover the entire stream. In 
> particular, the message header is not covered by a checksum or a crc, which 
> poses a correctness issue if, for example, {{streamId}} gets corrupted.
> Additionally, we aren’t quite using CRC32 correctly, in two ways:
> 1. We are calculating the CRC32 of the *decompressed* value instead of 
> computing the CRC32 on the bytes written on the wire - losing the properties 
> of the CRC32. In some cases, due to this sequencing, attempting to decompress 
> a corrupt stream can cause a segfault by LZ4.
> 2. When using CRC32, the CRC32 value is written in the incorrect byte order, 
> also losing some of the protections.
> See https://users.ece.cmu.edu/~koopman/pubs/KoopmanCRCWebinar9May2012.pdf for 
> explanation for the two points above.
> Separately, there are some long-standing issues with the protocol - since 
> *way* before CASSANDRA-13304. Importantly, both checksumming and compression 
> operate on individual message bodies rather than frames of multiple complete 
> messages. In reality, this has several important additional downsides. To 
> name a couple:
> # For compression, we are getting poor compression ratios for smaller 
> messages - when operating on tiny sequences of bytes. In reality, for most 
> small requests and responses we are discarding the compressed value as it’d 
> be smaller than the uncompressed one - incurring both redundant allocations 
> and compressions.
> # For checksumming and CRC32 we pay a high overhead price for small messages. 
> 4 bytes extra is *a lot* for an empty write response, for example.
> To address the correctness issue of {{streamId}} not being covered by the 
> checksum/CRC32 and the inefficiency in compression and checksumming/CRC32, we 
> should switch to a framing protocol with multiple messages in a single frame.
> I suggest we reuse the framing protocol recently implemented for internode 
> messaging in CASSANDRA-15066 to the extent that its logic can be borrowed, 
> and that we do it before native protocol v5 graduates from beta. See 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderCrc.java
>  and 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderLZ4.java.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta

2020-11-13 Thread Adam Holmberg (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231545#comment-17231545
 ] 

Adam Holmberg commented on CASSANDRA-15299:
---

{quote}This is the thing I'm unsure about - the python driver is not an asf 
project (yet), so it's not really up to me/us whether we update that branch ... 
TBH, I don't recall the full reasoning for using the cassandra-test branch in 
the first place ...{quote}

The {{cassandra-test}} branch was created explicitly to allow server-side 
testing of merged client-impacting features, independent of driver releases. It 
was born of a time when there were multiple client-impacting changes in-flight 
so a single commit would not suffice. Normally we would coordinate a merge into 
that driver branch with a PR for the server, so CI could be tested with it.

Updating the branch should not be an issue. I, or [~aboudreault] can facilitate.

> CASSANDRA-13304 follow-up: improve checksumming and compression in protocol 
> v5-beta
> ---
>
> Key: CASSANDRA-15299
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15299
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Messaging/Client
>Reporter: Aleksey Yeschenko
>Assignee: Sam Tunnicliffe
>Priority: Normal
>  Labels: protocolv5
> Fix For: 4.0-alpha
>
> Attachments: Process CQL Frame.png, V5 Flow Chart.png
>
>
> CASSANDRA-13304 made an important improvement to our native protocol: it 
> introduced checksumming/CRC32 to request and response bodies. It’s an 
> important step forward, but it doesn’t cover the entire stream. In 
> particular, the message header is not covered by a checksum or a crc, which 
> poses a correctness issue if, for example, {{streamId}} gets corrupted.
> Additionally, we aren’t quite using CRC32 correctly, in two ways:
> 1. We are calculating the CRC32 of the *decompressed* value instead of 
> computing the CRC32 on the bytes written on the wire - losing the properties 
> of the CRC32. In some cases, due to this sequencing, attempting to decompress 
> a corrupt stream can cause a segfault by LZ4.
> 2. When using CRC32, the CRC32 value is written in the incorrect byte order, 
> also losing some of the protections.
> See https://users.ece.cmu.edu/~koopman/pubs/KoopmanCRCWebinar9May2012.pdf for 
> explanation for the two points above.
> Separately, there are some long-standing issues with the protocol - since 
> *way* before CASSANDRA-13304. Importantly, both checksumming and compression 
> operate on individual message bodies rather than frames of multiple complete 
> messages. In reality, this has several important additional downsides. To 
> name a couple:
> # For compression, we are getting poor compression ratios for smaller 
> messages - when operating on tiny sequences of bytes. In reality, for most 
> small requests and responses we are discarding the compressed value as it’d 
> be smaller than the uncompressed one - incurring both redundant allocations 
> and compressions.
> # For checksumming and CRC32 we pay a high overhead price for small messages. 
> 4 bytes extra is *a lot* for an empty write response, for example.
> To address the correctness issue of {{streamId}} not being covered by the 
> checksum/CRC32 and the inefficiency in compression and checksumming/CRC32, we 
> should switch to a framing protocol with multiple messages in a single frame.
> I suggest we reuse the framing protocol recently implemented for internode 
> messaging in CASSANDRA-15066 to the extent that its logic can be borrowed, 
> and that we do it before native protocol v5 graduates from beta. See 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderCrc.java
>  and 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderLZ4.java.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta

2020-11-13 Thread Michael Semb Wever (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231525#comment-17231525
 ] 

Michael Semb Wever commented on CASSANDRA-15299:


bq. Absolutely, I published them myself to test, but I definitely think we 
should have them under /apache.

Done, ref: INFRA-21103. Let's see how that goes.

> CASSANDRA-13304 follow-up: improve checksumming and compression in protocol 
> v5-beta
> ---
>
> Key: CASSANDRA-15299
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15299
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Messaging/Client
>Reporter: Aleksey Yeschenko
>Assignee: Sam Tunnicliffe
>Priority: Normal
>  Labels: protocolv5
> Fix For: 4.0-alpha
>
> Attachments: Process CQL Frame.png, V5 Flow Chart.png
>
>
> CASSANDRA-13304 made an important improvement to our native protocol: it 
> introduced checksumming/CRC32 to request and response bodies. It’s an 
> important step forward, but it doesn’t cover the entire stream. In 
> particular, the message header is not covered by a checksum or a crc, which 
> poses a correctness issue if, for example, {{streamId}} gets corrupted.
> Additionally, we aren’t quite using CRC32 correctly, in two ways:
> 1. We are calculating the CRC32 of the *decompressed* value instead of 
> computing the CRC32 on the bytes written on the wire - losing the properties 
> of the CRC32. In some cases, due to this sequencing, attempting to decompress 
> a corrupt stream can cause a segfault by LZ4.
> 2. When using CRC32, the CRC32 value is written in the incorrect byte order, 
> also losing some of the protections.
> See https://users.ece.cmu.edu/~koopman/pubs/KoopmanCRCWebinar9May2012.pdf for 
> explanation for the two points above.
> Separately, there are some long-standing issues with the protocol - since 
> *way* before CASSANDRA-13304. Importantly, both checksumming and compression 
> operate on individual message bodies rather than frames of multiple complete 
> messages. In reality, this has several important additional downsides. To 
> name a couple:
> # For compression, we are getting poor compression ratios for smaller 
> messages - when operating on tiny sequences of bytes. In reality, for most 
> small requests and responses we are discarding the compressed value as it’d 
> be smaller than the uncompressed one - incurring both redundant allocations 
> and compressions.
> # For checksumming and CRC32 we pay a high overhead price for small messages. 
> 4 bytes extra is *a lot* for an empty write response, for example.
> To address the correctness issue of {{streamId}} not being covered by the 
> checksum/CRC32 and the inefficiency in compression and checksumming/CRC32, we 
> should switch to a framing protocol with multiple messages in a single frame.
> I suggest we reuse the framing protocol recently implemented for internode 
> messaging in CASSANDRA-15066 to the extent that its logic can be borrowed, 
> and that we do it before native protocol v5 graduates from beta. See 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderCrc.java
>  and 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderLZ4.java.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta

2020-11-13 Thread Sam Tunnicliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231489#comment-17231489
 ] 

Sam Tunnicliffe commented on CASSANDRA-15299:
-

{quote}If we're updating to use a new updated version of the driver, does that 
mean the cassandra-test branch is being sync'd up to master in the 
progress?{quote}

This is the thing I'm unsure about - the python driver is not an asf project 
(yet), so it's not really up to me/us whether we update that branch (of course 
I can open a PR to the driver to do that). Binding cassandra-dtest to a 
specific published commit is something wholly within our wheelhouse though, so 
that was the route I took. TBH, I don't recall the full reasoning for using the 
cassandra-test branch in the first place (backports I suppose).

{quote}Is it time to start deploying these images under apache/ ?
If agreed, I can open an infra ticket to set up deployment of docker 
images.{quote}

Absolutely, I published them myself to test, but I definitely think we should 
have them under /apache.

> CASSANDRA-13304 follow-up: improve checksumming and compression in protocol 
> v5-beta
> ---
>
> Key: CASSANDRA-15299
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15299
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Messaging/Client
>Reporter: Aleksey Yeschenko
>Assignee: Sam Tunnicliffe
>Priority: Normal
>  Labels: protocolv5
> Fix For: 4.0-alpha
>
> Attachments: Process CQL Frame.png, V5 Flow Chart.png
>
>
> CASSANDRA-13304 made an important improvement to our native protocol: it 
> introduced checksumming/CRC32 to request and response bodies. It’s an 
> important step forward, but it doesn’t cover the entire stream. In 
> particular, the message header is not covered by a checksum or a crc, which 
> poses a correctness issue if, for example, {{streamId}} gets corrupted.
> Additionally, we aren’t quite using CRC32 correctly, in two ways:
> 1. We are calculating the CRC32 of the *decompressed* value instead of 
> computing the CRC32 on the bytes written on the wire - losing the properties 
> of the CRC32. In some cases, due to this sequencing, attempting to decompress 
> a corrupt stream can cause a segfault by LZ4.
> 2. When using CRC32, the CRC32 value is written in the incorrect byte order, 
> also losing some of the protections.
> See https://users.ece.cmu.edu/~koopman/pubs/KoopmanCRCWebinar9May2012.pdf for 
> explanation for the two points above.
> Separately, there are some long-standing issues with the protocol - since 
> *way* before CASSANDRA-13304. Importantly, both checksumming and compression 
> operate on individual message bodies rather than frames of multiple complete 
> messages. In reality, this has several important additional downsides. To 
> name a couple:
> # For compression, we are getting poor compression ratios for smaller 
> messages - when operating on tiny sequences of bytes. In reality, for most 
> small requests and responses we are discarding the compressed value as it’d 
> be larger than the uncompressed one - incurring both redundant allocations 
> and compressions.
> # For checksumming and CRC32 we pay a high overhead price for small messages. 
> 4 bytes extra is *a lot* for an empty write response, for example.
> To address the correctness issue of {{streamId}} not being covered by the 
> checksum/CRC32 and the inefficiency in compression and checksumming/CRC32, we 
> should switch to a framing protocol with multiple messages in a single frame.
> I suggest we reuse the framing protocol recently implemented for internode 
> messaging in CASSANDRA-15066 to the extent that its logic can be borrowed, 
> and that we do it before native protocol v5 graduates from beta. See 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderCrc.java
>  and 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderLZ4.java.
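
For illustration of point 1 above, a minimal sketch (assuming the lz4-java 
library and the JDK's {{java.util.zip.CRC32}}; class and method names are 
hypothetical, not the committed implementation) of computing the CRC over the 
bytes actually written to the wire, i.e. after compression, so a corrupt frame 
is rejected before LZ4 ever attempts to decompress it:

{code:java}
import java.util.zip.CRC32;

import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;

// Hypothetical sketch: CRC the compressed (on-wire) payload, not the
// uncompressed body, so corruption is detected before decompression.
public final class WireChecksumSketch
{
    private static final LZ4Compressor COMPRESSOR = LZ4Factory.fastestInstance().fastCompressor();

    static int checksumOfWireBytes(byte[] uncompressedBody)
    {
        byte[] onWire = COMPRESSOR.compress(uncompressedBody); // the bytes actually transmitted
        CRC32 crc = new CRC32();
        crc.update(onWire, 0, onWire.length);                  // CRC computed after compression
        return (int) crc.getValue();
    }
}
{code}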



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16246) Unexpected warning "Ignoring Unrecognized strategy option" for NetworkTopologyStrategy when restarting

2020-11-13 Thread Sam Tunnicliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sam Tunnicliffe updated CASSANDRA-16246:

  Fix Version/s: (was: 4.0-beta)
 4.0-beta4
  Since Version: 3.0.0
Source Control Link: 
https://github.com/apache/cassandra/commit/fde640fe52704836ec21fedd62cae21290e099ec
 Resolution: Fixed
 Status: Resolved  (was: Ready to Commit)

Committed to trunk in {{fde640fe52704836ec21fedd62cae21290e099ec}}, thanks!

> Unexpected warning "Ignoring Unrecognized strategy option" for 
> NetworkTopologyStrategy when restarting
> --
>
> Key: CASSANDRA-16246
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16246
> Project: Cassandra
>  Issue Type: Bug
>  Components: Observability/Logging
>Reporter: Yifan Cai
>Assignee: Yifan Cai
>Priority: Normal
> Fix For: 4.0-beta4
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> During restart, a bunch of warning messages like 
> "AbstractReplicationStrategy.java:364 - Ignoring Unrecognized strategy option 
> {datacenter2} passed to NetworkTopologyStrategy for keyspace 
> distributed_test_keyspace" are logged. 
> The warnings are not expected since the mentioned DC exists. 
> It seems to be caused by an improper ordering during startup: when keyspaces 
> are opened, the node is not yet aware of the DCs. 
> The warning can be reproduced using the test below. 
> {code:java}
> @Test
> public void testEmitsWarningsForNetworkTopologyStategyConfigOnRestart() 
> throws Exception {
> int nodesPerDc = 2;
> try (Cluster cluster = builder().withConfig(c -> c.with(GOSSIP, NETWORK))
> .withRacks(2, 1, nodesPerDc)
> .start()) {
> cluster.schemaChange("CREATE KEYSPACE " + KEYSPACE +
>  " WITH replication = {'class': 
> 'NetworkTopologyStrategy', " +
>  "'datacenter1' : " + nodesPerDc + ", 
> 'datacenter2' : " + nodesPerDc + " };");
> cluster.get(2).nodetool("flush");
> System.out.println("Stop node 2 in datacenter 1");
> cluster.get(2).shutdown().get();
> System.out.println("Start node 2 in datacenter 1");
> cluster.get(2).startup();
> List<String> result = cluster.get(2).logs().grep("Ignoring 
> Unrecognized strategy option \\{datacenter2\\}").getResult();
> Assert.assertFalse(result.isEmpty());
> }
> }
> {code}
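
A minimal standalone sketch (hypothetical names, not the Cassandra source) of 
why the warning fires: NetworkTopologyStrategy-style option validation checks 
the configured DC names against the DCs the node currently knows about, and 
before this fix the ring state was loaded after keyspaces were opened, so the 
known-DC set was effectively empty:

{code:java}
import java.util.Map;
import java.util.Set;

public final class StrategyOptionCheckSketch
{
    // Warn for every configured DC that is not in the known set.
    static void warnOnUnknownDcs(Map<String, String> options, Set<String> knownDcs)
    {
        for (String dc : options.keySet())
            if (!knownDcs.contains(dc))
                System.out.printf("Ignoring Unrecognized strategy option {%s}%n", dc);
    }

    public static void main(String[] args)
    {
        // Ring state not yet loaded: no DCs are known, so both valid options warn.
        warnOnUnknownDcs(Map.of("datacenter1", "2", "datacenter2", "2"), Set.of());
    }
}
{code}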



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16246) Unexpected warning "Ignoring Unrecognized strategy option" for NetworkTopologyStrategy when restarting

2020-11-13 Thread Sam Tunnicliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sam Tunnicliffe updated CASSANDRA-16246:

Status: Ready to Commit  (was: Changes Suggested)

Thanks, LGTM

> Unexpected warning "Ignoring Unrecognized strategy option" for 
> NetworkTopologyStrategy when restarting
> --
>
> Key: CASSANDRA-16246
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16246
> Project: Cassandra
>  Issue Type: Bug
>  Components: Observability/Logging
>Reporter: Yifan Cai
>Assignee: Yifan Cai
>Priority: Normal
> Fix For: 4.0-beta
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> During restart, a bunch of warning messages like 
> "AbstractReplicationStrategy.java:364 - Ignoring Unrecognized strategy option 
> {datacenter2} passed to NetworkTopologyStrategy for keyspace 
> distributed_test_keyspace" are logged. 
> The warnings are not expected since the mentioned DC exists. 
> It seems to be caused by an improper ordering during startup: when keyspaces 
> are opened, the node is not yet aware of the DCs. 
> The warning can be reproduced using the test below. 
> {code:java}
> @Test
> public void testEmitsWarningsForNetworkTopologyStategyConfigOnRestart() 
> throws Exception {
> int nodesPerDc = 2;
> try (Cluster cluster = builder().withConfig(c -> c.with(GOSSIP, NETWORK))
> .withRacks(2, 1, nodesPerDc)
> .start()) {
> cluster.schemaChange("CREATE KEYSPACE " + KEYSPACE +
>  " WITH replication = {'class': 
> 'NetworkTopologyStrategy', " +
>  "'datacenter1' : " + nodesPerDc + ", 
> 'datacenter2' : " + nodesPerDc + " };");
> cluster.get(2).nodetool("flush");
> System.out.println("Stop node 2 in datacenter 1");
> cluster.get(2).shutdown().get();
> System.out.println("Start node 2 in datacenter 1");
> cluster.get(2).startup();
> List<String> result = cluster.get(2).logs().grep("Ignoring 
> Unrecognized strategy option \\{datacenter2\\}").getResult();
> Assert.assertFalse(result.isEmpty());
> }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[cassandra] branch trunk updated: Add saved Host IDs to TokenMetadata during startup

2020-11-13 Thread samt
This is an automated email from the ASF dual-hosted git repository.

samt pushed a commit to branch trunk
in repository https://gitbox.apache.org/repos/asf/cassandra.git


The following commit(s) were added to refs/heads/trunk by this push:
 new fde640f  Add saved Host IDs to TokenMetadata during startup
fde640f is described below

commit fde640fe52704836ec21fedd62cae21290e099ec
Author: yifan-c 
AuthorDate: Thu Nov 5 17:54:11 2020 -0800

Add saved Host IDs to TokenMetadata during startup

Patch by Yifan Cai; reviewed by Sam Tunnicliffe for CASSANDRA-16246
---
 CHANGES.txt|  1 +
 .../apache/cassandra/service/StorageService.java   | 73 ++
 .../cassandra/distributed/impl/Instance.java   |  9 +++
 .../distributed/test/NetworkTopologyTest.java  | 26 +++-
 4 files changed, 68 insertions(+), 41 deletions(-)

diff --git a/CHANGES.txt b/CHANGES.txt
index cb4d5bc..cbcc091 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,4 +1,5 @@
 4.0-beta4
+ * Add saved Host IDs to TokenMetadata at startup (CASSANDRA-16246)
  * Ensure that CacheMetrics.requests is picked up by the metric reporter 
(CASSANDRA-16228)
  * Add a ratelimiter to snapshot creation and deletion (CASSANDRA-13019)
 * Produce consistent tombstone for reads to avoid digest mismatch 
(CASSANDRA-15369)
diff --git a/src/java/org/apache/cassandra/service/StorageService.java 
b/src/java/org/apache/cassandra/service/StorageService.java
index 4a3477c..3201d80 100644
--- a/src/java/org/apache/cassandra/service/StorageService.java
+++ b/src/java/org/apache/cassandra/service/StorageService.java
@@ -637,21 +637,6 @@ public class StorageService extends 
NotificationBroadcasterSupport implements IE
 MessagingService.instance().listen();
 }
 
-public void populateTokenMetadata()
-{
-if 
(Boolean.parseBoolean(System.getProperty("cassandra.load_ring_state", "true")))
-{
-logger.info("Populating token metadata from system tables");
-Multimap<InetAddressAndPort, Token> loadedTokens = 
SystemKeyspace.loadTokens();
-if (!shouldBootstrap()) // if we have not completed bootstrapping, 
we should not add ourselves as a normal token
-loadedTokens.putAll(FBUtilities.getBroadcastAddressAndPort(), 
SystemKeyspace.getSavedTokens());
-for (InetAddressAndPort ep : loadedTokens.keySet())
-tokenMetadata.updateNormalTokens(loadedTokens.get(ep), ep);
-
-logger.info("Token metadata: {}", tokenMetadata);
-}
-}
-
 public synchronized void initServer() throws ConfigurationException
 {
 initServer(RING_DELAY);
@@ -676,6 +661,14 @@ public class StorageService extends 
NotificationBroadcasterSupport implements IE
 throw new AssertionError(e);
 }
 
+if 
(Boolean.parseBoolean(System.getProperty("cassandra.load_ring_state", "true")))
+{
+logger.info("Loading persisted ring state");
+populatePeerTokenMetadata();
+for (InetAddressAndPort endpoint : tokenMetadata.getAllEndpoints())
+Gossiper.runInGossipStageBlocking(() -> 
Gossiper.instance.addSavedEndpoint(endpoint));
+}
+
 // daemon threads, like our executors', continue to run while shutdown 
hooks are invoked
 drainOnShutdown = NamedThreadFactory.createThread(new WrappedRunnable()
 {
@@ -697,8 +690,6 @@ public class StorageService extends 
NotificationBroadcasterSupport implements IE
 if (!Boolean.parseBoolean(System.getProperty("cassandra.start_gossip", 
"true")))
 {
 logger.info("Not starting gossip as requested.");
-// load ring state in preparation for starting gossip later
-loadRingState();
 initialized = true;
 return;
 }
@@ -740,27 +731,34 @@ public class StorageService extends 
NotificationBroadcasterSupport implements IE
 initialized = true;
 }
 
-private void loadRingState()
+public void populateTokenMetadata()
 {
 if 
(Boolean.parseBoolean(System.getProperty("cassandra.load_ring_state", "true")))
 {
-logger.info("Loading persisted ring state");
-Multimap<InetAddressAndPort, Token> loadedTokens = 
SystemKeyspace.loadTokens();
-Map<InetAddressAndPort, UUID> loadedHostIds = 
SystemKeyspace.loadHostIds();
-for (InetAddressAndPort ep : loadedTokens.keySet())
-{
-if (ep.equals(FBUtilities.getBroadcastAddressAndPort()))
-{
-// entry has been mistakenly added, delete it
-SystemKeyspace.removeEndpoint(ep);
-}
-else
-{
-if (loadedHostIds.containsKey(ep))
-tokenMetadata.updateHostId(loadedHostIds.get(ep), ep);
-Gossiper.runInGossipStageBlocking(() -> 
Gossiper.instance.addSavedEndpoint(ep));
-}

[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta

2020-11-13 Thread Michael Semb Wever (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231462#comment-17231462
 ] 

Michael Semb Wever edited comment on CASSANDRA-15299 at 11/13/20, 1:48 PM:
---

bq. One thing that's a bit concerning is that the cassandra-test branch of the 
driver, which is what dtests are currently using, is currently 693 commits 
behind the master branch.

If we're updating to use a new version of the driver, does that mean 
the {{cassandra-test}} branch is being sync'd up to master in the process? 

{quote}Docker images:
- beobal/cassandra-testing-ubuntu1910-java11:2020
- beobal/cassandra-testing-ubuntu1910-java11-w-dependencies:2020{quote}

Is it time to start deploying these images under 
[{{apache/}}|https://hub.docker.com/u/apache] ?
If agreed, I can open an infra ticket to set up deployment of docker images.

bq. I'll open PRs to cassandra-builds and cassandra-dtest before going any 
further here.

Go for it! :-)



was (Author: michaelsembwever):
bq. One thing that's a bit concerning is that the cassandra-test branch of the 
driver, which is what dtests are currently using, is currently 693 commits 
behind the master branch.

If we're updating to use a new updated version of the driver, does that mean 
the {{cassandra-test}} branch being sync'd up to master in the progress? 

{quote}Docker images:
- beobal/cassandra-testing-ubuntu1910-java11:2020
- beobal/cassandra-testing-ubuntu1910-java11-w-dependencies:2020{quote}

Is it time to start deploying these images under 
[{{apache/}}|https://hub.docker.com/u/apache] ?
If agreed, I can open an infra ticket to set up deployment of docker images.

bq. I'll open PRs to cassandra-builds and cassandra-dtest before going any 
further here.

Go for it! :-)


> CASSANDRA-13304 follow-up: improve checksumming and compression in protocol 
> v5-beta
> ---
>
> Key: CASSANDRA-15299
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15299
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Messaging/Client
>Reporter: Aleksey Yeschenko
>Assignee: Sam Tunnicliffe
>Priority: Normal
>  Labels: protocolv5
> Fix For: 4.0-alpha
>
> Attachments: Process CQL Frame.png, V5 Flow Chart.png
>
>
> CASSANDRA-13304 made an important improvement to our native protocol: it 
> introduced checksumming/CRC32 to request and response bodies. It’s an 
> important step forward, but it doesn’t cover the entire stream. In 
> particular, the message header is not covered by a checksum or a crc, which 
> poses a correctness issue if, for example, {{streamId}} gets corrupted.
> Additionally, we aren’t quite using CRC32 correctly, in two ways:
> 1. We are calculating the CRC32 of the *decompressed* value instead of 
> computing the CRC32 on the bytes written on the wire - losing the properties 
> of the CRC32. In some cases, due to this sequencing, attempting to decompress 
> a corrupt stream can cause a segfault by LZ4.
> 2. When using CRC32, the CRC32 value is written in the incorrect byte order, 
> also losing some of the protections.
> See https://users.ece.cmu.edu/~koopman/pubs/KoopmanCRCWebinar9May2012.pdf for 
> explanation for the two points above.
> Separately, there are some long-standing issues with the protocol - since 
> *way* before CASSANDRA-13304. Importantly, both checksumming and compression 
> operate on individual message bodies rather than frames of multiple complete 
> messages. In reality, this has several important additional downsides. To 
> name a couple:
> # For compression, we are getting poor compression ratios for smaller 
> messages - when operating on tiny sequences of bytes. In reality, for most 
> small requests and responses we are discarding the compressed value as it’d 
> be larger than the uncompressed one - incurring both redundant allocations 
> and compressions.
> # For checksumming and CRC32 we pay a high overhead price for small messages. 
> 4 bytes extra is *a lot* for an empty write response, for example.
> To address the correctness issue of {{streamId}} not being covered by the 
> checksum/CRC32 and the inefficiency in compression and checksumming/CRC32, we 
> should switch to a framing protocol with multiple messages in a single frame.
> I suggest we reuse the framing protocol recently implemented for internode 
> messaging in CASSANDRA-15066 to the extent that its logic can be borrowed, 
> and that we do it before native protocol v5 graduates from beta. See 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderCrc.java
>  and 
> 

[jira] [Commented] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta

2020-11-13 Thread Michael Semb Wever (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231462#comment-17231462
 ] 

Michael Semb Wever commented on CASSANDRA-15299:


bq. One thing that's a bit concerning is that the cassandra-test branch of the 
driver, which is what dtests are currently using, is currently 693 commits 
behind the master branch.

If we're updating to use a new version of the driver, does that mean 
the {{cassandra-test}} branch is being sync'd up to master in the process? 

{quote}Docker images:
- beobal/cassandra-testing-ubuntu1910-java11:2020
- beobal/cassandra-testing-ubuntu1910-java11-w-dependencies:2020{quote}

Is it time to start deploying these images under 
[{{apache/}}|https://hub.docker.com/u/apache] ?
If agreed, I can open an infra ticket to set up deployment of docker images.

bq. I'll open PRs to cassandra-builds and cassandra-dtest before going any 
further here.

Go for it! :-)


> CASSANDRA-13304 follow-up: improve checksumming and compression in protocol 
> v5-beta
> ---
>
> Key: CASSANDRA-15299
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15299
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Messaging/Client
>Reporter: Aleksey Yeschenko
>Assignee: Sam Tunnicliffe
>Priority: Normal
>  Labels: protocolv5
> Fix For: 4.0-alpha
>
> Attachments: Process CQL Frame.png, V5 Flow Chart.png
>
>
> CASSANDRA-13304 made an important improvement to our native protocol: it 
> introduced checksumming/CRC32 to request and response bodies. It’s an 
> important step forward, but it doesn’t cover the entire stream. In 
> particular, the message header is not covered by a checksum or a crc, which 
> poses a correctness issue if, for example, {{streamId}} gets corrupted.
> Additionally, we aren’t quite using CRC32 correctly, in two ways:
> 1. We are calculating the CRC32 of the *decompressed* value instead of 
> computing the CRC32 on the bytes written on the wire - losing the properties 
> of the CRC32. In some cases, due to this sequencing, attempting to decompress 
> a corrupt stream can cause a segfault by LZ4.
> 2. When using CRC32, the CRC32 value is written in the incorrect byte order, 
> also losing some of the protections.
> See https://users.ece.cmu.edu/~koopman/pubs/KoopmanCRCWebinar9May2012.pdf for 
> explanation for the two points above.
> Separately, there are some long-standing issues with the protocol - since 
> *way* before CASSANDRA-13304. Importantly, both checksumming and compression 
> operate on individual message bodies rather than frames of multiple complete 
> messages. In reality, this has several important additional downsides. To 
> name a couple:
> # For compression, we are getting poor compression ratios for smaller 
> messages - when operating on tiny sequences of bytes. In reality, for most 
> small requests and responses we are discarding the compressed value as it’d 
> be larger than the uncompressed one - incurring both redundant allocations 
> and compressions.
> # For checksumming and CRC32 we pay a high overhead price for small messages. 
> 4 bytes extra is *a lot* for an empty write response, for example.
> To address the correctness issue of {{streamId}} not being covered by the 
> checksum/CRC32 and the inefficiency in compression and checksumming/CRC32, we 
> should switch to a framing protocol with multiple messages in a single frame.
> I suggest we reuse the framing protocol recently implemented for internode 
> messaging in CASSANDRA-15066 to the extent that its logic can be borrowed, 
> and that we do it before native protocol v5 graduates from beta. See 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderCrc.java
>  and 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderLZ4.java.
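
As a rough sketch of the framing direction suggested above (loosely modeled on 
the internode framing from CASSANDRA-15066; the real header layout differs and 
every name here is hypothetical), the point is that all header fields, 
including anything like {{streamId}}, are covered by their own checksum, so 
header corruption is caught before the payload is interpreted:

{code:java}
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.CRC32;

// Hypothetical frame header: 4 bytes length, 4 bytes flags, then a CRC32
// computed over the preceding 8 header bytes.
public final class FramedHeaderSketch
{
    static ByteBuffer encodeHeader(int payloadLength, int flags)
    {
        ByteBuffer header = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN);
        header.putInt(payloadLength).putInt(flags);
        CRC32 crc = new CRC32();
        crc.update(header.array(), 0, 8); // the checksum covers every header field
        header.putInt((int) crc.getValue());
        header.flip();
        return header;
    }
}
{code}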



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-14477) The check of num_tokens against the length of inital_token in the yaml triggers unexpectedly

2020-11-13 Thread Michael Semb Wever (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Semb Wever updated CASSANDRA-14477:
---
Status: Changes Suggested  (was: Review In Progress)

> The check of num_tokens against the length of inital_token in the yaml 
> triggers unexpectedly
> 
>
> Key: CASSANDRA-14477
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14477
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Config
>Reporter: Vincent White
>Assignee: Stefan Miklosovic
>Priority: Low
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> In CASSANDRA-10120 we added a check that compares num_tokens against the 
> number of tokens supplied in the yaml via initial_token. From my reading of 
> CASSANDRA-10120, it was meant to prevent Cassandra from starting if the yaml 
> contained contradictory values for num_tokens and initial_token, which should 
> help prevent misconfiguration via human error. The current behaviour appears 
> to differ slightly in that it performs this comparison regardless of whether 
> num_tokens is included in the yaml or not. Below are proposed patches to only 
> perform the check if both options are present in the yaml (a condensed sketch 
> of the guard follows the branch table).
> ||Branch||
> |[3.0.x|https://github.com/apache/cassandra/compare/cassandra-3.0...vincewhite:num_tokens_30]|
> |[3.x|https://github.com/apache/cassandra/compare/cassandra-3.11...vincewhite:num_tokens_test_1_311]|
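
A condensed sketch of the proposed guard (hypothetical method and class names, 
not the actual patches; see the branches above for the real changes): the 
cross-check only runs when both options are actually present in the yaml:

{code:java}
public final class TokenConfigCheckSketch
{
    // numTokens is null when num_tokens is absent from the yaml;
    // initialToken is null when initial_token is absent.
    static void validateTokenConfig(Integer numTokens, String initialToken)
    {
        if (numTokens == null || initialToken == null)
            return; // at most one option supplied: nothing to cross-check

        int supplied = initialToken.split(",").length;
        if (numTokens != supplied)
            throw new IllegalArgumentException("The number of initial tokens (" + supplied
                                               + ") must match num_tokens (" + numTokens + ")");
    }
}
{code}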



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16201) Reduce amount of allocations during batch statement execution

2020-11-13 Thread Michael Semb Wever (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231443#comment-17231443
 ] 

Michael Semb Wever commented on CASSANDRA-16201:


Out of the ticket scope… why are the microbench classes all in the 
{{org.apache.cassandra.test.microbench}} package? They are already kept 
separate under {{src/testmicrobench/}}, and by re-packaging them like this, 
accessed methods (eg {{bs.getMutations(..)}}) have to be made public instead 
of package-protected. It would be nice to keep methods package-protected where 
possible.

AFAIK we also don't run the microbench classes in CI anywhere, so there's no 
guarantee they remain runnable over time. I could add them to the ci-cassandra 
pipeline, though ideally a dedicated bare-metal server would be needed to make 
[use|https://plugins.jenkins.io/jmh-report/] of the runtime 
[reports|https://www.jenkins.io/blog/2019/06/21/performance-testing-jenkins/].

+1 on all branch patches (including [~yifanc] review comments above).

> Reduce amount of allocations during batch statement execution
> -
>
> Key: CASSANDRA-16201
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16201
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Other
>Reporter: Thomas Steinmaurer
>Assignee: Marcus Eriksson
>Priority: Normal
> Fix For: 3.0.x, 3.11.x, 4.0-beta
>
> Attachments: 16201_jfr_3023_alloc.png, 16201_jfr_3023_obj.png, 
> 16201_jfr_3118_alloc.png, 16201_jfr_3118_obj.png, 16201_jfr_40b3_alloc.png, 
> 16201_jfr_40b3_obj.png, screenshot-1.png, screenshot-2.png, screenshot-3.png, 
> screenshot-4.png
>
>
> In a Cas 2.1 / 3.0 / 3.11 / 4.0b2 comparison test with the same load profile, 
> we see 4.0b2 going OOM from time to time. According to a heap dump, we have 
> multiple NTR threads each retaining memory in a 3-digit MB range.
> This is likely related to object array pre-allocations at the size of 
> {{BatchUpdatesCollector.updatedRows}} per {{BTree}}, although there is always 
> only 1 {{BTreeRow}} in the {{BTree}}.
>  !screenshot-1.png|width=100%! 
> So it seems we have many, many pre-allocated 20K-element object arrays, each 
> with a shallow heap of 80K (20,000 references at 4 bytes each with compressed 
> oops), although there is only one element in the array.
> This sort of pre-allocation is causing a lot of memory pressure.
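
A tiny sketch of the allocation pattern being described (hypothetical names; 
the actual fix is in the linked patches): sizing a per-partition buffer with a 
batch-wide estimate wastes almost the entire array when each partition only 
ever receives one row:

{code:java}
import java.util.ArrayList;
import java.util.List;

public final class BatchPreallocSketch
{
    // Before: pre-size with the batch-wide row count, e.g. 20,000 slots
    // (~80K shallow with 4-byte compressed references) for a single element.
    static <T> List<T> oversizedBuffer(int batchWideRowCount)
    {
        return new ArrayList<>(batchWideRowCount);
    }

    // After: size with the per-partition expectation and let it grow if needed.
    static <T> List<T> rightSizedBuffer(int perPartitionRowCount)
    {
        return new ArrayList<>(Math.max(1, perPartitionRowCount));
    }
}
{code}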



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16271) Writes timeout instead of failing on cluster with CL-1 replicas available during replace

2020-11-13 Thread Krishna Vadali (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231424#comment-17231424
 ] 

Krishna Vadali commented on CASSANDRA-16271:


My bad, I forgot to attach the diff; it's added now.

> Writes timeout instead of failing on cluster with CL-1 replicas available 
> during replace
> 
>
> Key: CASSANDRA-16271
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16271
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Coordination
>Reporter: Krishna Vadali
>Assignee: Sam Tunnicliffe
>Priority: Normal
> Attachments: sleep_before_replace.diff
>
>
> Writes time out instead of failing on a cluster with CL-1 replicas available 
> during a replace-node operation.
> With Consistency Level ALL, we are observing Timeout exceptions during writes 
> when (RF - 1) nodes are available in the cluster with one replace-node 
> operation running. The coordinator expects RF + 1 responses, while only RF 
> nodes (RF-1 in UN and 1 in UJ) are available in the cluster, hence the 
> timeout.
> The same problem happens on a keyspace with RF=1, CL=ONE and one replica 
> being replaced, and likewise with RF=3, CL=QUORUM, one replica down and 
> another being replaced.
> I believe the expected behavior is that the write should fail with 
> UnavailableException since there are not enough NORMAL replicas to fulfill 
> the request.
> h4. *Steps to reproduce:*
> Run a 3 node test cluster (call the nodes node1 (127.0.0.1), node2 
> (127.0.0.2), node3 (127.0.0.3)):
> {code:java}
>  ccm create test -v 3.11.3 -n 3 -s
> {code}
> Create test keyspaces with RF = 3 and RF = 1 respectively:
> {code:java}
>  create keyspace rf3 with replication = \{'class': 'SimpleStrategy', 
> 'replication_factor': 3};
>  create keyspace rf1 with replication = \{'class': 'SimpleStrategy', 
> 'replication_factor': 1};
> {code}
> Create a table test in both the keyspaces:
> {code:java}
> create table rf3.test ( pk int primary KEY, value int);
> create table rf1.test ( pk int primary KEY, value int);
> {code}
> Stop node node2:
> {code:java}
> ccm node2 stop
> {code}
> Create node node4:
> {code:java}
> ccm add node4 -i 127.0.0.4
> {code}
> Enable auto_bootstrap
> {code:java}
> ccm node4 updateconf 'auto_bootstrap: true'
> {code}
> Ensure node4 does not have itself in its seeds list.
> Run a replace node to replace node2 (address 127.0.0.2 corresponds to node 
> node2)
> {code:java}
> ccm node4 start --jvm_arg="-Dcassandra.replace_address=127.0.0.2"
> {code}
> While the replacement is running, perform writes/reads with CONSISTENCY ALL; 
> we observed TimeoutExceptions.
> {code:java}
> SET CONSISTENCY ALL: 
> cqlsh> insert into rf3.test (pk, value) values (16, 7);       
> WriteTimeout: Error from server: code=1100 [Coordinator node timed out 
> waiting for replica nodes' responses] message="Operation timed out - received 
> only 3 responses." info=\{'received_responses': 3, 'required_responses': 4, 
> 'consistency': 'ALL'}{code}
> {code:java}
> cqlsh> CONSISTENCY ONE; 
> cqlsh> insert into rf1.test (pk, value) VALUES(5, 1); 
> WriteTimeout: Error from server: code=1100 [Coordinator node timed out 
> waiting for replica nodes' responses] message="Operation timed out - received 
> only 1 responses." info=\{'received_responses': 1, 'required_responses': 2, 
> 'consistency': 'ONE'} 
> {code}
> Cluster State:
> {code:java}
> Datacenter: datacenter1
> =======================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load        Tokens  Owns (effective)  Host ID                               Rack
> UN  127.0.0.1  70.45 KiB   1       100.0%            4f652b22-045b-493b-8722-fb5f7e1723ce  rack1
> UN  127.0.0.3  70.43 KiB   1       100.0%            a0dcd677-bdb3-4947-b9a7-14f3686a709f  rack1
> UJ  127.0.0.4  137.47 KiB  1       ?                 e3d794f1-081e-4aba-94f2-31950c713846  rack1
> {code}
> Note: 
>  We introduced a sleep during the replace operation in order to carry out 
> our experiments. The attached code diff does this.
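
A schematic of why this times out instead of failing fast (hypothetical 
method, not the coordinator's actual code): the pending (joining/replacing) 
replica is added to the number of acks the coordinator blocks for, while the 
availability check that would throw UnavailableException only considers live 
replicas:

{code:java}
public final class BlockForSketch
{
    // CL=ALL with RF=3 requires 3 acks; one replacing node adds a pending
    // replica, so the coordinator blocks for 4 acks while at most 3 replicas
    // (2 live + the joining node) exist to answer: it can only ever time out.
    static int blockFor(int requiredByConsistencyLevel, int pendingReplicas)
    {
        return requiredByConsistencyLevel + pendingReplicas;
    }
}
{code}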



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16271) Writes timeout instead of failing on cluster with CL-1 replicas available during replace

2020-11-13 Thread Krishna Vadali (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krishna Vadali updated CASSANDRA-16271:
---
Attachment: sleep_before_replace.diff

> Writes timeout instead of failing on cluster with CL-1 replicas available 
> during replace
> 
>
> Key: CASSANDRA-16271
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16271
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Coordination
>Reporter: Krishna Vadali
>Assignee: Sam Tunnicliffe
>Priority: Normal
> Attachments: sleep_before_replace.diff
>
>
> Writes time out instead of failing on a cluster with CL-1 replicas available 
> during a replace-node operation.
> With Consistency Level ALL, we are observing Timeout exceptions during writes 
> when (RF - 1) nodes are available in the cluster with one replace-node 
> operation running. The coordinator expects RF + 1 responses, while only RF 
> nodes (RF-1 in UN and 1 in UJ) are available in the cluster, hence the 
> timeout.
> The same problem happens on a keyspace with RF=1, CL=ONE and one replica 
> being replaced, and likewise with RF=3, CL=QUORUM, one replica down and 
> another being replaced.
> I believe the expected behavior is that the write should fail with 
> UnavailableException since there are not enough NORMAL replicas to fulfill 
> the request.
> h4. *Steps to reproduce:*
> Run a 3 node test cluster (call the nodes node1 (127.0.0.1), node2 
> (127.0.0.2), node3 (127.0.0.3)):
> {code:java}
>  ccm create test -v 3.11.3 -n 3 -s
> {code}
> Create test keyspaces with RF = 3 and RF = 1 respectively:
> {code:java}
>  create keyspace rf3 with replication = \{'class': 'SimpleStrategy', 
> 'replication_factor': 3};
>  create keyspace rf1 with replication = \{'class': 'SimpleStrategy', 
> 'replication_factor': 1};
> {code}
> Create a table test in both the keyspaces:
> {code:java}
> create table rf3.test ( pk int primary KEY, value int);
> create table rf1.test ( pk int primary KEY, value int);
> {code}
> Stop node node2:
> {code:java}
> ccm node2 stop
> {code}
> Create node node4:
> {code:java}
> ccm add node4 -i 127.0.0.4
> {code}
> Enable auto_bootstrap
> {code:java}
> ccm node4 updateconf 'auto_bootstrap: true'
> {code}
> Ensure node4 does not have itself in its seeds list.
> Run a replace node to replace node2 (address 127.0.0.2 corresponds to node 
> node2)
> {code:java}
> ccm node4 start --jvm_arg="-Dcassandra.replace_address=127.0.0.2"
> {code}
> While the replacement is running, perform writes/reads with CONSISTENCY ALL; 
> we observed TimeoutExceptions.
> {code:java}
> SET CONSISTENCY ALL: 
> cqlsh> insert into rf3.test (pk, value) values (16, 7);       
> WriteTimeout: Error from server: code=1100 [Coordinator node timed out 
> waiting for replica nodes' responses] message="Operation timed out - received 
> only 3 responses." info=\{'received_responses': 3, 'required_responses': 4, 
> 'consistency': 'ALL'}{code}
> {code:java}
> cqlsh> CONSISTENCY ONE; 
> cqlsh> insert into rf1.test (pk, value) VALUES(5, 1); 
> WriteTimeout: Error from server: code=1100 [Coordinator node timed out 
> waiting for replica nodes' responses] message="Operation timed out - received 
> only 1 responses." info=\{'received_responses': 1, 'required_responses': 2, 
> 'consistency': 'ONE'} 
> {code}
> Cluster State:
> {code:java}
>  Datacenter: datacenter1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  AddressLoad   Tokens   Owns (effective)  Host ID  
>  Rack
> UN  127.0.0.1  70.45 KiB  1100.0%
> 4f652b22-045b-493b-8722-fb5f7e1723ce  rack1
> UN  127.0.0.3  70.43 KiB  1100.0%
> a0dcd677-bdb3-4947-b9a7-14f3686a709f  rack1
> UJ  127.0.0.4  137.47 KiB  1? 
> e3d794f1-081e-4aba-94f2-31950c713846  rack1
> {code}
> Note: 
>  We introduced a sleep during the replace operation in order to carry out 
> our experiments. The attached code diff does this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16266) Stress testing a mixed cluster with C* 2.1.0 (seed) and 2.0.0 causes NPE

2020-11-13 Thread Brandon Williams (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231420#comment-17231420
 ] 

Brandon Williams commented on CASSANDRA-16266:
--

Thank you for your detailed analysis; it will be very helpful for future 
versions.

> Stress testing a mixed cluster with C* 2.1.0 (seed) and 2.0.0 causes NPE
> 
>
> Key: CASSANDRA-16266
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16266
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Yongle Zhang
>Priority: Normal
>
> Steps to reproduce: 
>  # set up a mixed cluster with C* 2.1.0 (seed node) and C* 2.0.0
>  # run the stress testing tool, e.g.,
> {code:java}
> /cassandra/tools/bin/cassandra-stress write n=1000 -rate threads=50 -node 
> 250.16.238.1,250.16.238.2{code}
> NPE: 
> {code:java}
> ERROR [InternalResponseStage:2] 2020-07-22 08:29:36,170 CassandraDaemon.java 
> (line 186) Exception in thread Thread[InternalResponseStage:2,5,main]
> java.lang.NullPointerException
>   at 
> org.apache.cassandra.serializers.BooleanSerializer.deserialize(BooleanSerializer.java:33)
>   at 
> org.apache.cassandra.serializers.BooleanSerializer.deserialize(BooleanSerializer.java:24)
>   at 
> org.apache.cassandra.db.marshal.AbstractType.compose(AbstractType.java:142)
>   at 
> org.apache.cassandra.cql3.UntypedResultSet$Row.getBoolean(UntypedResultSet.java:106)
>   at 
> org.apache.cassandra.config.CFMetaData.fromSchemaNoColumnsNoTriggers(CFMetaData.java:1555)
>   at org.apache.cassandra.config.CFMetaData.fromSchema(CFMetaData.java:1642)
>   at 
> org.apache.cassandra.config.KSMetaData.deserializeColumnFamilies(KSMetaData.java:305)
>   at 
> org.apache.cassandra.db.DefsTables.mergeColumnFamilies(DefsTables.java:270)
>   at org.apache.cassandra.db.DefsTables.mergeSchema(DefsTables.java:183)
>   at 
> org.apache.cassandra.service.MigrationTask$1.response(MigrationTask.java:66)
>   at 
> org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:46)
>   at 
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> Root cause: incompatible data
> In the `CFMetaData` class of version 2.0.0, there is a boolean field named 
> `replicate_on_write`. In the same class of version 2.1.0, however, this field 
> no longer exists. When serializing this class in 
> `toSchemaNoColumnsNoTriggers`, it first writes all of its fields into a 
> `RowMutation` (in 2.0.0) / `Mutation` (in 2.1.0), and then serializes 
> that “Mutation”-like class in the same way. In 2.0.0 the `replicate_on_write` 
> field gets serialized at 
> [https://github.com/apache/cassandra/blob/03045ca22b11b0e5fc85c4fabd83ce6121b5709b/src/java/org/apache/cassandra/config/CFMetaData.java#L1514]
>  .
> When deserializing this class in `fromSchemaNoColumnsNoTriggers`, it reads 
> all of its fields from a map-like class, `UntypedResultSet.Row`. In 2.0.0 
> the `replicate_on_write` field gets deserialized at 
> [https://github.com/apache/cassandra/blob/03045ca22b11b0e5fc85c4fabd83ce6121b5709b/src/java/org/apache/cassandra/config/CFMetaData.java#L1555]
>  .
> The problem is that the existence of the key is not checked, and the map 
> returns a `null` value because the message from 2.1.0 doesn’t contain the 
> `replicate_on_write` key, which leads to the NullPointerException.
>  
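
A simplified sketch of the defect (columns modeled as a plain map rather than 
Cassandra's {{UntypedResultSet.Row}}; the fallback to {{true}} assumes 2.0.0's 
default for replicate_on_write): guarding on the key's presence avoids the NPE 
when a 2.1.0 peer omits the column:

{code:java}
import java.util.Map;

public final class SchemaDeserializeSketch
{
    // Before: the value was composed unconditionally, so a missing column
    // produced null bytes and then a NullPointerException.
    static boolean replicateOnWrite(Map<String, byte[]> row)
    {
        byte[] raw = row.get("replicate_on_write");
        if (raw == null)
            return true; // column absent (message from 2.1.0): fall back to the default
        return raw.length > 0 && raw[0] != 0;
    }
}
{code}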



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16271) Writes timeout instead of failing on cluster with CL-1 replicas available during replace

2020-11-13 Thread Krishna Vadali (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231411#comment-17231411
 ] 

Krishna Vadali commented on CASSANDRA-16271:


Thanks [~samt], looking forward to your patch.

> Writes timeout instead of failing on cluster with CL-1 replicas available 
> during replace
> 
>
> Key: CASSANDRA-16271
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16271
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Coordination
>Reporter: Krishna Vadali
>Assignee: Sam Tunnicliffe
>Priority: Normal
>
> Writes time out instead of failing on a cluster with CL-1 replicas available 
> during a replace-node operation.
> With Consistency Level ALL, we are observing Timeout exceptions during writes 
> when (RF - 1) nodes are available in the cluster with one replace-node 
> operation running. The coordinator expects RF + 1 responses, while only RF 
> nodes (RF-1 in UN and 1 in UJ) are available in the cluster, hence the 
> timeout.
> The same problem happens on a keyspace with RF=1, CL=ONE and one replica 
> being replaced, and likewise with RF=3, CL=QUORUM, one replica down and 
> another being replaced.
> I believe the expected behavior is that the write should fail with 
> UnavailableException since there are not enough NORMAL replicas to fulfill 
> the request.
> h4. *Steps to reproduce:*
> Run a 3 node test cluster (call the nodes node1 (127.0.0.1), node2 
> (127.0.0.2), node3 (127.0.0.3)):
> {code:java}
>  ccm create test -v 3.11.3 -n 3 -s
> {code}
> Create test keyspaces with RF = 3 and RF = 1 respectively:
> {code:java}
>  create keyspace rf3 with replication = \{'class': 'SimpleStrategy', 
> 'replication_factor': 3};
>  create keyspace rf1 with replication = \{'class': 'SimpleStrategy', 
> 'replication_factor': 1};
> {code}
> Create a table test in both the keyspaces:
> {code:java}
> create table rf3.test ( pk int primary KEY, value int);
> create table rf1.test ( pk int primary KEY, value int);
> {code}
> Stop node node2:
> {code:java}
> ccm node2 stop
> {code}
> Create node node4:
> {code:java}
> ccm add node4 -i 127.0.0.4
> {code}
> Enable auto_bootstrap
> {code:java}
> ccm node4 updateconf 'auto_bootstrap: true'
> {code}
> Ensure node4 does not have itself in its seeds list.
> Run a replace node to replace node2 (address 127.0.0.2 corresponds to node 
> node2)
> {code:java}
> ccm node4 start --jvm_arg="-Dcassandra.replace_address=127.0.0.2"
> {code}
> While the replacement is running, perform writes/reads with CONSISTENCY ALL; 
> we observed TimeoutExceptions.
> {code:java}
> SET CONSISTENCY ALL: 
> cqlsh> insert into rf3.test (pk, value) values (16, 7);       
> WriteTimeout: Error from server: code=1100 [Coordinator node timed out 
> waiting for replica nodes' responses] message="Operation timed out - received 
> only 3 responses." info=\{'received_responses': 3, 'required_responses': 4, 
> 'consistency': 'ALL'}{code}
> {code:java}
> cqlsh> CONSISTENCY ONE; 
> cqlsh> insert into rf1.test (pk, value) VALUES(5, 1); 
> WriteTimeout: Error from server: code=1100 [Coordinator node timed out 
> waiting for replica nodes' responses] message="Operation timed out - received 
> only 1 responses." info=\{'received_responses': 1, 'required_responses': 2, 
> 'consistency': 'ONE'} 
> {code}
> Cluster State:
> {code:java}
> Datacenter: datacenter1
> =======================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load        Tokens  Owns (effective)  Host ID                               Rack
> UN  127.0.0.1  70.45 KiB   1       100.0%            4f652b22-045b-493b-8722-fb5f7e1723ce  rack1
> UN  127.0.0.3  70.43 KiB   1       100.0%            a0dcd677-bdb3-4947-b9a7-14f3686a709f  rack1
> UJ  127.0.0.4  137.47 KiB  1       ?                 e3d794f1-081e-4aba-94f2-31950c713846  rack1
> {code}
> Note: 
>  We introduced a sleep during the replace operation in order to carry out 
> our experiments. The attached code diff does this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16271) Writes timeout instead of failing on cluster with CL-1 replicas available during replace

2020-11-13 Thread Krishna Vadali (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krishna Vadali updated CASSANDRA-16271:
---
Reviewers: Krishna Vadali, Paulo Motta  (was: Paulo Motta)

> Writes timeout instead of failing on cluster with CL-1 replicas available 
> during replace
> 
>
> Key: CASSANDRA-16271
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16271
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Coordination
>Reporter: Krishna Vadali
>Assignee: Sam Tunnicliffe
>Priority: Normal
>
> Writes time out instead of failing on a cluster with CL-1 replicas available 
> during a replace-node operation.
> With Consistency Level ALL, we are observing Timeout exceptions during writes 
> when (RF - 1) nodes are available in the cluster with one replace-node 
> operation running. The coordinator expects RF + 1 responses, while only RF 
> nodes (RF-1 in UN and 1 in UJ) are available in the cluster, hence the 
> timeout.
> The same problem happens on a keyspace with RF=1, CL=ONE and one replica 
> being replaced, and likewise with RF=3, CL=QUORUM, one replica down and 
> another being replaced.
> I believe the expected behavior is that the write should fail with 
> UnavailableException since there are not enough NORMAL replicas to fulfill 
> the request.
> h4. *Steps to reproduce:*
> Run a 3 node test cluster (call the nodes node1 (127.0.0.1), node2 
> (127.0.0.2), node3 (127.0.0.3)):
> {code:java}
>  ccm create test -v 3.11.3 -n 3 -s
> {code}
> Create test keyspaces with RF = 3 and RF = 1 respectively:
> {code:java}
>  create keyspace rf3 with replication = \{'class': 'SimpleStrategy', 
> 'replication_factor': 3};
>  create keyspace rf1 with replication = \{'class': 'SimpleStrategy', 
> 'replication_factor': 1};
> {code}
> Create a table test in both the keyspaces:
> {code:java}
> create table rf3.test ( pk int primary KEY, value int);
> create table rf1.test ( pk int primary KEY, value int);
> {code}
> Stop node node2:
> {code:java}
> ccm node2 stop
> {code}
> Create node node4:
> {code:java}
> ccm add node4 -i 127.0.0.4
> {code}
> Enable auto_bootstrap
> {code:java}
> ccm node4 updateconf 'auto_bootstrap: true'
> {code}
> Ensure node4 does not have itself in its seeds list.
> Run a replace node to replace node2 (address 127.0.0.2 corresponds to node 
> node2)
> {code:java}
> ccm node4 start --jvm_arg="-Dcassandra.replace_address=127.0.0.2"
> {code}
> While the replacement is running, perform writes/reads with CONSISTENCY ALL; 
> we observed TimeoutExceptions.
> {code:java}
> SET CONSISTENCY ALL: 
> cqlsh> insert into rf3.test (pk, value) values (16, 7);       
> WriteTimeout: Error from server: code=1100 [Coordinator node timed out 
> waiting for replica nodes' responses] message="Operation timed out - received 
> only 3 responses." info=\{'received_responses': 3, 'required_responses': 4, 
> 'consistency': 'ALL'}{code}
> {code:java}
> cqlsh> CONSISTENCY ONE; 
> cqlsh> insert into rf1.test (pk, value) VALUES(5, 1); 
> WriteTimeout: Error from server: code=1100 [Coordinator node timed out 
> waiting for replica nodes' responses] message="Operation timed out - received 
> only 1 responses." info=\{'received_responses': 1, 'required_responses': 2, 
> 'consistency': 'ONE'} 
> {code}
> Cluster State:
> {code:java}
> Datacenter: datacenter1
> =======================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load        Tokens  Owns (effective)  Host ID                               Rack
> UN  127.0.0.1  70.45 KiB   1       100.0%            4f652b22-045b-493b-8722-fb5f7e1723ce  rack1
> UN  127.0.0.3  70.43 KiB   1       100.0%            a0dcd677-bdb3-4947-b9a7-14f3686a709f  rack1
> UJ  127.0.0.4  137.47 KiB  1       ?                 e3d794f1-081e-4aba-94f2-31950c713846  rack1
> {code}
> Note: 
>  We introduced a sleep during the replace operation in order to carry out 
> our experiments. The attached code diff does this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16271) Writes timeout instead of failing on cluster with CL-1 replicas available during replace

2020-11-13 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-16271:

Reviewers: Paulo Motta

> Writes timeout instead of failing on cluster with CL-1 replicas available 
> during replace
> 
>
> Key: CASSANDRA-16271
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16271
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Coordination
>Reporter: Krishna Vadali
>Assignee: Sam Tunnicliffe
>Priority: Normal
>
> Writes time out instead of failing on a cluster with CL-1 replicas available 
> during a replace-node operation.
> With Consistency Level ALL, we are observing Timeout exceptions during writes 
> when (RF - 1) nodes are available in the cluster with one replace-node 
> operation running. The coordinator expects RF + 1 responses, while only RF 
> nodes (RF-1 in UN and 1 in UJ) are available in the cluster, hence the 
> timeout.
> The same problem happens on a keyspace with RF=1, CL=ONE and one replica 
> being replaced, and likewise with RF=3, CL=QUORUM, one replica down and 
> another being replaced.
> I believe the expected behavior is that the write should fail with 
> UnavailableException since there are not enough NORMAL replicas to fulfill 
> the request.
> h4. *Steps to reproduce:*
> Run a 3 node test cluster (call the nodes node1 (127.0.0.1), node2 
> (127.0.0.2), node3 (127.0.0.3)):
> {code:java}
>  ccm create test -v 3.11.3 -n 3 -s
> {code}
> Create test keyspaces with RF = 3 and RF = 1 respectively:
> {code:java}
>  create keyspace rf3 with replication = \{'class': 'SimpleStrategy', 
> 'replication_factor': 3};
>  create keyspace rf1 with replication = \{'class': 'SimpleStrategy', 
> 'replication_factor': 1};
> {code}
> Create a table test in both the keyspaces:
> {code:java}
> create table rf3.test ( pk int primary KEY, value int);
> create table rf1.test ( pk int primary KEY, value int);
> {code}
> Stop node node2:
> {code:java}
> ccm node2 stop
> {code}
> Create node node4:
> {code:java}
> ccm add node4 -i 127.0.0.4
> {code}
> Enable auto_bootstrap
> {code:java}
> ccm node4 updateconf 'auto_bootstrap: true'
> {code}
> Ensure node4 does not have itself in its seeds list.
> Run a replace node to replace node2 (address 127.0.0.2 corresponds to node 
> node2)
> {code:java}
> ccm node4 start --jvm_arg="-Dcassandra.replace_address=127.0.0.2"
> {code}
> While the replacement is running, perform writes/reads with CONSISTENCY ALL; 
> we observed TimeoutExceptions.
> {code:java}
> SET CONSISTENCY ALL: 
> cqlsh> insert into rf3.test (pk, value) values (16, 7);       
> WriteTimeout: Error from server: code=1100 [Coordinator node timed out 
> waiting for replica nodes' responses] message="Operation timed out - received 
> only 3 responses." info=\{'received_responses': 3, 'required_responses': 4, 
> 'consistency': 'ALL'}{code}
> {code:java}
> cqlsh> CONSISTENCY ONE; 
> cqlsh> insert into rf1.test (pk, value) VALUES(5, 1); 
> WriteTimeout: Error from server: code=1100 [Coordinator node timed out 
> waiting for replica nodes' responses] message="Operation timed out - received 
> only 1 responses." info=\{'received_responses': 1, 'required_responses': 2, 
> 'consistency': 'ONE'} 
> {code}
> Cluster State:
> {code:java}
> Datacenter: datacenter1
> =======================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load        Tokens  Owns (effective)  Host ID                               Rack
> UN  127.0.0.1  70.45 KiB   1       100.0%            4f652b22-045b-493b-8722-fb5f7e1723ce  rack1
> UN  127.0.0.3  70.43 KiB   1       100.0%            a0dcd677-bdb3-4947-b9a7-14f3686a709f  rack1
> UJ  127.0.0.4  137.47 KiB  1       ?                 e3d794f1-081e-4aba-94f2-31950c713846  rack1
> {code}
> Note: 
>  We introduced a sleep during the replace operation in order to carry out 
> our experiments. The attached code diff does this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16271) Writes timeout instead of failing on cluster with CL-1 replicas available during replace

2020-11-13 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231406#comment-17231406
 ] 

Paulo Motta commented on CASSANDRA-16271:
-

Thanks for taking this, Sam. I'd be happy to review it, as I'm also familiar 
with this issue.

> Writes timeout instead of failing on cluster with CL-1 replicas available 
> during replace
> 
>
> Key: CASSANDRA-16271
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16271
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Coordination
>Reporter: Krishna Vadali
>Assignee: Sam Tunnicliffe
>Priority: Normal
>
> Writes time out instead of failing on a cluster with CL-1 replicas available 
> during a replace-node operation.
> With Consistency Level ALL, we are observing Timeout exceptions during writes 
> when (RF - 1) nodes are available in the cluster with one replace-node 
> operation running. The coordinator expects RF + 1 responses, while only RF 
> nodes (RF-1 in UN and 1 in UJ) are available in the cluster, hence the 
> timeout.
> The same problem happens on a keyspace with RF=1, CL=ONE and one replica 
> being replaced, and likewise with RF=3, CL=QUORUM, one replica down and 
> another being replaced.
> I believe the expected behavior is that the write should fail with 
> UnavailableException since there are not enough NORMAL replicas to fulfill 
> the request.
> h4. *Steps to reproduce:*
> Run a 3 node test cluster (call the nodes node1 (127.0.0.1), node2 
> (127.0.0.2), node3 (127.0.0.3)):
> {code:java}
>  ccm create test -v 3.11.3 -n 3 -s
> {code}
> Create test keyspaces with RF = 3 and RF = 1 respectively:
> {code:java}
>  create keyspace rf3 with replication = \{'class': 'SimpleStrategy', 
> 'replication_factor': 3};
>  create keyspace rf1 with replication = \{'class': 'SimpleStrategy', 
> 'replication_factor': 1};
> {code}
> Create a table test in both the keyspaces:
> {code:java}
> create table rf3.test ( pk int primary KEY, value int);
> create table rf1.test ( pk int primary KEY, value int);
> {code}
> Stop node node2:
> {code:java}
> ccm node2 stop
> {code}
> Create node node4:
> {code:java}
> ccm add node4 -i 127.0.0.4
> {code}
> Enable auto_bootstrap
> {code:java}
> ccm node4 updateconf 'auto_bootstrap: true'
> {code}
> Ensure node4 does not have itself in its seeds list.
> Run a replace node to replace node2 (address 127.0.0.2 corresponds to node 
> node2)
> {code:java}
> ccm node4 start --jvm_arg="-Dcassandra.replace_address=127.0.0.2"
> {code}
> While the replacement is running, perform writes/reads with CONSISTENCY ALL; 
> we observed TimeoutExceptions.
> {code:java}
> SET CONSISTENCY ALL: 
> cqlsh> insert into rf3.test (pk, value) values (16, 7);       
> WriteTimeout: Error from server: code=1100 [Coordinator node timed out 
> waiting for replica nodes' responses] message="Operation timed out - received 
> only 3 responses." info=\{'received_responses': 3, 'required_responses': 4, 
> 'consistency': 'ALL'}{code}
> {code:java}
> cqlsh> CONSISTENCY ONE; 
> cqlsh> insert into rf1.test (pk, value) VALUES(5, 1); 
> WriteTimeout: Error from server: code=1100 [Coordinator node timed out 
> waiting for replica nodes' responses] message="Operation timed out - received 
> only 1 responses." info=\{'received_responses': 1, 'required_responses': 2, 
> 'consistency': 'ONE'} 
> {code}
> Cluster State:
> {code:java}
> Datacenter: datacenter1
> =======================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load        Tokens  Owns (effective)  Host ID                               Rack
> UN  127.0.0.1  70.45 KiB   1       100.0%            4f652b22-045b-493b-8722-fb5f7e1723ce  rack1
> UN  127.0.0.3  70.43 KiB   1       100.0%            a0dcd677-bdb3-4947-b9a7-14f3686a709f  rack1
> UJ  127.0.0.4  137.47 KiB  1       ?                 e3d794f1-081e-4aba-94f2-31950c713846  rack1
> {code}
> Note: 
>  We introduced a sleep during the replace operation in order to carry out 
> our experiments. The attached code diff does this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Assigned] (CASSANDRA-16271) Writes timeout instead of failing on cluster with CL-1 replicas available during replace

2020-11-13 Thread Sam Tunnicliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sam Tunnicliffe reassigned CASSANDRA-16271:
---

Assignee: Sam Tunnicliffe

> Writes timeout instead of failing on cluster with CL-1 replicas available 
> during replace
> 
>
> Key: CASSANDRA-16271
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16271
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Coordination
>Reporter: Krishna Vadali
>Assignee: Sam Tunnicliffe
>Priority: Normal
>
> Writes time out instead of failing on a cluster with CL-1 replicas available 
> during a replace-node operation.
> With Consistency Level ALL, we are observing Timeout exceptions during writes 
> when (RF - 1) nodes are available in the cluster with one replace-node 
> operation running. The coordinator expects RF + 1 responses, while only RF 
> nodes (RF-1 in UN and 1 in UJ) are available in the cluster, hence the 
> timeout.
> The same problem happens on a keyspace with RF=1, CL=ONE and one replica 
> being replaced, and likewise with RF=3, CL=QUORUM, one replica down and 
> another being replaced.
> I believe the expected behavior is that the write should fail with 
> UnavailableException since there are not enough NORMAL replicas to fulfill 
> the request.
> h4. *Steps to reproduce:*
> Run a 3 node test cluster (call the nodes node1 (127.0.0.1), node2 
> (127.0.0.2), node3 (127.0.0.3)):
> {code:java}
>  ccm create test -v 3.11.3 -n 3 -s
> {code}
> Create test keyspaces with RF = 3 and RF = 1 respectively:
> {code:java}
>  create keyspace rf3 with replication = \{'class': 'SimpleStrategy', 
> 'replication_factor': 3};
>  create keyspace rf1 with replication = \{'class': 'SimpleStrategy', 
> 'replication_factor': 1};
> {code}
> Create a table test in both the keyspaces:
> {code:java}
> create table rf3.test ( pk int primary KEY, value int);
> create table rf1.test ( pk int primary KEY, value int);
> {code}
> Stop node node2:
> {code:java}
> ccm node2 stop
> {code}
> Create node node4:
> {code:java}
> ccm add node4 -i 127.0.0.4
> {code}
> Enable auto_bootstrap
> {code:java}
> ccm node4 updateconf 'auto_bootstrap: true'
> {code}
> Ensure node4 does not have itself in its seeds list.
> Run a replace node to replace node2 (address 127.0.0.2 corresponds to node 
> node2)
> {code:java}
> ccm node4 start --jvm_arg="-Dcassandra.replace_address=127.0.0.2"
> {code}
> While the replacement is running, perform writes/reads with CONSISTENCY ALL; 
> we observed TimeoutExceptions.
> {code:java}
> SET CONSISTENCY ALL: 
> cqlsh> insert into rf3.test (pk, value) values (16, 7);       
> WriteTimeout: Error from server: code=1100 [Coordinator node timed out 
> waiting for replica nodes' responses] message="Operation timed out - received 
> only 3 responses." info=\{'received_responses': 3, 'required_responses': 4, 
> 'consistency': 'ALL'}{code}
> {code:java}
> cqlsh> CONSISTENCY ONE; 
> cqlsh> insert into rf1.test (pk, value) VALUES(5, 1); 
> WriteTimeout: Error from server: code=1100 [Coordinator node timed out 
> waiting for replica nodes' responses] message="Operation timed out - received 
> only 1 responses." info=\{'received_responses': 1, 'required_responses': 2, 
> 'consistency': 'ONE'} 
> {code}
> Cluster State:
> {code:java}
>  Datacenter: datacenter1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load        Tokens  Owns (effective)  Host ID                               Rack
> UN  127.0.0.1  70.45 KiB   1       100.0%            4f652b22-045b-493b-8722-fb5f7e1723ce  rack1
> UN  127.0.0.3  70.43 KiB   1       100.0%            a0dcd677-bdb3-4947-b9a7-14f3686a709f  rack1
> UJ  127.0.0.4  137.47 KiB  1       ?                 e3d794f1-081e-4aba-94f2-31950c713846  rack1
> {code}
> Note: 
>  We introduced a sleep during the replace operation in order to carry out our 
> experiments. We attached a code diff that does this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16271) Writes timeout instead of failing on cluster with CL-1 replicas available during replace

2020-11-13 Thread Sam Tunnicliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sam Tunnicliffe updated CASSANDRA-16271:

Status: Open  (was: Triage Needed)

> Writes timeout instead of failing on cluster with CL-1 replicas available 
> during replace
> 
>
> Key: CASSANDRA-16271
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16271
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Coordination
>Reporter: Krishna Vadali
>Priority: Normal
>
> Writes timeout instead of failing on cluster with CL-1 replicas available 
> during replace node operation.
> With Consistency Level ALL, we are observing Timeout exceptions during writes 
> when (RF - 1) nodes are available in the cluster with one replace-node 
> operation running. The coordinator expects RF + 1 responses, while only RF 
> nodes (RF-1 nodes in UN and 1 node in UJ) are available in the cluster, hence 
> the timeout.
> The same problem happens on a keyspace with RF=1, CL=ONE and one replica 
> being replaced. Also RF=3, CL=QUORUM, one replica down and another being 
> replaced.
> I believe the expected behavior is that the write should fail with 
> UnavailableException since there are not enough NORMAL replicas to fulfill 
> the request.
> h4. *Steps to reproduce:*
> Run a 3 node test cluster (call the nodes node1 (127.0.0.1), node2 
> (127.0.0.2), node3 (127.0.0.3)):
> {code:java}
>  ccm create test -v 3.11.3 -n 3 -s
> {code}
> Create test keyspaces with RF = 3 and RF = 1 respectively:
> {code:java}
>  create keyspace rf3 with replication = \{'class': 'SimpleStrategy', 
> 'replication_factor': 3};
>  create keyspace rf1 with replication = \{'class': 'SimpleStrategy', 
> 'replication_factor': 1};
> {code}
> Create a table test in both the keyspaces:
> {code:java}
> create table rf3.test ( pk int primary KEY, value int);
> create table rf1.test ( pk int primary KEY, value int);
> {code}
> Stop node node2:
> {code:java}
> ccm node2 stop
> {code}
> Create node node4:
> {code:java}
> ccm add node4 -i 127.0.0.4
> {code}
> Enable auto_bootstrap
> {code:java}
> ccm node4 updateconf 'auto_bootstrap: true'
> {code}
> Ensure node4 does not have itself in its seeds list.
> Run a replace operation to replace node2 (address 127.0.0.2 corresponds to 
> node2):
> {code:java}
> ccm node4 start --jvm_arg="-Dcassandra.replace_address=127.0.0.2"
> {code}
> While the replace is running, perform writes/reads with CONSISTENCY ALL; we 
> observed TimeoutExceptions.
> {code:java}
> cqlsh> CONSISTENCY ALL;
> cqlsh> insert into rf3.test (pk, value) values (16, 7);       
> WriteTimeout: Error from server: code=1100 [Coordinator node timed out 
> waiting for replica nodes' responses] message="Operation timed out - received 
> only 3 responses." info=\{'received_responses': 3, 'required_responses': 4, 
> 'consistency': 'ALL'}{code}
> {code:java}
> cqlsh> CONSISTENCY ONE; 
> cqlsh> insert into rf1.test (pk, value) VALUES(5, 1); 
> WriteTimeout: Error from server: code=1100 [Coordinator node timed out 
> waiting for replica nodes' responses] message="Operation timed out - received 
> only 1 responses." info=\{'received_responses': 1, 'required_responses': 2, 
> 'consistency': 'ONE'} 
> {code}
> Cluster State:
> {code:java}
>  Datacenter: datacenter1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load        Tokens  Owns (effective)  Host ID                               Rack
> UN  127.0.0.1  70.45 KiB   1       100.0%            4f652b22-045b-493b-8722-fb5f7e1723ce  rack1
> UN  127.0.0.3  70.43 KiB   1       100.0%            a0dcd677-bdb3-4947-b9a7-14f3686a709f  rack1
> UJ  127.0.0.4  137.47 KiB  1       ?                 e3d794f1-081e-4aba-94f2-31950c713846  rack1
> {code}
> Note: 
>  We introduced a sleep during the replace operation in order to carry out our 
> experiments. We attached a code diff that does this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16271) Writes timeout instead of failing on cluster with CL-1 replicas available during replace

2020-11-13 Thread Sam Tunnicliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231405#comment-17231405
 ] 

Sam Tunnicliffe commented on CASSANDRA-16271:
-

I'm pretty sure this is fixed in trunk as a side effect of the rework to 
replication for Transient Replication (CASSANDRA-14404). I'm familiar with this 
issue, so I'll try and post a patch shortly.
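For anyone following along, here is a minimal, hypothetical sketch of the check 
the reporter expects: count only live NORMAL replicas when deciding whether the 
consistency level can be met, and fail fast instead of waiting out a write 
timeout. All names and types below are illustrative; this is not the actual 
StorageProxy/ReplicaPlan code.

{code:java}
import java.util.List;

public final class AvailabilityCheckSketch
{
    enum NodeState { NORMAL, JOINING, DOWN }

    record Replica(String address, NodeState state) {}

    // Fail fast when fewer NORMAL replicas are alive than the CL requires;
    // real Cassandra would throw UnavailableException at this point.
    static void assureSufficientLiveReplicas(List<Replica> plan, int blockFor)
    {
        long liveNormal = plan.stream()
                              .filter(r -> r.state() == NodeState.NORMAL)
                              .count();
        if (liveNormal < blockFor)
            throw new IllegalStateException(
                "Unavailable: required " + blockFor + ", alive " + liveNormal);
    }

    public static void main(String[] args)
    {
        // RF=3, CL=ALL (blockFor=3): one replica down, one still joining.
        List<Replica> plan = List.of(new Replica("127.0.0.1", NodeState.NORMAL),
                                     new Replica("127.0.0.3", NodeState.NORMAL),
                                     new Replica("127.0.0.4", NodeState.JOINING));
        assureSufficientLiveReplicas(plan, 3); // throws instead of timing out
    }
}
{code}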

> Writes timeout instead of failing on cluster with CL-1 replicas available 
> during replace
> 
>
> Key: CASSANDRA-16271
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16271
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Coordination
>Reporter: Krishna Vadali
>Priority: Normal
>
> Writes timeout instead of failing on cluster with CL-1 replicas available 
> during replace node operation.
> With Consistency Level ALL, we are observing Timeout exceptions during writes 
> when (RF - 1) nodes are available in the cluster with one replace-node 
> operation running. The coordinator expects RF + 1 responses, while only RF 
> nodes (RF-1 nodes in UN and 1 node in UJ) are available in the cluster, hence 
> the timeout.
> The same problem happens on a keyspace with RF=1, CL=ONE and one replica 
> being replaced. Also RF=3, CL=QUORUM, one replica down and another being 
> replaced.
> I believe the expected behavior is that the write should fail with 
> UnavailableException since there are not enough NORMAL replicas to fulfill 
> the request.
> h4. *Steps to reproduce:*
> Run a 3 node test cluster (call the nodes node1 (127.0.0.1), node2 
> (127.0.0.2), node3 (127.0.0.3)):
> {code:java}
>  ccm create test -v 3.11.3 -n 3 -s
> {code}
> Create test keyspaces with RF = 3 and RF = 1 respectively:
> {code:java}
>  create keyspace rf3 with replication = \{'class': 'SimpleStrategy', 
> 'replication_factor': 3};
>  create keyspace rf1 with replication = \{'class': 'SimpleStrategy', 
> 'replication_factor': 1};
> {code}
> Create a table test in both the keyspaces:
> {code:java}
> create table rf3.test ( pk int primary KEY, value int);
> create table rf1.test ( pk int primary KEY, value int);
> {code}
> Stop node node2:
> {code:java}
> ccm node2 stop
> {code}
> Create node node4:
> {code:java}
> ccm add node4 -i 127.0.0.4
> {code}
> Enable auto_bootstrap
> {code:java}
> ccm node4 updateconf 'auto_bootstrap: true'
> {code}
> Ensure node4 does not have itself in its seeds list.
> Run a replace operation to replace node2 (address 127.0.0.2 corresponds to 
> node2):
> {code:java}
> ccm node4 start --jvm_arg="-Dcassandra.replace_address=127.0.0.2"
> {code}
> While the replace is running, perform writes/reads with CONSISTENCY ALL; we 
> observed TimeoutExceptions.
> {code:java}
> cqlsh> CONSISTENCY ALL;
> cqlsh> insert into rf3.test (pk, value) values (16, 7);       
> WriteTimeout: Error from server: code=1100 [Coordinator node timed out 
> waiting for replica nodes' responses] message="Operation timed out - received 
> only 3 responses." info=\{'received_responses': 3, 'required_responses': 4, 
> 'consistency': 'ALL'}{code}
> {code:java}
> cqlsh> CONSISTENCY ONE; 
> cqlsh> insert into rf1.test (pk, value) VALUES(5, 1); 
> WriteTimeout: Error from server: code=1100 [Coordinator node timed out 
> waiting for replica nodes' responses] message="Operation timed out - received 
> only 1 responses." info=\{'received_responses': 1, 'required_responses': 2, 
> 'consistency': 'ONE'} 
> {code}
> Cluster State:
> {code:java}
>  Datacenter: datacenter1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load        Tokens  Owns (effective)  Host ID                               Rack
> UN  127.0.0.1  70.45 KiB   1       100.0%            4f652b22-045b-493b-8722-fb5f7e1723ce  rack1
> UN  127.0.0.3  70.43 KiB   1       100.0%            a0dcd677-bdb3-4947-b9a7-14f3686a709f  rack1
> UJ  127.0.0.4  137.47 KiB  1       ?                 e3d794f1-081e-4aba-94f2-31950c713846  rack1
> {code}
> Note: 
>  We introduced a sleep during the replace operation in order to carry out our 
> experiments. We attached a code diff that does this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16271) Writes timeout instead of failing on cluster with CL-1 replicas available during replace

2020-11-13 Thread Paulo Motta (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-16271:

 Bug Category: Parent values: Correctness(12982), Level 1 values: API / 
Semantic Implementation(12988)
   Complexity: Normal
Discovered By: User Report
 Severity: Normal
Since Version: 2.2.8

> Writes timeout instead of failing on cluster with CL-1 replicas available 
> during replace
> 
>
> Key: CASSANDRA-16271
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16271
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Coordination
>Reporter: Krishna Vadali
>Priority: Normal
>
> Writes timeout instead of failing on cluster with CL-1 replicas available 
> during replace node operation.
> With Consistency Level ALL, we are observing Timeout exceptions during writes 
> when (RF - 1) nodes are available in the cluster with one replace-node 
> operation running. The coordinator expects RF + 1 responses, while only RF 
> nodes (RF-1 nodes in UN and 1 node in UJ) are available in the cluster, hence 
> the timeout.
> The same problem happens on a keyspace with RF=1, CL=ONE and one replica 
> being replaced. Also RF=3, CL=QUORUM, one replica down and another being 
> replaced.
> I believe the expected behavior is that the write should fail with 
> UnavailableException since there are not enough NORMAL replicas to fulfill 
> the request.
> h4. *Steps to reproduce:*
> Run a 3 node test cluster (call the nodes node1 (127.0.0.1), node2 
> (127.0.0.2), node3 (127.0.0.3)):
> {code:java}
>  ccm create test -v 3.11.3 -n 3 -s
> {code}
> Create test keyspaces with RF = 3 and RF = 1 respectively:
> {code:java}
>  create keyspace rf3 with replication = \{'class': 'SimpleStrategy', 
> 'replication_factor': 3};
>  create keyspace rf1 with replication = \{'class': 'SimpleStrategy', 
> 'replication_factor': 1};
> {code}
> Create a table test in both the keyspaces:
> {code:java}
> create table rf3.test ( pk int primary KEY, value int);
> create table rf1.test ( pk int primary KEY, value int);
> {code}
> Stop node node2:
> {code:java}
> ccm node2 stop
> {code}
> Create node node4:
> {code:java}
> ccm add node4 -i 127.0.0.4
> {code}
> Enable auto_bootstrap
> {code:java}
> ccm node4 updateconf 'auto_bootstrap: true'
> {code}
> Ensure node4 does not have itself in its seeds list.
> Run a replace operation to replace node2 (address 127.0.0.2 corresponds to 
> node2):
> {code:java}
> ccm node4 start --jvm_arg="-Dcassandra.replace_address=127.0.0.2"
> {code}
> While the replace is running, perform writes/reads with CONSISTENCY ALL; we 
> observed TimeoutExceptions.
> {code:java}
> cqlsh> CONSISTENCY ALL;
> cqlsh> insert into rf3.test (pk, value) values (16, 7);       
> WriteTimeout: Error from server: code=1100 [Coordinator node timed out 
> waiting for replica nodes' responses] message="Operation timed out - received 
> only 3 responses." info=\{'received_responses': 3, 'required_responses': 4, 
> 'consistency': 'ALL'}{code}
> {code:java}
> cqlsh> CONSISTENCY ONE; 
> cqlsh> insert into rf1.test (pk, value) VALUES(5, 1); 
> WriteTimeout: Error from server: code=1100 [Coordinator node timed out 
> waiting for replica nodes' responses] message="Operation timed out - received 
> only 1 responses." info=\{'received_responses': 1, 'required_responses': 2, 
> 'consistency': 'ONE'} 
> {code}
> Cluster State:
> {code:java}
>  Datacenter: datacenter1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load        Tokens  Owns (effective)  Host ID                               Rack
> UN  127.0.0.1  70.45 KiB   1       100.0%            4f652b22-045b-493b-8722-fb5f7e1723ce  rack1
> UN  127.0.0.3  70.43 KiB   1       100.0%            a0dcd677-bdb3-4947-b9a7-14f3686a709f  rack1
> UJ  127.0.0.4  137.47 KiB  1       ?                 e3d794f1-081e-4aba-94f2-31950c713846  rack1
> {code}
> Note: 
>  We introduced a sleep during the replace operation in order to carry out our 
> experiments. We attached a code diff that does this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16013) sstablescrub unit test hardening and docs improvements

2020-11-13 Thread Berenguer Blasi (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Berenguer Blasi updated CASSANDRA-16013:

Summary: sstablescrub unit test hardening and docs improvements  (was: 
sstablescrub unit test hardening an docs improvements)

> sstablescrub unit test hardening and docs improvements
> --
>
> Key: CASSANDRA-16013
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16013
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tool/sstable
>Reporter: Berenguer Blasi
>Assignee: Berenguer Blasi
>Priority: Normal
> Fix For: 4.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> During CASSANDRA-15883 / CASSANDRA-15991 it was detected that unit test 
> coverage for this tool is minimal. There is a unit test to enhance upon under 
> {{test/unit/org/apache/cassandra/tools}}. The docs also need updating to 
> reflect the latest options available.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15580) 4.0 quality testing: Repair

2020-11-13 Thread Marcus Eriksson (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231396#comment-17231396
 ] 

Marcus Eriksson commented on CASSANDRA-15580:
-

We should probably wait for CASSANDRA-16274 before testing with {{-os}}.

> 4.0 quality testing: Repair
> ---
>
> Key: CASSANDRA-15580
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15580
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/dtest/python
>Reporter: Josh McKenzie
>Assignee: Alexander Dejanovski
>Priority: Normal
> Fix For: 4.0-rc
>
>
> Reference [doc from 
> NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#]
>  for context.
> *Shepherd: Alexander Dejanovski*
> We aim for 4.0 to have the first fully functioning incremental repair 
> solution (CASSANDRA-9143)! Furthermore, we aim to verify that all types of 
> repair (full range, sub range, incremental) function as expected, as well as 
> ensuring that community tools such as Reaper work. CASSANDRA-3200 adds an 
> experimental option to reduce the amount of data streamed during repair; we 
> should write more tests and see how it works with big nodes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16185) Add tests to cover CommitLog metrics

2020-11-13 Thread Yasar Arafath Baigh (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231382#comment-17231382
 ] 

Yasar Arafath Baigh commented on CASSANDRA-16185:
-

The CommitLogMetrics test-case patch is attached.

> Add tests to cover CommitLog metrics
> 
>
> Key: CASSANDRA-16185
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16185
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Test/unit
>Reporter: Benjamin Lerer
>Assignee: Yasar Arafath Baigh
>Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: 0001-Unit-Test-cases-for-CommitLogMetrics.patch
>
>
> The only CommitLog metric that seems to be covered by a unit test is 
> {{oversizedMutations}}. We should add tests for the other ones.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-15957) org.apache.cassandra.repair.RepairJobTest testOptimizedCreateStandardSyncTasks

2020-11-13 Thread Marcus Eriksson (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcus Eriksson updated CASSANDRA-15957:

Resolution: Fixed
Status: Resolved  (was: Open)

CASSANDRA-16274 contains a fix for this

> org.apache.cassandra.repair.RepairJobTest testOptimizedCreateStandardSyncTasks
> --
>
> Key: CASSANDRA-15957
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15957
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/unit
>Reporter: David Capwell
>Assignee: Marcus Eriksson
>Priority: Normal
> Fix For: 4.0-beta
>
>
> Build: 
> https://ci-cassandra.apache.org/job/Cassandra-trunk-test/lastCompletedBuild/testReport/junit/org.apache.cassandra.repair/RepairJobTest/testOptimizedCreateStandardSyncTasks/
> Expecting:
>  <[#,
>#]>
> to contain only:
>  <[(,0001]]>
> but the following elements were unexpected:
>  <[#]>
> This failed 3 times in a row on Jenkins



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-16274) Improve performance when calculating StreamTasks with optimised streaming

2020-11-13 Thread Marcus Eriksson (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231378#comment-17231378
 ] 

Marcus Eriksson edited comment on CASSANDRA-16274 at 11/13/20, 11:02 AM:
-

patch: https://github.com/krummas/cassandra/commits/marcuse/3200opt
cci: 
https://app.circleci.com/pipelines/github/krummas/cassandra?branch=marcuse%2F3200opt

A few commits in this; the basic idea for the optimisations is to iterate only 
over the ranges that can overlap, instead of over all diffing ranges (see the 
sketch below).
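As a purely illustrative sketch of that idea (hypothetical names, plain long 
bounds, ignoring the wraparound that real token ranges have), a merge-style 
sweep over two start-sorted range lists touches only ranges that can 
intersect, O(n + m) instead of the all-pairs O(n * m):

{code:java}
import java.util.ArrayList;
import java.util.List;

public final class RangeOverlapSketch
{
    record Range(long left, long right) {}

    // Both inputs must be sorted by left bound and non-overlapping internally.
    static List<Range> intersections(List<Range> a, List<Range> b)
    {
        List<Range> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size())
        {
            Range x = a.get(i), y = b.get(j);
            long lo = Math.max(x.left(), y.left());
            long hi = Math.min(x.right(), y.right());
            if (lo < hi)
                out.add(new Range(lo, hi));  // the overlapping piece
            if (x.right() < y.right()) i++;  // advance whichever range ends first
            else j++;
        }
        return out;
    }
}
{code}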

When picking endpoints to stream from, we always pick the next node sorted by 
IP address - it does not matter which node we pick, as long as every node picks 
the same one (sketched below).
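And a minimal sketch of that deterministic pick, simplified to "take the first 
candidate in sorted order" (illustrative names, and plain string ordering 
rather than a real address comparator); the point is only that every node 
independently arrives at the same choice:

{code:java}
import java.util.Comparator;
import java.util.List;

public final class StreamSourceSketch
{
    // Sorting the candidates gives every diffing node the same answer, so
    // streams concentrate on a single peer instead of spreading sstables
    // across many peers.
    static String pickSource(List<String> candidateAddresses)
    {
        return candidateAddresses.stream()
                                 .min(Comparator.naturalOrder())
                                 .orElseThrow();
    }

    public static void main(String[] args)
    {
        System.out.println(pickSource(List.of("127.0.0.3", "127.0.0.1"))); // 127.0.0.1
        System.out.println(pickSource(List.of("127.0.0.1", "127.0.0.3"))); // 127.0.0.1
    }
}
{code}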

This branch also contains a fix for CASSANDRA-15957


was (Author: krummas):
patch: https://github.com/krummas/cassandra/commits/marcuse/3200opt
cci: 
https://app.circleci.com/pipelines/github/krummas/cassandra?branch=marcuse%2F3200opt

> Improve performance when calculating StreamTasks with optimised streaming
> -
>
> Key: CASSANDRA-16274
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16274
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Consistency/Repair
>Reporter: Marcus Eriksson
>Assignee: Marcus Eriksson
>Priority: Normal
> Fix For: 4.0-beta4
>
>
> The way stream tasks are calculated is currently quite inefficient; improve 
> that.
> Also, we currently try to distribute the streaming nodes evenly, which 
> creates many more sstables than necessary - instead we should try to stream 
> everything from a single peer, which should reduce the number of sstables 
> created on the out-of-sync node.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-16259) tablehistograms cause ArrayIndexOutOfBoundsException

2020-11-13 Thread Benjamin Lerer (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231379#comment-17231379
 ] 

Benjamin Lerer edited comment on CASSANDRA-16259 at 11/13/20, 11:02 AM:


{quote}If I understand the change within CASSANDRA-15164 right, then the 
storage format of SSTable statistics has changed, which should also bump the 
SSTable version, shouldn't it?{quote}

The storage format did not change. The encoding was already 
{{\*}}. The old code is able to read the 
statistics produced by the new one. The problem is only at the metric level, 
where C* tries to merge SSTable histograms that have a different number of 
buckets using some buggy code.

When you scrub the old SSTable, it is recreated with the new number of buckets, 
ensuring that you will not hit the TableMetric bug again.


was (Author: blerer):
{quote}If I understand the change within CASSANDRA-15164 right, then the 
storage format of SSTable statistics has changed, which should also bump the 
SSTable version, shouldn't it?{quote}

The storage format did not change. The encoding was already 
{{*}}. The old code is able to read the 
statistics produced by the new one. The problem is only at the metric level, 
where C* tries to merge SSTable histograms that have a different number of 
buckets using some buggy code.

When you scrub the old SSTable, it is recreated with the new number of buckets, 
ensuring that you will not hit the TableMetric bug again.

> tablehistograms cause ArrayIndexOutOfBoundsException
> 
>
> Key: CASSANDRA-16259
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16259
> Project: Cassandra
>  Issue Type: Bug
>  Components: Observability/Metrics
>Reporter: Justin Montgomery
>Assignee: Benjamin Lerer
>Priority: Normal
> Fix For: 2.2.x, 3.0.x, 3.11.x, 4.0-beta
>
>
> After upgrading some nodes in our cluster from 3.11.8 to 3.11.9 an error 
> appeared on the upgraded nodes when trying to access *tablehistograms*. The 
> same command run on our .8 nodes returns as expected; only the upgraded .9 
> nodes fail. Not all tables fail when queried, but about 90% of them do.
> We use Datastax MCAC, which appears to query histograms every 30 seconds; 
> this outputs to the system.log:
> {noformat}
> WARN  [insights-3-1] 2020-11-09 01:11:22,331 UnixSocketClient.java:830 - 
> Error reporting:
> java.lang.ArrayIndexOutOfBoundsException: 115
> at 
> org.apache.cassandra.metrics.TableMetrics.combineHistograms(TableMetrics.java:261)
>  ~[apache-cassandra-3.11.9.jar:3.11.9]
> at 
> org.apache.cassandra.metrics.TableMetrics.access$000(TableMetrics.java:48) 
> ~[apache-cassandra-3.11.9.jar:3.11.9]
> at 
> org.apache.cassandra.metrics.TableMetrics$11.getValue(TableMetrics.java:376) 
> ~[apache-cassandra-3.11.9.jar:3.11.9]
> at 
> org.apache.cassandra.metrics.TableMetrics$11.getValue(TableMetrics.java:373) 
> ~[apache-cassandra-3.11.9.jar:3.11.9]
> at 
> com.datastax.mcac.UnixSocketClient.writeMetric(UnixSocketClient.java:839) 
> [datastax-mcac-agent.jar:na]
> at 
> com.datastax.mcac.UnixSocketClient.access$700(UnixSocketClient.java:78) 
> [datastax-mcac-agent.jar:na]
> at 
> com.datastax.mcac.UnixSocketClient$2.lambda$onGaugeAdded$0(UnixSocketClient.java:626)
>  ~[datastax-mcac-agent.jar:na]
> at 
> com.datastax.mcac.UnixSocketClient.writeGroup(UnixSocketClient.java:819) 
> [datastax-mcac-agent.jar:na]
> at 
> com.datastax.mcac.UnixSocketClient.lambda$restartMetricReporting$2(UnixSocketClient.java:798)
>  [datastax-mcac-agent.jar:na]
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> ~[na:1.8.0_272]
> at 
> io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:126)
>  ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
>  ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
> at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:307) 
> ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
>  ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
>  ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
> at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_272]{noformat}
> Manually trying a histogram from the CLI:
> {noformat}
> $ nodetool tablehistograms logdata log_height_index
> error: 115
> -- StackTrace --
> java.lang.ArrayIndexOutOfBoundsException: 115
>   at 
> org.apache.cassandra.metrics.TableMetrics.combineHistograms(TableMetrics.java:261)
>   at 
> 

[jira] [Commented] (CASSANDRA-16259) tablehistograms cause ArrayIndexOutOfBoundsException

2020-11-13 Thread Benjamin Lerer (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231379#comment-17231379
 ] 

Benjamin Lerer commented on CASSANDRA-16259:


{quote}If I understand the change within CASSANDRA-15164 right, then the 
storage format of SSTable statistics has changed, which should also bump the 
SSTable version, shouldn't it?{quote}

The storage format did not change. The encoding was already 
{{*}}. The old code is able to read the 
statistics produced by the new one. The problem is only at the metric level, 
where C* tries to merge SSTable histograms that have a different number of 
buckets using some buggy code.

When you scrub the old SSTable, it is recreated with the new number of buckets, 
ensuring that you will not hit the TableMetric bug again.
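For readers who hit this: a minimal, hypothetical sketch of a merge that 
tolerates differing bucket counts, by sizing the result to the longest input 
rather than assuming equal lengths (the assumption that overflows in the buggy 
code). This is not the actual TableMetrics.combineHistograms implementation:

{code:java}
import java.util.Arrays;
import java.util.List;

public final class HistogramMergeSketch
{
    static long[] combine(List<long[]> histograms)
    {
        // Size the result to the widest histogram instead of the first one.
        int maxBuckets = histograms.stream().mapToInt(h -> h.length).max().orElse(0);
        long[] combined = new long[maxBuckets];
        for (long[] h : histograms)
            for (int i = 0; i < h.length; i++)  // bounded by each input's own length
                combined[i] += h[i];
        return combined;
    }

    public static void main(String[] args)
    {
        // Histograms with different bucket counts can now be summed safely,
        // shown here with tiny arrays of different lengths.
        long[] merged = combine(List.of(new long[]{1, 2, 3},
                                        new long[]{4, 5, 6, 7, 8}));
        System.out.println(Arrays.toString(merged)); // [5, 7, 9, 7, 8]
    }
}
{code}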

> tablehistograms cause ArrayIndexOutOfBoundsException
> 
>
> Key: CASSANDRA-16259
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16259
> Project: Cassandra
>  Issue Type: Bug
>  Components: Observability/Metrics
>Reporter: Justin Montgomery
>Assignee: Benjamin Lerer
>Priority: Normal
> Fix For: 2.2.x, 3.0.x, 3.11.x, 4.0-beta
>
>
> After upgrading some nodes in our cluster from 3.11.8 to 3.11.9 an error 
> appeared on the upgraded nodes when trying to access *tablehistograms*. The 
> same command run on our .8 nodes returns as expected; only the upgraded .9 
> nodes fail. Not all tables fail when queried, but about 90% of them do.
> We use Datastax MCAC, which appears to query histograms every 30 seconds; 
> this outputs to the system.log:
> {noformat}
> WARN  [insights-3-1] 2020-11-09 01:11:22,331 UnixSocketClient.java:830 - 
> Error reporting:
> java.lang.ArrayIndexOutOfBoundsException: 115
> at 
> org.apache.cassandra.metrics.TableMetrics.combineHistograms(TableMetrics.java:261)
>  ~[apache-cassandra-3.11.9.jar:3.11.9]
> at 
> org.apache.cassandra.metrics.TableMetrics.access$000(TableMetrics.java:48) 
> ~[apache-cassandra-3.11.9.jar:3.11.9]
> at 
> org.apache.cassandra.metrics.TableMetrics$11.getValue(TableMetrics.java:376) 
> ~[apache-cassandra-3.11.9.jar:3.11.9]
> at 
> org.apache.cassandra.metrics.TableMetrics$11.getValue(TableMetrics.java:373) 
> ~[apache-cassandra-3.11.9.jar:3.11.9]
> at 
> com.datastax.mcac.UnixSocketClient.writeMetric(UnixSocketClient.java:839) 
> [datastax-mcac-agent.jar:na]
> at 
> com.datastax.mcac.UnixSocketClient.access$700(UnixSocketClient.java:78) 
> [datastax-mcac-agent.jar:na]
> at 
> com.datastax.mcac.UnixSocketClient$2.lambda$onGaugeAdded$0(UnixSocketClient.java:626)
>  ~[datastax-mcac-agent.jar:na]
> at 
> com.datastax.mcac.UnixSocketClient.writeGroup(UnixSocketClient.java:819) 
> [datastax-mcac-agent.jar:na]
> at 
> com.datastax.mcac.UnixSocketClient.lambda$restartMetricReporting$2(UnixSocketClient.java:798)
>  [datastax-mcac-agent.jar:na]
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> ~[na:1.8.0_272]
> at 
> io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:126)
>  ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
>  ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
> at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:307) 
> ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
>  ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
>  ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
> at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_272]{noformat}
> Manually trying a histogram from the CLI:
> {noformat}
> $ nodetool tablehistograms logdata log_height_index
> error: 115
> -- StackTrace --
> java.lang.ArrayIndexOutOfBoundsException: 115
>   at 
> org.apache.cassandra.metrics.TableMetrics.combineHistograms(TableMetrics.java:261)
>   at 
> org.apache.cassandra.metrics.TableMetrics.access$000(TableMetrics.java:48)
>   at 
> org.apache.cassandra.metrics.TableMetrics$11.getValue(TableMetrics.java:376)
>   at 
> org.apache.cassandra.metrics.TableMetrics$11.getValue(TableMetrics.java:373)
>   at 
> org.apache.cassandra.metrics.CassandraMetricsRegistry$JmxGauge.getValue(CassandraMetricsRegistry.java:250)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at 

[jira] [Updated] (CASSANDRA-16274) Improve performance when calculating StreamTasks with optimised streaming

2020-11-13 Thread Marcus Eriksson (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcus Eriksson updated CASSANDRA-16274:

Test and Documentation Plan: new unit tests, jvm dtests
 Status: Patch Available  (was: Open)

patch: https://github.com/krummas/cassandra/commits/marcuse/3200opt
cci: 
https://app.circleci.com/pipelines/github/krummas/cassandra?branch=marcuse%2F3200opt

> Improve performance when calculating StreamTasks with optimised streaming
> -
>
> Key: CASSANDRA-16274
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16274
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Consistency/Repair
>Reporter: Marcus Eriksson
>Assignee: Marcus Eriksson
>Priority: Normal
> Fix For: 4.0-beta4
>
>
> The way stream tasks are calculated is currently quite inefficient; improve 
> that.
> Also, we currently try to distribute the streaming nodes evenly, which 
> creates many more sstables than necessary - instead we should try to stream 
> everything from a single peer, which should reduce the number of sstables 
> created on the out-of-sync node.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16274) Improve performance when calculating StreamTasks with optimised streaming

2020-11-13 Thread Marcus Eriksson (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcus Eriksson updated CASSANDRA-16274:

Change Category: Performance
 Complexity: Normal
Component/s: Consistency/Repair
  Fix Version/s: 4.0-beta4
 Status: Open  (was: Triage Needed)

> Improve performance when calculating StreamTasks with optimised streaming
> -
>
> Key: CASSANDRA-16274
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16274
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Consistency/Repair
>Reporter: Marcus Eriksson
>Assignee: Marcus Eriksson
>Priority: Normal
> Fix For: 4.0-beta4
>
>
> The way stream tasks are calculated is currently quite inefficient; improve 
> that.
> Also, we currently try to distribute the streaming nodes evenly, which 
> creates many more sstables than necessary - instead we should try to stream 
> everything from a single peer, which should reduce the number of sstables 
> created on the out-of-sync node.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (CASSANDRA-16274) Improve performance when calculating StreamTasks with optimised streaming

2020-11-13 Thread Marcus Eriksson (Jira)
Marcus Eriksson created CASSANDRA-16274:
---

 Summary: Improve performance when calculating StreamTasks with 
optimised streaming
 Key: CASSANDRA-16274
 URL: https://issues.apache.org/jira/browse/CASSANDRA-16274
 Project: Cassandra
  Issue Type: Improvement
Reporter: Marcus Eriksson
Assignee: Marcus Eriksson


The way stream tasks are calculated is currently quite inefficient; improve 
that.

Also, we currently try to distribute the streaming nodes evenly, which creates 
many more sstables than necessary - instead we should try to stream everything 
from a single peer, which should reduce the number of sstables created on the 
out-of-sync node.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16189) Add tests for the Hint service metrics

2020-11-13 Thread Benjamin Lerer (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231372#comment-17231372
 ] 

Benjamin Lerer commented on CASSANDRA-16189:


I should have time next week for the review. Thanks.

> Add tests for the Hint service metrics
> --
>
> Key: CASSANDRA-16189
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16189
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Test/dtest/python
>Reporter: Benjamin Lerer
>Assignee: Mohamed Zafraan
>Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: 0001-added-hints-metrics-test.patch
>
>
> There are currently no tests for the hint metrics



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16185) Add tests to cover CommitLog metrics

2020-11-13 Thread Yasar Arafath Baigh (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yasar Arafath Baigh updated CASSANDRA-16185:

Attachment: 0001-Unit-Test-cases-for-CommitLogMetrics.patch

> Add tests to cover CommitLog metrics
> 
>
> Key: CASSANDRA-16185
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16185
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Test/unit
>Reporter: Benjamin Lerer
>Assignee: Yasar Arafath Baigh
>Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: 0001-Unit-Test-cases-for-CommitLogMetrics.patch
>
>
> The only CommitLog metric that seems to be covered by a unit test is 
> {{oversizedMutations}}. We should add tests for the other ones.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16189) Add tests for the Hint service metrics

2020-11-13 Thread Mohamed Zafraan (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231279#comment-17231279
 ] 

Mohamed Zafraan commented on CASSANDRA-16189:
-

That's fine. Do let me know if there's anything to do on my side.

> Add tests for the Hint service metrics
> --
>
> Key: CASSANDRA-16189
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16189
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Test/dtest/python
>Reporter: Benjamin Lerer
>Assignee: Mohamed Zafraan
>Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: 0001-added-hints-metrics-test.patch
>
>
> There are currently no tests for the hint metrics



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16189) Add tests for the Hint service metrics

2020-11-13 Thread Benjamin Lerer (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231240#comment-17231240
 ] 

Benjamin Lerer commented on CASSANDRA-16189:


Sorry for the noise around the reviewer and status. I had to do some testing 
for INFRA-21091 and used this ticket for it.

> Add tests for the Hint service metrics
> --
>
> Key: CASSANDRA-16189
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16189
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Test/dtest/python
>Reporter: Benjamin Lerer
>Assignee: Mohamed Zafraan
>Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: 0001-added-hints-metrics-test.patch
>
>
> There are currently no tests for the hint metrics



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16189) Add tests for the Hint service metrics

2020-11-13 Thread Benjamin Lerer (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Lerer updated CASSANDRA-16189:
---
Reviewers: Benjamin Lerer  (was: Adam Holmberg)

> Add tests for the Hint service metrics
> --
>
> Key: CASSANDRA-16189
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16189
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Test/dtest/python
>Reporter: Benjamin Lerer
>Assignee: Mohamed Zafraan
>Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: 0001-added-hints-metrics-test.patch
>
>
> There are currently no tests for the hint metrics



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16189) Add tests for the Hint service metrics

2020-11-13 Thread Benjamin Lerer (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Lerer updated CASSANDRA-16189:
---
Reviewers: Adam Holmberg
   Status: Review In Progress  (was: Patch Available)

> Add tests for the Hint service metrics
> --
>
> Key: CASSANDRA-16189
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16189
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Test/dtest/python
>Reporter: Benjamin Lerer
>Assignee: Mohamed Zafraan
>Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: 0001-added-hints-metrics-test.patch
>
>
> There are currently no tests for the hint metrics



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16189) Add tests for the Hint service metrics

2020-11-13 Thread Benjamin Lerer (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Lerer updated CASSANDRA-16189:
---
Status: Patch Available  (was: Review In Progress)

> Add tests for the Hint service metrics
> --
>
> Key: CASSANDRA-16189
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16189
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Test/dtest/python
>Reporter: Benjamin Lerer
>Assignee: Mohamed Zafraan
>Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: 0001-added-hints-metrics-test.patch
>
>
> There are currently no tests for the hint metrics



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16259) tablehistograms cause ArrayIndexOutOfBoundsException

2020-11-13 Thread Tibor Repasi (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231238#comment-17231238
 ] 

Tibor Repasi commented on CASSANDRA-16259:
--

Well, that would explain why scrubbing the table fixed it.

If I understand the change within CASSANDRA-15164 right, then the storage 
format of SSTable statistics has changed, which should also bump the SSTable 
version, shouldn't it?

> tablehistograms cause ArrayIndexOutOfBoundsException
> 
>
> Key: CASSANDRA-16259
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16259
> Project: Cassandra
>  Issue Type: Bug
>  Components: Observability/Metrics
>Reporter: Justin Montgomery
>Assignee: Benjamin Lerer
>Priority: Normal
> Fix For: 2.2.x, 3.0.x, 3.11.x, 4.0-beta
>
>
> After upgrading some nodes in our cluster from 3.11.8 to 3.11.9 an error 
> appeared on the upgraded nodes when trying to access *tablehistograms*. The 
> same command run on our .8 nodes returns as expected; only the upgraded .9 
> nodes fail. Not all tables fail when queried, but about 90% of them do.
> We use Datastax MCAC, which appears to query histograms every 30 seconds; 
> this outputs to the system.log:
> {noformat}
> WARN  [insights-3-1] 2020-11-09 01:11:22,331 UnixSocketClient.java:830 - 
> Error reporting:
> java.lang.ArrayIndexOutOfBoundsException: 115
> at 
> org.apache.cassandra.metrics.TableMetrics.combineHistograms(TableMetrics.java:261)
>  ~[apache-cassandra-3.11.9.jar:3.11.9]
> at 
> org.apache.cassandra.metrics.TableMetrics.access$000(TableMetrics.java:48) 
> ~[apache-cassandra-3.11.9.jar:3.11.9]
> at 
> org.apache.cassandra.metrics.TableMetrics$11.getValue(TableMetrics.java:376) 
> ~[apache-cassandra-3.11.9.jar:3.11.9]
> at 
> org.apache.cassandra.metrics.TableMetrics$11.getValue(TableMetrics.java:373) 
> ~[apache-cassandra-3.11.9.jar:3.11.9]
> at 
> com.datastax.mcac.UnixSocketClient.writeMetric(UnixSocketClient.java:839) 
> [datastax-mcac-agent.jar:na]
> at 
> com.datastax.mcac.UnixSocketClient.access$700(UnixSocketClient.java:78) 
> [datastax-mcac-agent.jar:na]
> at 
> com.datastax.mcac.UnixSocketClient$2.lambda$onGaugeAdded$0(UnixSocketClient.java:626)
>  ~[datastax-mcac-agent.jar:na]
> at 
> com.datastax.mcac.UnixSocketClient.writeGroup(UnixSocketClient.java:819) 
> [datastax-mcac-agent.jar:na]
> at 
> com.datastax.mcac.UnixSocketClient.lambda$restartMetricReporting$2(UnixSocketClient.java:798)
>  [datastax-mcac-agent.jar:na]
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> ~[na:1.8.0_272]
> at 
> io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:126)
>  ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
>  ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
> at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:307) 
> ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
>  ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
>  ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
> at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_272]{noformat}
> Manually trying a histogram from the CLI:
> {noformat}
> $ nodetool tablehistograms logdata log_height_index
> error: 115
> -- StackTrace --
> java.lang.ArrayIndexOutOfBoundsException: 115
>   at 
> org.apache.cassandra.metrics.TableMetrics.combineHistograms(TableMetrics.java:261)
>   at 
> org.apache.cassandra.metrics.TableMetrics.access$000(TableMetrics.java:48)
>   at 
> org.apache.cassandra.metrics.TableMetrics$11.getValue(TableMetrics.java:376)
>   at 
> org.apache.cassandra.metrics.TableMetrics$11.getValue(TableMetrics.java:373)
>   at 
> org.apache.cassandra.metrics.CassandraMetricsRegistry$JmxGauge.getValue(CassandraMetricsRegistry.java:250)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:72)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at 

[jira] [Updated] (CASSANDRA-16189) Add tests for the Hint service metrics

2020-11-13 Thread Benjamin Lerer (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Lerer updated CASSANDRA-16189:
---
Reviewers:   (was: Benjamin Lerer)

> Add tests for the Hint service metrics
> --
>
> Key: CASSANDRA-16189
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16189
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Test/dtest/python
>Reporter: Benjamin Lerer
>Assignee: Mohamed Zafraan
>Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: 0001-added-hints-metrics-test.patch
>
>
> There are currently no tests for the hint metrics



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-15582) 4.0 quality testing: metrics

2020-11-13 Thread Benjamin Lerer (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Lerer updated CASSANDRA-15582:
---
Description: 
The goal of this ticket is to have proper testing of the different metrics 
exposed via JMX, and to ensure that metrics that are not in use in 4.0 have 
been properly deprecated.

The following table shows the current status of the metric tests and can be 
used to track the progress of this ticket:

|| Metrics || Status || test types || JIRA tickets ||
| Batch | {color:#00875A}*COVERED*{color} | unit tests | CASSANDRA-15718 | 
| BufferPool | {color:#00875A}*COVERED*{color} | unit tests | CASSANDRA-15773 |
| Cache | {color:#00875A}*COVERED*{color} | unit tests | CASSANDRA-15788 |
| Client | {color:#DE350B}*TESTS MISSING*{color}| unit tests | CASSANDRA-16216 |
| ClientRequest | {color:#00875A}*COVERED*{color} | in-jvm tests | 
CASSANDRA-16183 |
| ClientRequestSize | {color:#00875A}*COVERED*{color} | unit tests | 
CASSANDRA-16184 |
| Cache | {color:#00875A}*COVERED*{color} | unit tests | CASSANDRA-15788 |
| CommitLog | {color:#DE350B}*TESTS MISSING*{color} | unit tests | 
CASSANDRA-16185 |
| Compaction | {color:#DE350B}*TESTS MISSING*{color} | unit tests | 
CASSANDRA-16192 |
| CQL | {color:#00875A}*COVERED*{color}| unit tests | |
| HintService | {color:#DE350B}*NO TESTS*{color} | dtests | CASSANDRA-16189 |
| Messaging/Internode| {color:#DE350B}*NO TESTS*{color} | in-jvm dtests | 
CASSANDRA-16193 |
| ReadRepair| {color:#DE350B}*TESTS MISSING*{color} | dtests,in-jvm dtests | 
CASSANDRA-16187 | 
| Repair | {color:#DE350B}*NO TESTS*{color} | in-jvm dtests | CASSANDRA-16191 |
| Storage | {color:#00875A}*COVERED*{color}| unit tests | |
| Streaming | {color:#DE350B}*NO TESTS*{color} | dtests | CASSANDRA-16190 |
| Keyspace | {color:#DE350B}*TESTS MISSING*{color} | unit tests/in-jvm dtests | 
CASSANDRA-16188 |
| Table | {color:#DE350B}*TESTS MISSING*{color} | unit tests/in-jvm dtests | 
CASSANDRA-16188 |
| ThreadPoolMetrics |{color:#00875A}*COVERED*{color} | unit tests | 
CASSANDRA-16186 |


  was:
The goal of this ticket is to have proper testing of the different metrics 
exposed via JMX, and to ensure that metrics that are not in use in 4.0 have 
been properly deprecated.

The following table shows the current status of the metric tests and can be 
used to track the progress of this ticket:

|| Metrics || Status || test types || JIRA tickets ||
| Batch | {color:#00875A}*COVERED*{color} | unit tests | CASSANDRA-15718 | 
| BufferPool | {color:#00875A}*COVERED*{color} | unit tests | CASSANDRA-15773 |
| Cache | {color:#00875A}*COVERED*{color} | unit tests | CASSANDRA-15788 |
| Client | {color:#DE350B}*TESTS MISSING*{color}| unit tests | CASSANDRA-16216 |
| ClientRequest |{color:#DE350B}*NO TESTS*{color} | in-jvm tests | 
CASSANDRA-16183 |
| ClientRequestSize | {color:#00875A}*COVERED*{color} | unit tests | 
CASSANDRA-16184 |
| Cache | {color:#00875A}*COVERED*{color} | unit tests | CASSANDRA-15788 |
| CommitLog | {color:#DE350B}*TESTS MISSING*{color} | unit tests | 
CASSANDRA-16185 |
| Compaction | {color:#DE350B}*TESTS MISSING*{color} | unit tests | 
CASSANDRA-16192 |
| CQL | {color:#00875A}*COVERED*{color}| unit tests | |
| HintService | {color:#DE350B}*NO TESTS*{color} | dtests | CASSANDRA-16189 |
| Messaging/Internode| {color:#DE350B}*NO TESTS*{color} | in-jvm dtests | 
CASSANDRA-16193 |
| ReadRepair| {color:#DE350B}*TESTS MISSING*{color} | dtests,in-jvm dtests | 
CASSANDRA-16187 | 
| Repair | {color:#DE350B}*NO TESTS*{color} | in-jvm dtests | CASSANDRA-16191 |
| Storage | {color:#00875A}*COVERED*{color}| unit tests | |
| Streaming | {color:#DE350B}*NO TESTS*{color} | dtests | CASSANDRA-16190 |
| Keyspace | {color:#DE350B}*TESTS MISSING*{color} | unit tests/in-jvm dtests | 
CASSANDRA-16188 |
| Table | {color:#DE350B}*TESTS MISSING*{color} | unit tests/in-jvm dtests | 
CASSANDRA-16188 |
| ThreadPoolMetrics |{color:#00875A}*COVERED*{color} | unit tests | 
CASSANDRA-16186 |



> 4.0 quality testing: metrics
> 
>
> Key: CASSANDRA-15582
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15582
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/dtest/python
>Reporter: Josh McKenzie
>Assignee: Benjamin Lerer
>Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: Screen Shot 2020-04-07 at 5.47.17 PM.png
>
>
> The goal of this ticket is to have proper testing of the different metrics 
> exposed via JMX, and to ensure that metrics that are not in use in 4.0 have 
> been properly deprecated.
> The following table shows the current status of the metric tests and can be 
> used to track the progress of this ticket:
> || Metrics || Status || test types || JIRA tickets ||
> | Batch | {color:#00875A}*COVERED*{color} | unit tests |