[jira] [Commented] (CASSANDRA-16245) Implement repair quality test scenarios

2021-03-26 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17309294#comment-17309294
 ] 

Alexander Dejanovski commented on CASSANDRA-16245:
--

[~e.dimitrova], this work hasn't been formally reviewed.

There's some flakiness [in the CI 
runs|https://app.circleci.com/pipelines/github/riptano/cassandra-rtest?branch=trunk], 
caused by transient S3 download failures in Medusa. This is being addressed in 
Medusa itself and should be fixed shortly [with this 
PR|https://github.com/thelastpickle/cassandra-medusa/pull/295].

It would be good to have someone validate that the test scenarios were 
implemented according to the description so that this ticket can be closed.

Then we'll work on having this work integrated into Cassandra (or as a 
subproject) post-4.0.

> Implement repair quality test scenarios
> ---
>
> Key: CASSANDRA-16245
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16245
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/dtest/java
>Reporter: Alexander Dejanovski
>Assignee: Radovan Zvoncek
>Priority: Normal
> Fix For: 4.0-rc
>
>
> Implement the following test scenarios in a new test suite for repair 
> integration testing with significant load:
> Generate/restore a workload of ~100GB per node. Medusa should be considered 
> for creating the initial backup, which could then be restored from an S3 
> bucket to speed up node population.
>  Data should be generated so that it intentionally requires repair.
> Perform repairs on a 3-node cluster with 4 cores and 16GB-32GB RAM per node 
> (m5d.xlarge instances would be the most cost-efficient type).
>  Repaired keyspaces will use RF=3 or, in some cases, RF=2 (the latter for 
> subranges with different sets of replicas).
> ||Mode||Version||Settings||Checks||
> |Full repair|trunk|Sequential + All token ranges|"No anticompaction 
> (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range"|
> |Full repair|trunk|Parallel + Primary range|"No anticompaction (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range"|
> |Full repair|trunk|Force terminate repair shortly after it was 
> triggered|Repair threads must be cleaned up|
> |Subrange repair|trunk|Sequential + single token range|"No anticompaction 
> (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range"|
> |Subrange repair|trunk|Parallel + 10 token ranges which have the same 
> replicas|"No anticompaction (repairedAt == 0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range
> A single repair session will handle all subranges at once"|
> |Subrange repair|trunk|Parallel + 10 token ranges which have different 
> replicas|"No anticompaction (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range
> More than one repair session is triggered to process all subranges"|
> |Incremental repair|trunk|"Parallel (mandatory)
>  No compaction during repair"|"Anticompaction status (repairedAt != 0) on all 
> SSTables
>  No pending repair on SSTables after completion (could require to wait a bit 
> as this will happen asynchronously)
>  Out of sync ranges > 0 + Subsequent run must show no out of sync range"|
> |Incremental repair|trunk|"Parallel (mandatory)
>  Major compaction triggered during repair"|"Anticompaction status (repairedAt 
> != 0) on all SSTables
>  No pending repair on SSTables after completion (could require to wait a bit 
> as this will happen asynchronously)
>  Out of sync ranges > 0 + Subsequent run must show no out of sync range"|
> |Incremental repair|trunk|Force terminate repair shortly after it was 
> triggered.|Repair threads must be cleaned up|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16245) Implement repair quality test scenarios

2021-03-22 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305923#comment-17305923
 ] 

Alexander Dejanovski commented on CASSANDRA-16245:
--

Thanks [~cscotta]!

We're done with this ticket.
There was some instability in the CI runs lately due to CASSANDRA-16478, which 
was fixed by CASSANDRA-16480, and they've been passing nicely since then.

> Implement repair quality test scenarios
> ---
>
> Key: CASSANDRA-16245
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16245
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/dtest/java
>Reporter: Alexander Dejanovski
>Assignee: Radovan Zvoncek
>Priority: Normal
> Fix For: 4.0-rc
>
>
> Implement the following test scenarios in a new test suite for repair 
> integration testing with significant load:
> Generate/restore a workload of ~100GB per node. Medusa should be considered 
> for creating the initial backup, which could then be restored from an S3 
> bucket to speed up node population.
>  Data should be generated so that it intentionally requires repair.
> Perform repairs on a 3-node cluster with 4 cores and 16GB-32GB RAM per node 
> (m5d.xlarge instances would be the most cost-efficient type).
>  Repaired keyspaces will use RF=3 or, in some cases, RF=2 (the latter for 
> subranges with different sets of replicas).
> ||Mode||Version||Settings||Checks||
> |Full repair|trunk|Sequential + All token ranges|"No anticompaction 
> (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range"|
> |Full repair|trunk|Parallel + Primary range|"No anticompaction (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range"|
> |Full repair|trunk|Force terminate repair shortly after it was 
> triggered|Repair threads must be cleaned up|
> |Subrange repair|trunk|Sequential + single token range|"No anticompaction 
> (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range"|
> |Subrange repair|trunk|Parallel + 10 token ranges which have the same 
> replicas|"No anticompaction (repairedAt == 0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range
> A single repair session will handle all subranges at once"|
> |Subrange repair|trunk|Parallel + 10 token ranges which have different 
> replicas|"No anticompaction (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range
> More than one repair session is triggered to process all subranges"|
> |Incremental repair|trunk|"Parallel (mandatory)
>  No compaction during repair"|"Anticompaction status (repairedAt != 0) on all 
> SSTables
>  No pending repair on SSTables after completion (could require to wait a bit 
> as this will happen asynchronously)
>  Out of sync ranges > 0 + Subsequent run must show no out of sync range"|
> |Incremental repair|trunk|"Parallel (mandatory)
>  Major compaction triggered during repair"|"Anticompaction status (repairedAt 
> != 0) on all SSTables
>  No pending repair on SSTables after completion (could require to wait a bit 
> as this will happen asynchronously)
>  Out of sync ranges > 0 + Subsequent run must show no out of sync range"|
> |Incremental repair|trunk|Force terminate repair shortly after it was 
> triggered.|Repair threads must be cleaned up|






[jira] [Commented] (CASSANDRA-16480) cassandra-builds produce deb packages that require python 3.7

2021-03-05 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296005#comment-17296005
 ] 

Alexander Dejanovski commented on CASSANDRA-16480:
--

Thanks [~brandon.williams]!

I've tested your branch by pushing [this 
commit|https://github.com/riptano/cassandra-rtest/commit/533b346584512133c782b39731d47c54fa1bb496] 
to the previously failing 4.0 repair tests, and [they passed 
successfully|https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/118/workflows/0ccec847-e942-4486-ad38-750a825b2e7a].

LGTM (y)

> cassandra-builds produce deb packages that require python 3.7
> -
>
> Key: CASSANDRA-16480
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16480
> Project: Cassandra
>  Issue Type: Bug
>  Components: Packaging
>Reporter: Alexander Dejanovski
>Assignee: Brandon Williams
>Priority: Normal
> Fix For: 4.0-beta
>
>
> Since the builds moved from depending on python 2 to python 3, the packages 
> that are produced by the [cassandra-builds 
> project|https://github.com/apache/cassandra-builds] expect Python 3.7 to be 
> installed on the target systems:
> {noformat}
> $ sudo dpkg -i cassandra_4.0~beta5-20210303gitd29dd643df_all.deb
> (Reading database ... 117878 files and directories currently installed.)
> Preparing to unpack cassandra_4.0~beta5-20210303gitd29dd643df_all.deb ...
> Unpacking cassandra (4.0~beta5-20210303gitd29dd643df) over 
> (4.0~beta5-20210303git25f3cf84f7) ...
> dpkg: dependency problems prevent configuration of cassandra:
>  cassandra depends on python3 (>= 3.7~); however:
>   Version of python3 on system is 3.6.7-1~18.04.
> dpkg: error processing package cassandra (--install):
>  dependency problems - leaving unconfigured
> Processing triggers for systemd (237-3ubuntu10.38) ...
> Processing triggers for ureadahead (0.100.0-21) ...
> Errors were encountered while processing:
>  cassandra{noformat}
> The [test docker 
> images|https://github.com/apache/cassandra-builds/blob/trunk/docker/testing/ubuntu1910_j11.docker#L35-L36]
>  ship with both py36 and py38, which allows the install to pass nicely, but 
> on a vanilla Ubuntu Bionic system, only Python 3.6 is installed.
> We need to use debian buster images for builds that ship with python 3.6 so 
> that the dependencies align with it. 






[jira] [Commented] (CASSANDRA-16480) cassandra-builds produce deb packages that require python 3.7

2021-03-03 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17294718#comment-17294718
 ] 

Alexander Dejanovski commented on CASSANDRA-16480:
--

Dropping support for Python 3.6, while it's still supported and is the default 
in Bionic (which will itself be supported until 2023), doesn't seem like the 
right move. It would block folks from upgrading to 4.0 unless they upgrade 
their systems to Focal or install Python 3.7, which isn't trivial for everyone, 
especially with a large fleet and other software that relies on 3.6.

I'd vote for option 1 or 2.

> cassandra-builds produce deb packages that require python 3.7
> -
>
> Key: CASSANDRA-16480
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16480
> Project: Cassandra
>  Issue Type: Bug
>  Components: Packaging
>Reporter: Alexander Dejanovski
>Assignee: Brandon Williams
>Priority: Normal
> Fix For: 4.0-beta
>
>
> Since the builds moved from depending on python 2 to python 3, the packages 
> that are produced by the [cassandra-builds 
> project|https://github.com/apache/cassandra-builds] expect Python 3.7 to be 
> installed on the target systems:
> {noformat}
> $ sudo dpkg -i cassandra_4.0~beta5-20210303gitd29dd643df_all.deb
> (Reading database ... 117878 files and directories currently installed.)
> Preparing to unpack cassandra_4.0~beta5-20210303gitd29dd643df_all.deb ...
> Unpacking cassandra (4.0~beta5-20210303gitd29dd643df) over 
> (4.0~beta5-20210303git25f3cf84f7) ...
> dpkg: dependency problems prevent configuration of cassandra:
>  cassandra depends on python3 (>= 3.7~); however:
>   Version of python3 on system is 3.6.7-1~18.04.
> dpkg: error processing package cassandra (--install):
>  dependency problems - leaving unconfigured
> Processing triggers for systemd (237-3ubuntu10.38) ...
> Processing triggers for ureadahead (0.100.0-21) ...
> Errors were encountered while processing:
>  cassandra{noformat}
> The [test docker 
> images|https://github.com/apache/cassandra-builds/blob/trunk/docker/testing/ubuntu1910_j11.docker#L35-L36]
>  ship with both py36 and py38, which allows the install to pass nicely, but 
> on a vanilla Ubuntu Bionic system, only Python 3.6 is installed.
> We need to use debian buster images for builds that ship with python 3.6 so 
> that the dependencies align with it. 






[jira] [Created] (CASSANDRA-16480) cassandra-builds produce deb packages that require python 3.7

2021-03-03 Thread Alexander Dejanovski (Jira)
Alexander Dejanovski created CASSANDRA-16480:


 Summary: cassandra-builds produce deb packages that require python 
3.7
 Key: CASSANDRA-16480
 URL: https://issues.apache.org/jira/browse/CASSANDRA-16480
 Project: Cassandra
  Issue Type: Bug
Reporter: Alexander Dejanovski


Since the builds moved from depending on python 2 to python 3, the packages 
that are produced by the [cassandra-builds 
project|https://github.com/apache/cassandra-builds] expect Python 3.7 to be 
installed on the target systems:
{noformat}
$ sudo dpkg -i cassandra_4.0~beta5-20210303gitd29dd643df_all.deb
(Reading database ... 117878 files and directories currently installed.)
Preparing to unpack cassandra_4.0~beta5-20210303gitd29dd643df_all.deb ...
Unpacking cassandra (4.0~beta5-20210303gitd29dd643df) over 
(4.0~beta5-20210303git25f3cf84f7) ...
dpkg: dependency problems prevent configuration of cassandra:
 cassandra depends on python3 (>= 3.7~); however:
  Version of python3 on system is 3.6.7-1~18.04.
dpkg: error processing package cassandra (--install):
 dependency problems - leaving unconfigured
Processing triggers for systemd (237-3ubuntu10.38) ...
Processing triggers for ureadahead (0.100.0-21) ...
Errors were encountered while processing:
 cassandra{noformat}
The [test docker 
images|https://github.com/apache/cassandra-builds/blob/trunk/docker/testing/ubuntu1910_j11.docker#L35-L36]
 ship with both py36 and py38, which allows the install to pass nicely, but on 
a vanilla Ubuntu Bionic system, only Python 3.6 is installed.

We need to use debian buster images for builds that ship with python 3.6 so 
that the dependencies align with it. 






[jira] [Updated] (CASSANDRA-16478) Debian packages are broken since py3 migration

2021-03-03 Thread Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-16478:
-
Description: 
[Repair 
tests|https://app.circleci.com/pipelines/github/riptano/cassandra-rtest?branch=alex%2Fupgrade-tlp-cluster-python3]
 started to fail after the builds moved to Python3 in CASSANDRA-16396 due to 
deb packages failing to install on Ubuntu Bionic:
{noformat}
$ sudo dpkg -i cassandra_4.0~beta5-20210303git64f54f9fb0_all.deb
Selecting previously unselected package cassandra.
(Reading database ... 117650 files and directories currently installed.)
Preparing to unpack cassandra_4.0~beta5-20210303git64f54f9fb0_all.deb ...
Unpacking cassandra (4.0~beta5-20210303git64f54f9fb0) ...
dpkg: dependency problems prevent configuration of cassandra:
 cassandra depends on python (>= 3.6); however:
  Package python is not installed.
 cassandra depends on python3 (>= 3.7~); however:
  Version of python3 on system is 3.6.7-1~18.04.{noformat}
It seems like the following requirements are not correct:
{noformat}
Depends: openjdk-8-jre-headless | java8-runtime, adduser, python (>= 3.6), 
${misc:Depends}, ${python3:Depends}{noformat}
I've changed this line to the following and got the deb packages to install 
correctly:
{noformat}
Depends: openjdk-8-jre-headless | java8-runtime, adduser, python3 (>= 3.6), 
${misc:Depends}{noformat}

  was:
Repair tests started to fail after the builds moved to Python3 in 
CASSANDRA-16396 due to deb packages failing to install on Ubuntu Bionic:
{noformat}
$ sudo dpkg -i cassandra_4.0~beta5-20210303git64f54f9fb0_all.deb
Selecting previously unselected package cassandra.
(Reading database ... 117650 files and directories currently installed.)
Preparing to unpack cassandra_4.0~beta5-20210303git64f54f9fb0_all.deb ...
Unpacking cassandra (4.0~beta5-20210303git64f54f9fb0) ...
dpkg: dependency problems prevent configuration of cassandra:
 cassandra depends on python (>= 3.6); however:
  Package python is not installed.
 cassandra depends on python3 (>= 3.7~); however:
  Version of python3 on system is 3.6.7-1~18.04.{noformat}
It seems like the following requirements are not correct:
{noformat}
Depends: openjdk-8-jre-headless | java8-runtime, adduser, python (>= 3.6), 
${misc:Depends}, ${python3:Depends}{noformat}
I've changed this line to the following and got the deb packages to install 
correctly:
{noformat}
Depends: openjdk-8-jre-headless | java8-runtime, adduser, python3 (>= 3.6), 
${misc:Depends}{noformat}


> Debian packages are broken since py3 migration
> --
>
> Key: CASSANDRA-16478
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16478
> Project: Cassandra
>  Issue Type: Bug
>  Components: Packaging
>Reporter: Alexander Dejanovski
>Assignee: Alexander Dejanovski
>Priority: Normal
> Fix For: 4.0-beta
>
>
> [Repair 
> tests|https://app.circleci.com/pipelines/github/riptano/cassandra-rtest?branch=alex%2Fupgrade-tlp-cluster-python3]
>  started to fail after the builds moved to Python3 in CASSANDRA-16396 due to 
> deb packages failing to install on Ubuntu Bionic:
> {noformat}
> $ sudo dpkg -i cassandra_4.0~beta5-20210303git64f54f9fb0_all.deb
> Selecting previously unselected package cassandra.
> (Reading database ... 117650 files and directories currently installed.)
> Preparing to unpack cassandra_4.0~beta5-20210303git64f54f9fb0_all.deb ...
> Unpacking cassandra (4.0~beta5-20210303git64f54f9fb0) ...
> dpkg: dependency problems prevent configuration of cassandra:
>  cassandra depends on python (>= 3.6); however:
>   Package python is not installed.
>  cassandra depends on python3 (>= 3.7~); however:
>   Version of python3 on system is 3.6.7-1~18.04.{noformat}
> It seems like the following requirements are not correct:
> {noformat}
> Depends: openjdk-8-jre-headless | java8-runtime, adduser, python (>= 3.6), 
> ${misc:Depends}, ${python3:Depends}{noformat}
> I've changed this line to the following and got the deb packages to install 
> correctly:
> {noformat}
> Depends: openjdk-8-jre-headless | java8-runtime, adduser, python3 (>= 3.6), 
> ${misc:Depends}{noformat}






[jira] [Created] (CASSANDRA-16478) Debian packages are broken since py3 migration

2021-03-03 Thread Alexander Dejanovski (Jira)
Alexander Dejanovski created CASSANDRA-16478:


 Summary: Debian packages are broken since py3 migration
 Key: CASSANDRA-16478
 URL: https://issues.apache.org/jira/browse/CASSANDRA-16478
 Project: Cassandra
  Issue Type: Bug
  Components: Packaging
Reporter: Alexander Dejanovski
Assignee: Alexander Dejanovski


Repair tests started to fail after the builds moved to Python3 in 
CASSANDRA-16396 due to deb packages failing to install on Ubuntu Bionic:
{noformat}
$ sudo dpkg -i cassandra_4.0~beta5-20210303git64f54f9fb0_all.deb
Selecting previously unselected package cassandra.
(Reading database ... 117650 files and directories currently installed.)
Preparing to unpack cassandra_4.0~beta5-20210303git64f54f9fb0_all.deb ...
Unpacking cassandra (4.0~beta5-20210303git64f54f9fb0) ...
dpkg: dependency problems prevent configuration of cassandra:
 cassandra depends on python (>= 3.6); however:
  Package python is not installed.
 cassandra depends on python3 (>= 3.7~); however:
  Version of python3 on system is 3.6.7-1~18.04.{noformat}
It seems like the following requirements are not correct:
{noformat}
Depends: openjdk-8-jre-headless | java8-runtime, adduser, python (>= 3.6), 
${misc:Depends}, ${python3:Depends}{noformat}
I've changed this line to the following and got the deb packages to install 
correctly:
{noformat}
Depends: openjdk-8-jre-headless | java8-runtime, adduser, python3 (>= 3.6), 
${misc:Depends}{noformat}






[jira] [Updated] (CASSANDRA-16478) Debian packages are broken since py3 migration

2021-03-03 Thread Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-16478:
-
Fix Version/s: 4.0-beta

> Debian packages are broken since py3 migration
> --
>
> Key: CASSANDRA-16478
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16478
> Project: Cassandra
>  Issue Type: Bug
>  Components: Packaging
>Reporter: Alexander Dejanovski
>Assignee: Alexander Dejanovski
>Priority: Normal
> Fix For: 4.0-beta
>
>
> Repair tests started to fail after the builds moved to Python3 in 
> CASSANDRA-16396 due to deb packages failing to install on Ubuntu Bionic:
> {noformat}
> $ sudo dpkg -i cassandra_4.0~beta5-20210303git64f54f9fb0_all.deb
> Selecting previously unselected package cassandra.
> (Reading database ... 117650 files and directories currently installed.)
> Preparing to unpack cassandra_4.0~beta5-20210303git64f54f9fb0_all.deb ...
> Unpacking cassandra (4.0~beta5-20210303git64f54f9fb0) ...
> dpkg: dependency problems prevent configuration of cassandra:
>  cassandra depends on python (>= 3.6); however:
>   Package python is not installed.
>  cassandra depends on python3 (>= 3.7~); however:
>   Version of python3 on system is 3.6.7-1~18.04.{noformat}
> It seems like the following requirements are not correct:
> {noformat}
> Depends: openjdk-8-jre-headless | java8-runtime, adduser, python (>= 3.6), 
> ${misc:Depends}, ${python3:Depends}{noformat}
> I've changed this line to the following and got the deb packages to install 
> correctly:
> {noformat}
> Depends: openjdk-8-jre-headless | java8-runtime, adduser, python3 (>= 3.6), 
> ${misc:Depends}{noformat}






[jira] [Commented] (CASSANDRA-16244) Create a jvm upgrade dtest for mixed versions repairs

2021-02-03 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277996#comment-17277996
 ] 

Alexander Dejanovski commented on CASSANDRA-16244:
--

LGTM [~adelapena] (y)

Thanks!

> Create a jvm upgrade dtest for mixed versions repairs
> -
>
> Key: CASSANDRA-16244
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16244
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/dtest/java
>Reporter: Alexander Dejanovski
>Assignee: Andres de la Peña
>Priority: Normal
> Fix For: 4.0-rc
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Repair during upgrades should fail on mixed-version clusters.
> We'd need an in-jvm upgrade dtest to check that repair indeed fails as 
> expected on clusters mixing the current version with the previous major version.






[jira] [Commented] (CASSANDRA-16362) SSLFactory should initialize SSLContext before setting protocols

2021-01-29 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17274439#comment-17274439
 ] 

Alexander Dejanovski commented on CASSANDRA-16362:
--

Here's a [full green CI 
run|https://github.com/thelastpickle/cassandra-medusa/actions/runs/520478863] 
using the branch from this patch for 4.0.

I've tried with different TLS settings (PROTOCOL_TLSv1 and PROTOCOL_TLSv1_2) 
and it worked in both cases.

> SSLFactory should initialize SSLContext before setting protocols
> 
>
> Key: CASSANDRA-16362
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16362
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tool/bulk load
>Reporter: Erik Merkle
>Assignee: Jon Meredith
>Priority: Normal
> Fix For: 4.0-beta5
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Trying to use sstableloader from the latest trunk produced the following 
> Exception:
> {quote}
> Exception in thread "main" java.lang.RuntimeException: Could not create SSL 
> Context.
>   at 
> org.apache.cassandra.tools.BulkLoader.buildSSLOptions(BulkLoader.java:261)
>   at org.apache.cassandra.tools.BulkLoader.load(BulkLoader.java:64)
>   at org.apache.cassandra.tools.BulkLoader.main(BulkLoader.java:49)
> Caused by: java.io.IOException: Error creating/initializing the SSL Context
>   at 
> org.apache.cassandra.security.SSLFactory.createSSLContext(SSLFactory.java:184)
>   at 
> org.apache.cassandra.tools.BulkLoader.buildSSLOptions(BulkLoader.java:257)
>   ... 2 more
> Caused by: java.lang.IllegalStateException: SSLContext is not initialized
>   at 
> sun.security.ssl.SSLContextImpl.engineGetSocketFactory(SSLContextImpl.java:208)
>   at javax.net.ssl.SSLContextSpi.getDefaultSocket(SSLContextSpi.java:158)
>   at 
> javax.net.ssl.SSLContextSpi.engineGetDefaultSSLParameters(SSLContextSpi.java:184)
>   at javax.net.ssl.SSLContext.getDefaultSSLParameters(SSLContext.java:435)
>   at 
> org.apache.cassandra.security.SSLFactory.createSSLContext(SSLFactory.java:178)
>   ... 3 more
> {quote}
> I believe this is because of a change to SSLFactory for CASSANDRA-13325 here:
> [https://github.com/apache/cassandra/commit/919a8964a83511d96766c3e53ba603e77bca626c#diff-0d569398cfd58566fc56bfb80c971a72afe3f392addc2df731a0b44baf29019eR177-R178]
>  
> I think the solution is to call {{ctx.init()}} before trying to call 
> {{ctx.getDefaultSSLParameters()}}, essentially swapping the two lines in the 
> link above.
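For reference, here is a small standalone JDK example of the initialization-order issue described above. It is not the Cassandra SSLFactory code itself, just the underlying javax.net.ssl behaviour: querying the default SSL parameters before {{init()}} fails with the IllegalStateException shown in the stack trace, while initializing first works.

{noformat}
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLParameters;

public class SslInitOrder
{
    public static void main(String[] args) throws Exception
    {
        SSLContext ctx = SSLContext.getInstance("TLS");

        // Calling ctx.getDefaultSSLParameters() at this point would fail with
        // "IllegalStateException: SSLContext is not initialized".

        // Initialize first (nulls fall back to the default key/trust managers
        // and SecureRandom), then query the parameters to restrict protocols.
        ctx.init(null, null, null);
        SSLParameters params = ctx.getDefaultSSLParameters();
        params.setProtocols(new String[]{ "TLSv1.2" });
        System.out.println("Enabled protocols: " + String.join(",", params.getProtocols()));
    }
}
{noformat}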






[jira] [Commented] (CASSANDRA-16362) SSLFactory should initialize SSLContext before setting protocols

2021-01-28 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17274193#comment-17274193
 ] 

Alexander Dejanovski commented on CASSANDRA-16362:
--

Hey folks, sorry it took me a while to get to the bottom of it.

The issue we were having was due to the [storage port being changed in our 
integration 
tests|https://github.com/thelastpickle/cassandra-medusa/blob/master/tests/integration/features/steps/integration_steps.py#L151], 
which apparently was not making ccm happy with 4.0, as the nodes wouldn't 
recognize themselves as seeds.

I'm positive that [this worked in the 
past|https://github.com/thelastpickle/cassandra-medusa/runs/1449108534?check_suite_focus=true], 
so I have no clue why it suddenly started failing, nor why it would still pass 
locally on my laptop. There's definitely something fishy with the way some 
versions of ccm (I get lost on which branches support 4.0 and which don't) deal 
with changing the storage port and how that impacts the seed list.

The good news is that as soon as I removed the storage port change, the 
[tests went 
green|https://github.com/thelastpickle/cassandra-medusa/runs/1789721180?check_suite_focus=true] 
using the C16362 branch (/)

+1 for merge, and I'll set up CI properly again in Medusa to get tests running 
on trunk.

I'll try to investigate the CCM / custom storage port issue further.

> SSLFactory should initialize SSLContext before setting protocols
> 
>
> Key: CASSANDRA-16362
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16362
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tool/bulk load
>Reporter: Erik Merkle
>Assignee: Jon Meredith
>Priority: Normal
> Fix For: 4.0-beta5
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Trying to use sstableloader from the latest trunk produced the following 
> Exception:
> {quote}
> Exception in thread "main" java.lang.RuntimeException: Could not create SSL 
> Context.
>   at 
> org.apache.cassandra.tools.BulkLoader.buildSSLOptions(BulkLoader.java:261)
>   at org.apache.cassandra.tools.BulkLoader.load(BulkLoader.java:64)
>   at org.apache.cassandra.tools.BulkLoader.main(BulkLoader.java:49)
> Caused by: java.io.IOException: Error creating/initializing the SSL Context
>   at 
> org.apache.cassandra.security.SSLFactory.createSSLContext(SSLFactory.java:184)
>   at 
> org.apache.cassandra.tools.BulkLoader.buildSSLOptions(BulkLoader.java:257)
>   ... 2 more
> Caused by: java.lang.IllegalStateException: SSLContext is not initialized
>   at 
> sun.security.ssl.SSLContextImpl.engineGetSocketFactory(SSLContextImpl.java:208)
>   at javax.net.ssl.SSLContextSpi.getDefaultSocket(SSLContextSpi.java:158)
>   at 
> javax.net.ssl.SSLContextSpi.engineGetDefaultSSLParameters(SSLContextSpi.java:184)
>   at javax.net.ssl.SSLContext.getDefaultSSLParameters(SSLContext.java:435)
>   at 
> org.apache.cassandra.security.SSLFactory.createSSLContext(SSLFactory.java:178)
>   ... 3 more
> {quote}
> I believe this is because of a change to SSLFactory for CASSANDRA-13325 here:
> [https://github.com/apache/cassandra/commit/919a8964a83511d96766c3e53ba603e77bca626c#diff-0d569398cfd58566fc56bfb80c971a72afe3f392addc2df731a0b44baf29019eR177-R178]
>  
> I think the solution is to call {{ctx.init()}} before trying to call 
> {{ctx.getDefaultSSLParameters()}}, essentially swapping the two lines in the 
> link above.






[jira] [Commented] (CASSANDRA-16406) Debug logging affects repair performance

2021-01-28 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17273398#comment-17273398
 ] 

Alexander Dejanovski commented on CASSANDRA-16406:
--

Done, and I attached the updated patch to the ticket.

> Debug logging affects repair performance
> 
>
> Key: CASSANDRA-16406
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16406
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Alexander Dejanovski
>Assignee: Alexander Dejanovski
>Priority: Normal
> Fix For: 4.0-rc
>
> Attachments: 16406-2-trunk.txt, CASSANDRA-16406.png, 
> with_debug_logging.png, without_debug_logging.png
>
>
> While working on the repair quality testing in CASSANDRA-16245, it appeared 
> that the node coordinating repairs on a 20GB per node dataset was generating 
> more than 2G of log with a total duration for the incremental repair 
> scenarios of ~2h40m: 
> https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/37/workflows/6a7a41c8-0fca-4080-b37e-3b38998b3fab/jobs/49/steps
>  !with_debug_logging.png!  
>  The logs showed a lot of messages from the MerkleTree class at high pace:
> {noformat}
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) 
> Fully inconsistent range [# (-6738651767434294419,-6738622715859497972] depth=11>, # (-6738622715859497972,-6738593664284701525] depth=11>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (9) 
> Inconsistent digest on right sub-range # (-6738593664284701525,-6738535561135108630] depth=10>: [# -6738564612709905078 
> hash=[b8efd3d684474276f316b1bc9f23b463cda4f8d620a4b5cf4d2c7dbb101bbe85] 
> children=[# [fff431a30da07558c5106897da9f3fc3bcc598fb59e67c6959f4e70b6d3a7722]> 
> # [471be27589e7372e3606d92b45bc8ba07161602d7942c9a614d89ab07d21c9a7]>]>, 
> # hash=[95334327a0b50b7d3c51d6588a5f3d57701e0b978f6ab28e85cda3cb5a094eb5] 
> children=[# [0d2c5c21f6b04b098f71f43d0f3f3a5a823b372eb850261774847c456629bedd]> 
> # [981f1f0656054074b32022658560070df2253cb9373a9499f149df8e3c20f068]>]>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:262 - (10) 
> Hashing sub-ranges [# depth=11>, #] 
> for # divided 
> by midpoint -6738564612709905078
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:272 - (10) 
> Inconsistent digest on left sub-range # (-6738593664284701525,-6738564612709905078] depth=11>: [# [fff431a30da07558c5106897da9f3fc3bcc598fb59e67c6959f4e70b6d3a7722]>, 
> # [0d2c5c21f6b04b098f71f43d0f3f3a5a823b372eb850261774847c456629bedd]>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (10) 
> Inconsistent digest on right sub-range # (-6738564612709905078,-6738535561135108630] depth=11>: [# [471be27589e7372e3606d92b45bc8ba07161602d7942c9a614d89ab07d21c9a7]>, 
> # [981f1f0656054074b32022658560070df2253cb9373a9499f149df8e3c20f068]>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) 
> Fully inconsistent range [# (-6738593664284701525,-6738564612709905078] depth=11>, # (-6738564612709905078,-6738535561135108630] depth=11>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (9) 
> Fully inconsistent range [# (-6738651767434294419,-6738593664284701525] depth=10>, # (-6738593664284701525,-6738535561135108630] depth=10>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (8) 
> Inconsistent digest on right sub-range # (-6738535561135108630,-6738419354835922841] depth=9>: [# -6738477457985515736 
> hash=[806ede1a35783bf5fafe8b8ccefe4d3ff48e8f0e1314f8a9ce4b23f13fed4bf9] 
> children=[# hash=[e6d133afbd8041266f8a1cfe456ff07c9e7debe8ff54279b579b252f2af78b6b] 
> children=[#]> # hash=[66bfedb588f87ad3957497728b91bd436af364e6ec40df3299d006de151ac092] 
> children=[#]>]>, # hash=[bf128431ddf7ad72417ce853f90abd0bc7a2a38088d5fbec864c02c516ed6458] 
> children=[# hash=[12a71b62ceb096369c4797f56edd91a16dd9803348b4f4cfe1454027d7dae3d2] 
> children=[#]> # hash=[adb59f5313473b44dd3b7fa697d72c7b23b3c0610f23670942e2c137878a] 
> children=[#]>]>]{noformat}
> When disabling debug logging, the duration dropped to ~2h05m with decent log 
> sizes:
> [https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/38/workflows/c91e6ceb-438b-4f00-b38a-d670f9afb4c3/jobs/51]
> !without_debug_logging.png!
> There's apparently too much logging for each inconsistency found in the 
> Merkle tree comparisons and we should move this to TRACE level if we still 
> want to allow debug logging to be turned on by default.
> I'll prepare a patch for the MerkleTree class and run the repair testing 
> scenarios again to verify their duration.
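As a generic sketch of the kind of change proposed here (not the actual MerkleTree patch), hot-path diagnostics can be dropped to TRACE and guarded, so enabling plain debug logging no longer pays for them:

{noformat}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical class standing in for MerkleTree's tree-difference code.
class TreeDifferencer
{
    private static final Logger logger = LoggerFactory.getLogger(TreeDifferencer.class);

    void onInconsistentRange(Object left, Object right)
    {
        // Before: logger.debug("Fully inconsistent range [{}, {}]", left, right);
        // After: log at TRACE and guard it, so the (large) arguments are only
        // formatted when trace logging is explicitly enabled for this class.
        if (logger.isTraceEnabled())
            logger.trace("Fully inconsistent range [{}, {}]", left, right);
    }
}
{noformat}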





[jira] [Updated] (CASSANDRA-16406) Debug logging affects repair performance

2021-01-28 Thread Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-16406:
-
Attachment: 16406-2-trunk.txt

> Debug logging affects repair performance
> 
>
> Key: CASSANDRA-16406
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16406
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Alexander Dejanovski
>Assignee: Alexander Dejanovski
>Priority: Normal
> Fix For: 4.0-rc
>
> Attachments: 16406-2-trunk.txt, CASSANDRA-16406.png, 
> with_debug_logging.png, without_debug_logging.png
>
>
> While working on the repair quality testing in CASSANDRA-16245, it appeared 
> that the node coordinating repairs on a 20GB per node dataset was generating 
> more than 2G of log with a total duration for the incremental repair 
> scenarios of ~2h40m: 
> https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/37/workflows/6a7a41c8-0fca-4080-b37e-3b38998b3fab/jobs/49/steps
>  !with_debug_logging.png!  
>  The logs showed a lot of messages from the MerkleTree class at high pace:
> {noformat}
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) 
> Fully inconsistent range [# (-6738651767434294419,-6738622715859497972] depth=11>, # (-6738622715859497972,-6738593664284701525] depth=11>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (9) 
> Inconsistent digest on right sub-range # (-6738593664284701525,-6738535561135108630] depth=10>: [# -6738564612709905078 
> hash=[b8efd3d684474276f316b1bc9f23b463cda4f8d620a4b5cf4d2c7dbb101bbe85] 
> children=[# [fff431a30da07558c5106897da9f3fc3bcc598fb59e67c6959f4e70b6d3a7722]> 
> # [471be27589e7372e3606d92b45bc8ba07161602d7942c9a614d89ab07d21c9a7]>]>, 
> # hash=[95334327a0b50b7d3c51d6588a5f3d57701e0b978f6ab28e85cda3cb5a094eb5] 
> children=[# [0d2c5c21f6b04b098f71f43d0f3f3a5a823b372eb850261774847c456629bedd]> 
> # [981f1f0656054074b32022658560070df2253cb9373a9499f149df8e3c20f068]>]>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:262 - (10) 
> Hashing sub-ranges [# depth=11>, #] 
> for # divided 
> by midpoint -6738564612709905078
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:272 - (10) 
> Inconsistent digest on left sub-range # (-6738593664284701525,-6738564612709905078] depth=11>: [# [fff431a30da07558c5106897da9f3fc3bcc598fb59e67c6959f4e70b6d3a7722]>, 
> # [0d2c5c21f6b04b098f71f43d0f3f3a5a823b372eb850261774847c456629bedd]>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (10) 
> Inconsistent digest on right sub-range # (-6738564612709905078,-6738535561135108630] depth=11>: [# [471be27589e7372e3606d92b45bc8ba07161602d7942c9a614d89ab07d21c9a7]>, 
> # [981f1f0656054074b32022658560070df2253cb9373a9499f149df8e3c20f068]>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) 
> Fully inconsistent range [# (-6738593664284701525,-6738564612709905078] depth=11>, # (-6738564612709905078,-6738535561135108630] depth=11>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (9) 
> Fully inconsistent range [# (-6738651767434294419,-6738593664284701525] depth=10>, # (-6738593664284701525,-6738535561135108630] depth=10>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (8) 
> Inconsistent digest on right sub-range # (-6738535561135108630,-6738419354835922841] depth=9>: [# -6738477457985515736 
> hash=[806ede1a35783bf5fafe8b8ccefe4d3ff48e8f0e1314f8a9ce4b23f13fed4bf9] 
> children=[# hash=[e6d133afbd8041266f8a1cfe456ff07c9e7debe8ff54279b579b252f2af78b6b] 
> children=[#]> # hash=[66bfedb588f87ad3957497728b91bd436af364e6ec40df3299d006de151ac092] 
> children=[#]>]>, # hash=[bf128431ddf7ad72417ce853f90abd0bc7a2a38088d5fbec864c02c516ed6458] 
> children=[# hash=[12a71b62ceb096369c4797f56edd91a16dd9803348b4f4cfe1454027d7dae3d2] 
> children=[#]> # hash=[adb59f5313473b44dd3b7fa697d72c7b23b3c0610f23670942e2c137878a] 
> children=[#]>]>]{noformat}
> When disabling debug logging, the duration dropped to ~2h05m with decent log 
> sizes:
> [https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/38/workflows/c91e6ceb-438b-4f00-b38a-d670f9afb4c3/jobs/51]
> !without_debug_logging.png!
> There's apparently too much logging for each inconsistency found in the 
> Merkle tree comparisons and we should move this to TRACE level if we still 
> want to allow debug logging to be turned on by default.
> I'll prepare a patch for the MerkleTree class and run the repair testing 
> scenarios again to verify their duration.




[jira] [Updated] (CASSANDRA-16406) Debug logging affects repair performance

2021-01-28 Thread Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-16406:
-
Attachment: (was: 16406-trunk.txt)

> Debug logging affects repair performance
> 
>
> Key: CASSANDRA-16406
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16406
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Alexander Dejanovski
>Assignee: Alexander Dejanovski
>Priority: Normal
> Fix For: 4.0-rc
>
> Attachments: CASSANDRA-16406.png, with_debug_logging.png, 
> without_debug_logging.png
>
>
> While working on the repair quality testing in CASSANDRA-16245, it appeared 
> that the node coordinating repairs on a 20GB per node dataset was generating 
> more than 2G of log with a total duration for the incremental repair 
> scenarios of ~2h40m: 
> https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/37/workflows/6a7a41c8-0fca-4080-b37e-3b38998b3fab/jobs/49/steps
>  !with_debug_logging.png!  
>  The logs showed a lot of messages from the MerkleTree class at high pace:
> {noformat}
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) 
> Fully inconsistent range [# (-6738651767434294419,-6738622715859497972] depth=11>, # (-6738622715859497972,-6738593664284701525] depth=11>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (9) 
> Inconsistent digest on right sub-range # (-6738593664284701525,-6738535561135108630] depth=10>: [# -6738564612709905078 
> hash=[b8efd3d684474276f316b1bc9f23b463cda4f8d620a4b5cf4d2c7dbb101bbe85] 
> children=[# [fff431a30da07558c5106897da9f3fc3bcc598fb59e67c6959f4e70b6d3a7722]> 
> # [471be27589e7372e3606d92b45bc8ba07161602d7942c9a614d89ab07d21c9a7]>]>, 
> # hash=[95334327a0b50b7d3c51d6588a5f3d57701e0b978f6ab28e85cda3cb5a094eb5] 
> children=[# [0d2c5c21f6b04b098f71f43d0f3f3a5a823b372eb850261774847c456629bedd]> 
> # [981f1f0656054074b32022658560070df2253cb9373a9499f149df8e3c20f068]>]>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:262 - (10) 
> Hashing sub-ranges [# depth=11>, #] 
> for # divided 
> by midpoint -6738564612709905078
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:272 - (10) 
> Inconsistent digest on left sub-range # (-6738593664284701525,-6738564612709905078] depth=11>: [# [fff431a30da07558c5106897da9f3fc3bcc598fb59e67c6959f4e70b6d3a7722]>, 
> # [0d2c5c21f6b04b098f71f43d0f3f3a5a823b372eb850261774847c456629bedd]>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (10) 
> Inconsistent digest on right sub-range # (-6738564612709905078,-6738535561135108630] depth=11>: [# [471be27589e7372e3606d92b45bc8ba07161602d7942c9a614d89ab07d21c9a7]>, 
> # [981f1f0656054074b32022658560070df2253cb9373a9499f149df8e3c20f068]>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) 
> Fully inconsistent range [# (-6738593664284701525,-6738564612709905078] depth=11>, # (-6738564612709905078,-6738535561135108630] depth=11>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (9) 
> Fully inconsistent range [# (-6738651767434294419,-6738593664284701525] depth=10>, # (-6738593664284701525,-6738535561135108630] depth=10>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (8) 
> Inconsistent digest on right sub-range # (-6738535561135108630,-6738419354835922841] depth=9>: [# -6738477457985515736 
> hash=[806ede1a35783bf5fafe8b8ccefe4d3ff48e8f0e1314f8a9ce4b23f13fed4bf9] 
> children=[# hash=[e6d133afbd8041266f8a1cfe456ff07c9e7debe8ff54279b579b252f2af78b6b] 
> children=[#]> # hash=[66bfedb588f87ad3957497728b91bd436af364e6ec40df3299d006de151ac092] 
> children=[#]>]>, # hash=[bf128431ddf7ad72417ce853f90abd0bc7a2a38088d5fbec864c02c516ed6458] 
> children=[# hash=[12a71b62ceb096369c4797f56edd91a16dd9803348b4f4cfe1454027d7dae3d2] 
> children=[#]> # hash=[adb59f5313473b44dd3b7fa697d72c7b23b3c0610f23670942e2c137878a] 
> children=[#]>]>]{noformat}
> When disabling debug logging, the duration dropped to ~2h05m with decent log 
> sizes:
> [https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/38/workflows/c91e6ceb-438b-4f00-b38a-d670f9afb4c3/jobs/51]
> !without_debug_logging.png!
> There's apparently too much logging for each inconsistency found in the 
> Merkle tree comparisons and we should move this to TRACE level if we still 
> want to allow debug logging to be turned on by default.
> I'll prepare a patch for the MerkleTree class and run the repair testing 
> scenarios again to verify their duration.




[jira] [Updated] (CASSANDRA-16406) Debug logging affects repair performance

2021-01-27 Thread Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-16406:
-
Test and Documentation Plan: 
Here's the CircleCI link for the patched run of the repair quality tests: 
[https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/73/workflows/0828fcad-26d2-43d6-9c8c-7a4102b0e31c]

Full and Incremental test runs lasted 30 minutes less than on current trunk.
 Status: Patch Available  (was: In Progress)

> Debug logging affects repair performance
> 
>
> Key: CASSANDRA-16406
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16406
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Alexander Dejanovski
>Assignee: Alexander Dejanovski
>Priority: Normal
> Fix For: 4.0-rc
>
> Attachments: 16406-trunk.txt, CASSANDRA-16406.png, 
> with_debug_logging.png, without_debug_logging.png
>
>
> While working on the repair quality testing in CASSANDRA-16245, it appeared 
> that the node coordinating repairs on a 20GB per node dataset was generating 
> more than 2G of log with a total duration for the incremental repair 
> scenarios of ~2h40m: 
> https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/37/workflows/6a7a41c8-0fca-4080-b37e-3b38998b3fab/jobs/49/steps
>  !with_debug_logging.png!  
>  The logs showed a lot of messages from the MerkleTree class at high pace:
> {noformat}
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) 
> Fully inconsistent range [# (-6738651767434294419,-6738622715859497972] depth=11>, # (-6738622715859497972,-6738593664284701525] depth=11>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (9) 
> Inconsistent digest on right sub-range # (-6738593664284701525,-6738535561135108630] depth=10>: [# -6738564612709905078 
> hash=[b8efd3d684474276f316b1bc9f23b463cda4f8d620a4b5cf4d2c7dbb101bbe85] 
> children=[# [fff431a30da07558c5106897da9f3fc3bcc598fb59e67c6959f4e70b6d3a7722]> 
> # [471be27589e7372e3606d92b45bc8ba07161602d7942c9a614d89ab07d21c9a7]>]>, 
> # hash=[95334327a0b50b7d3c51d6588a5f3d57701e0b978f6ab28e85cda3cb5a094eb5] 
> children=[# [0d2c5c21f6b04b098f71f43d0f3f3a5a823b372eb850261774847c456629bedd]> 
> # [981f1f0656054074b32022658560070df2253cb9373a9499f149df8e3c20f068]>]>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:262 - (10) 
> Hashing sub-ranges [# depth=11>, #] 
> for # divided 
> by midpoint -6738564612709905078
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:272 - (10) 
> Inconsistent digest on left sub-range # (-6738593664284701525,-6738564612709905078] depth=11>: [# [fff431a30da07558c5106897da9f3fc3bcc598fb59e67c6959f4e70b6d3a7722]>, 
> # [0d2c5c21f6b04b098f71f43d0f3f3a5a823b372eb850261774847c456629bedd]>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (10) 
> Inconsistent digest on right sub-range # (-6738564612709905078,-6738535561135108630] depth=11>: [# [471be27589e7372e3606d92b45bc8ba07161602d7942c9a614d89ab07d21c9a7]>, 
> # [981f1f0656054074b32022658560070df2253cb9373a9499f149df8e3c20f068]>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) 
> Fully inconsistent range [# (-6738593664284701525,-6738564612709905078] depth=11>, # (-6738564612709905078,-6738535561135108630] depth=11>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (9) 
> Fully inconsistent range [# (-6738651767434294419,-6738593664284701525] depth=10>, # (-6738593664284701525,-6738535561135108630] depth=10>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (8) 
> Inconsistent digest on right sub-range # (-6738535561135108630,-6738419354835922841] depth=9>: [# -6738477457985515736 
> hash=[806ede1a35783bf5fafe8b8ccefe4d3ff48e8f0e1314f8a9ce4b23f13fed4bf9] 
> children=[# hash=[e6d133afbd8041266f8a1cfe456ff07c9e7debe8ff54279b579b252f2af78b6b] 
> children=[#]> # hash=[66bfedb588f87ad3957497728b91bd436af364e6ec40df3299d006de151ac092] 
> children=[#]>]>, # hash=[bf128431ddf7ad72417ce853f90abd0bc7a2a38088d5fbec864c02c516ed6458] 
> children=[# hash=[12a71b62ceb096369c4797f56edd91a16dd9803348b4f4cfe1454027d7dae3d2] 
> children=[#]> # hash=[adb59f5313473b44dd3b7fa697d72c7b23b3c0610f23670942e2c137878a] 
> children=[#]>]>]{noformat}
> When disabling debug logging, the duration dropped to ~2h05m with decent log 
> sizes:
> [https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/38/workflows/c91e6ceb-438b-4f00-b38a-d670f9afb4c3/jobs/51]
> !without_debug_logging.png!
> There's apparently too much logging for each inconsistency found in the 
> Merkle tree comparisons and we should move this to TRACE level if we still 
> want to allow 

[jira] [Commented] (CASSANDRA-16406) Debug logging affects repair performance

2021-01-27 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17273386#comment-17273386
 ] 

Alexander Dejanovski commented on CASSANDRA-16406:
--

I ran the repair quality test using the [patched 
branch|https://github.com/apache/cassandra/compare/trunk...adejanovski:CASSANDRA-16406?expand=1] 
and got the expected 30-minute reduction on the full and incremental test 
suites:

!CASSANDRA-16406.png!

 

[PR|https://github.com/apache/cassandra/pull/881]
 [Branch|https://github.com/adejanovski/cassandra/tree/CASSANDRA-16406]

I've attached the patch.

> Debug logging affects repair performance
> 
>
> Key: CASSANDRA-16406
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16406
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Alexander Dejanovski
>Assignee: Alexander Dejanovski
>Priority: Normal
> Fix For: 4.0-rc
>
> Attachments: 16406-trunk.txt, CASSANDRA-16406.png, 
> with_debug_logging.png, without_debug_logging.png
>
>
> While working on the repair quality testing in CASSANDRA-16245, it appeared 
> that the node coordinating repairs on a 20GB per node dataset was generating 
> more than 2G of log with a total duration for the incremental repair 
> scenarios of ~2h40m: 
> https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/37/workflows/6a7a41c8-0fca-4080-b37e-3b38998b3fab/jobs/49/steps
>  !with_debug_logging.png!  
>  The logs showed a lot of messages from the MerkleTree class at high pace:
> {noformat}
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) 
> Fully inconsistent range [# (-6738651767434294419,-6738622715859497972] depth=11>, # (-6738622715859497972,-6738593664284701525] depth=11>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (9) 
> Inconsistent digest on right sub-range # (-6738593664284701525,-6738535561135108630] depth=10>: [# -6738564612709905078 
> hash=[b8efd3d684474276f316b1bc9f23b463cda4f8d620a4b5cf4d2c7dbb101bbe85] 
> children=[# [fff431a30da07558c5106897da9f3fc3bcc598fb59e67c6959f4e70b6d3a7722]> 
> # [471be27589e7372e3606d92b45bc8ba07161602d7942c9a614d89ab07d21c9a7]>]>, 
> # hash=[95334327a0b50b7d3c51d6588a5f3d57701e0b978f6ab28e85cda3cb5a094eb5] 
> children=[# [0d2c5c21f6b04b098f71f43d0f3f3a5a823b372eb850261774847c456629bedd]> 
> # [981f1f0656054074b32022658560070df2253cb9373a9499f149df8e3c20f068]>]>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:262 - (10) 
> Hashing sub-ranges [# depth=11>, #] 
> for # divided 
> by midpoint -6738564612709905078
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:272 - (10) 
> Inconsistent digest on left sub-range # (-6738593664284701525,-6738564612709905078] depth=11>: [# [fff431a30da07558c5106897da9f3fc3bcc598fb59e67c6959f4e70b6d3a7722]>, 
> # [0d2c5c21f6b04b098f71f43d0f3f3a5a823b372eb850261774847c456629bedd]>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (10) 
> Inconsistent digest on right sub-range # (-6738564612709905078,-6738535561135108630] depth=11>: [# [471be27589e7372e3606d92b45bc8ba07161602d7942c9a614d89ab07d21c9a7]>, 
> # [981f1f0656054074b32022658560070df2253cb9373a9499f149df8e3c20f068]>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) 
> Fully inconsistent range [# (-6738593664284701525,-6738564612709905078] depth=11>, # (-6738564612709905078,-6738535561135108630] depth=11>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (9) 
> Fully inconsistent range [# (-6738651767434294419,-6738593664284701525] depth=10>, # (-6738593664284701525,-6738535561135108630] depth=10>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (8) 
> Inconsistent digest on right sub-range # (-6738535561135108630,-6738419354835922841] depth=9>: [# -6738477457985515736 
> hash=[806ede1a35783bf5fafe8b8ccefe4d3ff48e8f0e1314f8a9ce4b23f13fed4bf9] 
> children=[# hash=[e6d133afbd8041266f8a1cfe456ff07c9e7debe8ff54279b579b252f2af78b6b] 
> children=[#]> # hash=[66bfedb588f87ad3957497728b91bd436af364e6ec40df3299d006de151ac092] 
> children=[#]>]>, # hash=[bf128431ddf7ad72417ce853f90abd0bc7a2a38088d5fbec864c02c516ed6458] 
> children=[# hash=[12a71b62ceb096369c4797f56edd91a16dd9803348b4f4cfe1454027d7dae3d2] 
> children=[#]> # hash=[adb59f5313473b44dd3b7fa697d72c7b23b3c0610f23670942e2c137878a] 
> children=[#]>]>]{noformat}
> When disabling debug logging, the duration dropped to ~2h05m with decent log 
> sizes:
> [https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/38/workflows/c91e6ceb-438b-4f00-b38a-d670f9afb4c3/jobs/51]
> !without_debug_logging.png!
> There's apparently too much logging for each inconsistency found in the 
> Merkle tree 

[jira] [Updated] (CASSANDRA-16406) Debug logging affects repair performance

2021-01-27 Thread Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-16406:
-
Attachment: 16406-trunk.txt

> Debug logging affects repair performance
> 
>
> Key: CASSANDRA-16406
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16406
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Alexander Dejanovski
>Assignee: Alexander Dejanovski
>Priority: Normal
> Fix For: 4.0-rc
>
> Attachments: 16406-trunk.txt, CASSANDRA-16406.png, 
> with_debug_logging.png, without_debug_logging.png
>
>
> While working on the repair quality testing in CASSANDRA-16245, it appeared 
> that the node coordinating repairs on a 20GB per node dataset was generating 
> more than 2G of log with a total duration for the incremental repair 
> scenarios of ~2h40m: 
> https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/37/workflows/6a7a41c8-0fca-4080-b37e-3b38998b3fab/jobs/49/steps
>  !with_debug_logging.png!  
>  The logs showed a lot of messages from the MerkleTree class at high pace:
> {noformat}
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) 
> Fully inconsistent range [# (-6738651767434294419,-6738622715859497972] depth=11>, # (-6738622715859497972,-6738593664284701525] depth=11>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (9) 
> Inconsistent digest on right sub-range # (-6738593664284701525,-6738535561135108630] depth=10>: [# -6738564612709905078 
> hash=[b8efd3d684474276f316b1bc9f23b463cda4f8d620a4b5cf4d2c7dbb101bbe85] 
> children=[# [fff431a30da07558c5106897da9f3fc3bcc598fb59e67c6959f4e70b6d3a7722]> 
> # [471be27589e7372e3606d92b45bc8ba07161602d7942c9a614d89ab07d21c9a7]>]>, 
> # hash=[95334327a0b50b7d3c51d6588a5f3d57701e0b978f6ab28e85cda3cb5a094eb5] 
> children=[# [0d2c5c21f6b04b098f71f43d0f3f3a5a823b372eb850261774847c456629bedd]> 
> # [981f1f0656054074b32022658560070df2253cb9373a9499f149df8e3c20f068]>]>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:262 - (10) 
> Hashing sub-ranges [# depth=11>, #] 
> for # divided 
> by midpoint -6738564612709905078
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:272 - (10) 
> Inconsistent digest on left sub-range # (-6738593664284701525,-6738564612709905078] depth=11>: [# [fff431a30da07558c5106897da9f3fc3bcc598fb59e67c6959f4e70b6d3a7722]>, 
> # [0d2c5c21f6b04b098f71f43d0f3f3a5a823b372eb850261774847c456629bedd]>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (10) 
> Inconsistent digest on right sub-range # (-6738564612709905078,-6738535561135108630] depth=11>: [# [471be27589e7372e3606d92b45bc8ba07161602d7942c9a614d89ab07d21c9a7]>, 
> # [981f1f0656054074b32022658560070df2253cb9373a9499f149df8e3c20f068]>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) 
> Fully inconsistent range [# (-6738593664284701525,-6738564612709905078] depth=11>, # (-6738564612709905078,-6738535561135108630] depth=11>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (9) 
> Fully inconsistent range [# (-6738651767434294419,-6738593664284701525] depth=10>, # (-6738593664284701525,-6738535561135108630] depth=10>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (8) 
> Inconsistent digest on right sub-range # (-6738535561135108630,-6738419354835922841] depth=9>: [# -6738477457985515736 
> hash=[806ede1a35783bf5fafe8b8ccefe4d3ff48e8f0e1314f8a9ce4b23f13fed4bf9] 
> children=[# hash=[e6d133afbd8041266f8a1cfe456ff07c9e7debe8ff54279b579b252f2af78b6b] 
> children=[#]> # hash=[66bfedb588f87ad3957497728b91bd436af364e6ec40df3299d006de151ac092] 
> children=[#]>]>, # hash=[bf128431ddf7ad72417ce853f90abd0bc7a2a38088d5fbec864c02c516ed6458] 
> children=[# hash=[12a71b62ceb096369c4797f56edd91a16dd9803348b4f4cfe1454027d7dae3d2] 
> children=[#]> # hash=[adb59f5313473b44dd3b7fa697d72c7b23b3c0610f23670942e2c137878a] 
> children=[#]>]>]{noformat}
> When disabling debug logging, the duration dropped to ~2h05m with decent log 
> sizes:
> [https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/38/workflows/c91e6ceb-438b-4f00-b38a-d670f9afb4c3/jobs/51]
> !without_debug_logging.png!
> There's apparently too much logging for each inconsistency found in the 
> Merkle tree comparisons and we should move this to TRACE level if we still 
> want to allow debug logging to be turned on by default.
> I'll prepare a patch for the MerkleTree class and run the repair testing 
> scenarios again to verify their duration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

[jira] [Updated] (CASSANDRA-16406) Debug logging affects repair performance

2021-01-27 Thread Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-16406:
-
Attachment: CASSANDRA-16406.png

> Debug logging affects repair performance
> 
>
> Key: CASSANDRA-16406
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16406
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Alexander Dejanovski
>Assignee: Alexander Dejanovski
>Priority: Normal
> Fix For: 4.0-rc
>
> Attachments: CASSANDRA-16406.png, with_debug_logging.png, 
> without_debug_logging.png
>
>
> While working on the repair quality testing in CASSANDRA-16245, it appeared 
> that the node coordinating repairs on a 20GB per node dataset was generating 
> more than 2G of log with a total duration for the incremental repair 
> scenarios of ~2h40m: 
> [https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/37/workflows/6a7a41c8-0fca-4080-b37e-3b38998b3fab/jobs/49/steps]
>  !with_debug_logging.png!  
>  The logs showed a lot of messages from the MerkleTree class at high pace:
> {noformat}
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) 
> Fully inconsistent range [# (-6738651767434294419,-6738622715859497972] depth=11>, # (-6738622715859497972,-6738593664284701525] depth=11>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (9) 
> Inconsistent digest on right sub-range # (-6738593664284701525,-6738535561135108630] depth=10>: [# -6738564612709905078 
> hash=[b8efd3d684474276f316b1bc9f23b463cda4f8d620a4b5cf4d2c7dbb101bbe85] 
> children=[# [fff431a30da07558c5106897da9f3fc3bcc598fb59e67c6959f4e70b6d3a7722]> 
> # [471be27589e7372e3606d92b45bc8ba07161602d7942c9a614d89ab07d21c9a7]>]>, 
> # hash=[95334327a0b50b7d3c51d6588a5f3d57701e0b978f6ab28e85cda3cb5a094eb5] 
> children=[# [0d2c5c21f6b04b098f71f43d0f3f3a5a823b372eb850261774847c456629bedd]> 
> # [981f1f0656054074b32022658560070df2253cb9373a9499f149df8e3c20f068]>]>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:262 - (10) 
> Hashing sub-ranges [# depth=11>, #] 
> for # divided 
> by midpoint -6738564612709905078
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:272 - (10) 
> Inconsistent digest on left sub-range # (-6738593664284701525,-6738564612709905078] depth=11>: [# [fff431a30da07558c5106897da9f3fc3bcc598fb59e67c6959f4e70b6d3a7722]>, 
> # [0d2c5c21f6b04b098f71f43d0f3f3a5a823b372eb850261774847c456629bedd]>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (10) 
> Inconsistent digest on right sub-range # (-6738564612709905078,-6738535561135108630] depth=11>: [# [471be27589e7372e3606d92b45bc8ba07161602d7942c9a614d89ab07d21c9a7]>, 
> # [981f1f0656054074b32022658560070df2253cb9373a9499f149df8e3c20f068]>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) 
> Fully inconsistent range [# (-6738593664284701525,-6738564612709905078] depth=11>, # (-6738564612709905078,-6738535561135108630] depth=11>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (9) 
> Fully inconsistent range [# (-6738651767434294419,-6738593664284701525] depth=10>, # (-6738593664284701525,-6738535561135108630] depth=10>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (8) 
> Inconsistent digest on right sub-range # (-6738535561135108630,-6738419354835922841] depth=9>: [# -6738477457985515736 
> hash=[806ede1a35783bf5fafe8b8ccefe4d3ff48e8f0e1314f8a9ce4b23f13fed4bf9] 
> children=[# hash=[e6d133afbd8041266f8a1cfe456ff07c9e7debe8ff54279b579b252f2af78b6b] 
> children=[#]> # hash=[66bfedb588f87ad3957497728b91bd436af364e6ec40df3299d006de151ac092] 
> children=[#]>]>, # hash=[bf128431ddf7ad72417ce853f90abd0bc7a2a38088d5fbec864c02c516ed6458] 
> children=[# hash=[12a71b62ceb096369c4797f56edd91a16dd9803348b4f4cfe1454027d7dae3d2] 
> children=[#]> # hash=[adb59f5313473b44dd3b7fa697d72c7b23b3c0610f23670942e2c137878a] 
> children=[#]>]>]{noformat}
> When disabling debug logging, the duration dropped to ~2h05m with decent log 
> sizes:
> [https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/38/workflows/c91e6ceb-438b-4f00-b38a-d670f9afb4c3/jobs/51]
> !without_debug_logging.png!
> There's apparently too much logging for each inconsistency found in the 
> Merkle tree comparisons and we should move this to TRACE level if we still 
> want to allow debug logging to be turned on by default.
> I'll prepare a patch for the MerkleTree class and run the repair testing 
> scenarios again to verify their duration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

[jira] [Commented] (CASSANDRA-16362) SSLFactory should initialize SSLContext before setting protocols

2021-01-27 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17273053#comment-17273053
 ] 

Alexander Dejanovski commented on CASSANDRA-16362:
--

It's still failing in CI and passing locally.

I don't have a clue yet why it doesn't work there, and GHA doesn't allow 
SSHing into the CI instances :(

It seems more like a problem with our CI than with your patch, as I can get it 
to pass locally.
I'll spend some time on this issue tomorrow and update this ticket.

> SSLFactory should initialize SSLContext before setting protocols
> 
>
> Key: CASSANDRA-16362
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16362
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tool/bulk load
>Reporter: Erik Merkle
>Assignee: Jon Meredith
>Priority: Normal
> Fix For: 4.0-beta5
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Trying to use sstableloader from the latest trunk produced the following 
> Exception:
> {quote}
> Exception in thread "main" java.lang.RuntimeException: Could not create SSL 
> Context.
>   at 
> org.apache.cassandra.tools.BulkLoader.buildSSLOptions(BulkLoader.java:261)
>   at org.apache.cassandra.tools.BulkLoader.load(BulkLoader.java:64)
>   at org.apache.cassandra.tools.BulkLoader.main(BulkLoader.java:49)
> Caused by: java.io.IOException: Error creating/initializing the SSL Context
>   at 
> org.apache.cassandra.security.SSLFactory.createSSLContext(SSLFactory.java:184)
>   at 
> org.apache.cassandra.tools.BulkLoader.buildSSLOptions(BulkLoader.java:257)
>   ... 2 more
> Caused by: java.lang.IllegalStateException: SSLContext is not initialized
>   at 
> sun.security.ssl.SSLContextImpl.engineGetSocketFactory(SSLContextImpl.java:208)
>   at javax.net.ssl.SSLContextSpi.getDefaultSocket(SSLContextSpi.java:158)
>   at 
> javax.net.ssl.SSLContextSpi.engineGetDefaultSSLParameters(SSLContextSpi.java:184)
>   at javax.net.ssl.SSLContext.getDefaultSSLParameters(SSLContext.java:435)
>   at 
> org.apache.cassandra.security.SSLFactory.createSSLContext(SSLFactory.java:178)
>   ... 3 more
> {quote}
> I believe this is because of a change to SSLFactory for CASSANDRA-13325 here:
> [https://github.com/apache/cassandra/commit/919a8964a83511d96766c3e53ba603e77bca626c#diff-0d569398cfd58566fc56bfb80c971a72afe3f392addc2df731a0b44baf29019eR177-R178]
>  
> I think the solution is to call {{ctx.init()}} before trying to call 
> {{ctx.getDefaultSSLParameters()}}, essentially swapping the two lines in the 
> link above.
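
For illustration, here is a minimal, self-contained sketch of the ordering suggested above, i.e. calling {{init()}} before {{getDefaultSSLParameters()}}. The class and method names are made up for this example and are not Cassandra's actual SSLFactory code:
{code:java}
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLParameters;

public class SslInitOrderSketch
{
    public static SSLParameters defaultParameters() throws Exception
    {
        SSLContext ctx = SSLContext.getInstance("TLS");
        // init() must run before getDefaultSSLParameters(); passing nulls falls back to
        // the JDK's default key managers, trust managers and SecureRandom.
        ctx.init(null, null, null);
        // Calling this on an uninitialized context is what raises
        // "SSLContext is not initialized" in the stack trace above.
        return ctx.getDefaultSSLParameters();
    }

    public static void main(String[] args) throws Exception
    {
        System.out.println(String.join(", ", defaultParameters().getProtocols()));
    }
}
{code}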



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16245) Implement repair quality test scenarios

2021-01-25 Thread Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-16245:
-
Authors: Alexander Dejanovski, Radovan Zvoncek  (was: 
Radovan Zvoncek)
Test and Documentation Plan: 
Perform repairs for a 3 nodes cluster using m5ad.xlarge instances.
 Repaired keyspaces will use RF=3 or RF=2 (the latter is for subranges with 
different sets of replicas).
||Mode||Version||Settings||Checks||
|Full repair|trunk|Sequential + All token ranges|"No anticompaction 
(repairedAt==0)
 Out of sync ranges > 0
 Subsequent run must show no out of sync range"|
|Full repair|trunk|Parallel + Primary range|"No anticompaction (repairedAt==0)
 Out of sync ranges > 0
 Subsequent run must show no out of sync range"|
|Full repair|trunk|Force terminate repair shortly after it was triggered|Repair 
threads must be cleaned up|
|Subrange repair|trunk|Sequential + single token range|"No anticompaction 
(repairedAt==0)
 Out of sync ranges > 0
 Subsequent run must show no out of sync range"|
|Subrange repair|trunk|Parallel + 10 token ranges which have the same 
replicas|"No anticompaction (repairedAt == 0)
 Out of sync ranges > 0
 Subsequent run must show no out of sync range
 A single repair session will handle all subranges at once"|
|Subrange repair|trunk|Parallel + 10 token ranges which have different 
replicas|"No anticompaction (repairedAt==0)
 Out of sync ranges > 0
 Subsequent run must show no out of sync range
 More than one repair session is triggered to process all subranges"|
|Incremental repair|trunk|"Parallel (mandatory)
 No compaction during repair"|"Anticompaction status (repairedAt != 0) on all 
SSTables
 No pending repair on SSTables after completion (could require to wait a bit as 
this will happen asynchronously)
 Out of sync ranges > 0 + Subsequent run must show no out of sync range"|
|Incremental repair|trunk|"Parallel (mandatory)
 Major compaction triggered during repair"|"Anticompaction status (repairedAt 
!= 0) on all SSTables
 No pending repair on SSTables after completion (could require to wait a bit as 
this will happen asynchronously)
 Out of sync ranges > 0 + Subsequent run must show no out of sync range"|
|Incremental repair|trunk|Force terminate repair shortly after it was 
triggered.|Repair threads must be cleaned up|
 Status: Patch Available  (was: In Progress)

> Implement repair quality test scenarios
> ---
>
> Key: CASSANDRA-16245
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16245
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/dtest/java
>Reporter: Alexander Dejanovski
>Assignee: Radovan Zvoncek
>Priority: Normal
> Fix For: 4.0-rc
>
>
> Implement the following test scenarios in a new test suite for repair 
> integration testing with significant load:
> Generate/restore a workload of ~100GB per node. Medusa should be considered 
> to create the initial backup which could then be restored from an S3 bucket 
> to speed up node population.
>  Data should on purpose require repair and be generated accordingly.
> Perform repairs for a 3 nodes cluster with 4 cores each and 16GB-32GB RAM 
> (m5d.xlarge instances would be the most cost efficient type).
>  Repaired keyspaces will use RF=3 or RF=2 in some cases (the latter is for 
> subranges with different sets of replicas).
> ||Mode||Version||Settings||Checks||
> |Full repair|trunk|Sequential + All token ranges|"No anticompaction 
> (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range"|
> |Full repair|trunk|Parallel + Primary range|"No anticompaction (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range"|
> |Full repair|trunk|Force terminate repair shortly after it was 
> triggered|Repair threads must be cleaned up|
> |Subrange repair|trunk|Sequential + single token range|"No anticompaction 
> (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range"|
> |Subrange repair|trunk|Parallel + 10 token ranges which have the same 
> replicas|"No anticompaction (repairedAt == 0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range
> A single repair session will handle all subranges at once"|
> |Subrange repair|trunk|Parallel + 10 token ranges which have different 
> replicas|"No anticompaction (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range
> More than one repair session is triggered to process all subranges"|
> |Incremental repair|trunk|"Parallel (mandatory)
>  No compaction during repair"|"Anticompaction status (repairedAt != 0) on all 
> SSTables
>  No pending repair on SSTables after completion (could require to wait a bit 
> as this 

[jira] [Commented] (CASSANDRA-15580) 4.0 quality testing: Repair

2021-01-25 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17271532#comment-17271532
 ] 

Alexander Dejanovski commented on CASSANDRA-15580:
--

*Status update*

CASSANDRA-16245 is close to done with [nightly CI 
runs|https://app.circleci.com/pipelines/github/riptano/cassandra-rtest?branch=trunk]
 scheduled already.

CASSANDRA-16244 has a patch available which is under review.

 

> 4.0 quality testing: Repair
> ---
>
> Key: CASSANDRA-15580
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15580
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/dtest/python
>Reporter: Josh McKenzie
>Assignee: Alexander Dejanovski
>Priority: Normal
> Fix For: 4.0-rc
>
>
> Reference [doc from 
> NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#]
>  for context.
> *Shepherd: Alexander Dejanovski*
> We aim for 4.0 to have the first fully functioning incremental repair 
> solution (CASSANDRA-9143)! Furthermore we aim to verify that all types of 
> repair: (full range, sub range, incremental) function as expected as well as 
> ensuring community tools such as Reaper work. CASSANDRA-3200 adds an 
> experimental option to reduce the amount of data streamed during repair, we 
> should write more tests and see how it works with big nodes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16245) Implement repair quality test scenarios

2021-01-25 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17271530#comment-17271530
 ] 

Alexander Dejanovski commented on CASSANDRA-16245:
--

Status update:

The test scenarios described in this ticket were implemented and are now 
scheduled for [nightly runs in 
CircleCI|https://app.circleci.com/pipelines/github/riptano/cassandra-rtest?branch=trunk]
 against trunk.

We had to reduce the density per node to 20GB for now, as the tests already 
take a while to run. We may generate additional data without adding more 
entropy to see how that impacts the execution times.

[One last PR|https://github.com/riptano/cassandra-rtest/pull/4] is waiting to 
be merged to fix the code style and use the Cassandra code conventions, and 
also complement the push-triggered CI runs with the CCM-based test scenarios 
which are used for development purposes.

[~vinaykumarcse], are you still willing to review the code? I guess it can 
wait until we reach a consensus on whether we integrate this repair test into 
the Cassandra repo or not, but I'd be happy to get your feedback in the 
meantime.

 

> Implement repair quality test scenarios
> ---
>
> Key: CASSANDRA-16245
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16245
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/dtest/java
>Reporter: Alexander Dejanovski
>Assignee: Radovan Zvoncek
>Priority: Normal
> Fix For: 4.0-rc
>
>
> Implement the following test scenarios in a new test suite for repair 
> integration testing with significant load:
> Generate/restore a workload of ~100GB per node. Medusa should be considered 
> to create the initial backup which could then be restored from an S3 bucket 
> to speed up node population.
>  Data should on purpose require repair and be generated accordingly.
> Perform repairs for a 3 nodes cluster with 4 cores each and 16GB-32GB RAM 
> (m5d.xlarge instances would be the most cost efficient type).
>  Repaired keyspaces will use RF=3 or RF=2 in some cases (the latter is for 
> subranges with different sets of replicas).
> ||Mode||Version||Settings||Checks||
> |Full repair|trunk|Sequential + All token ranges|"No anticompaction 
> (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range"|
> |Full repair|trunk|Parallel + Primary range|"No anticompaction (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range"|
> |Full repair|trunk|Force terminate repair shortly after it was 
> triggered|Repair threads must be cleaned up|
> |Subrange repair|trunk|Sequential + single token range|"No anticompaction 
> (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range"|
> |Subrange repair|trunk|Parallel + 10 token ranges which have the same 
> replicas|"No anticompaction (repairedAt == 0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range
> A single repair session will handle all subranges at once"|
> |Subrange repair|trunk|Parallel + 10 token ranges which have different 
> replicas|"No anticompaction (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range
> More than one repair session is triggered to process all subranges"|
> |Incremental repair|trunk|"Parallel (mandatory)
>  No compaction during repair"|"Anticompaction status (repairedAt != 0) on all 
> SSTables
>  No pending repair on SSTables after completion (could require to wait a bit 
> as this will happen asynchronously)
>  Out of sync ranges > 0 + Subsequent run must show no out of sync range"|
> |Incremental repair|trunk|"Parallel (mandatory)
>  Major compaction triggered during repair"|"Anticompaction status (repairedAt 
> != 0) on all SSTables
>  No pending repair on SSTables after completion (could require to wait a bit 
> as this will happen asynchronously)
>  Out of sync ranges > 0 + Subsequent run must show no out of sync range"|
> |Incremental repair|trunk|Force terminate repair shortly after it was 
> triggered.|Repair threads must be cleaned up|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16244) Create a jvm upgrade dtest for mixed versions repairs

2021-01-25 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17271351#comment-17271351
 ] 

Alexander Dejanovski commented on CASSANDRA-16244:
--

Hi [~adelapena], 

Looking at the patch, it seems that we could hide the upgraded node's behavior 
by timing out while waiting for the message to show up each time.

Correct me if I misunderstood, but the current behavior is:
 * Loop through all nodes one after the other
 * Start a repair using nodetool, which will time out after 10s but is 
expected to fail with a specific error message on the upgraded node
 * If no exception was triggered, check that the logs contain the expected 
message
 * Catch the TimeoutException and assume we're dealing with a non-upgraded node

Isn't it possible that assuming we're dealing with a non-upgraded node whenever 
we get a timeout could hide edge cases where the upgraded node doesn't behave 
as expected and times out as well? The test could then succeed even though 
we're not getting the expected behavior (see the sketch below).

Let me know if I'm missing something.
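
To make the concern concrete, here is a purely illustrative sketch of that control flow; the {{Node}} interface and the helper methods are hypothetical stand-ins, not the actual in-jvm dtest code:
{code:java}
import java.util.List;
import java.util.concurrent.TimeoutException;

public class MixedVersionRepairTestSketch
{
    interface Node { String name(); }   // hypothetical stand-in for a dtest node handle

    static void repairWithTimeout(Node node, int timeoutSeconds) throws TimeoutException
    {
        // placeholder: pretend nodetool repair did not return in time
        throw new TimeoutException("repair on " + node.name() + " did not return within " + timeoutSeconds + "s");
    }

    static boolean logsContainExpectedError(Node node)
    {
        return false; // placeholder for the log assertion
    }

    // A TimeoutException is taken to mean "non-upgraded node", so an upgraded node
    // that times out instead of failing fast falls into the catch block and nothing
    // is verified for it: the test would still pass.
    static void checkRepairRejectedOnUpgradedNode(List<Node> nodes)
    {
        for (Node node : nodes)
        {
            try
            {
                repairWithTimeout(node, 10);
                if (!logsContainExpectedError(node))
                    throw new AssertionError("expected mixed-version repair rejection on " + node.name());
            }
            catch (TimeoutException e)
            {
                // assumed to be a non-upgraded node; no assertion fires in this branch
            }
        }
    }
}
{code}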

> Create a jvm upgrade dtest for mixed versions repairs
> -
>
> Key: CASSANDRA-16244
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16244
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/dtest/java
>Reporter: Alexander Dejanovski
>Assignee: Andres de la Peña
>Priority: Normal
> Fix For: 4.0-rc
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Repair during upgrades should fail on mixed version clusters.
> We'd need an in-jvm upgrade dtest to check that repair indeed fails as 
> expected with mixed current version+previous major version clusters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16362) SSLFactory should initialize SSLContext before setting protocols

2021-01-25 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17271338#comment-17271338
 ] 

Alexander Dejanovski commented on CASSANDRA-16362:
--

Awesome findings [~jmeredithco]!

I managed to get the tests to pass using your new branch on my laptop, but 
they're still failing in CI for some odd reason: 
[https://github.com/thelastpickle/cassandra-medusa/runs/1762379270?check_suite_focus=true]

I'll investigate further to see where the problem lies specifically and update 
here.

> SSLFactory should initialize SSLContext before setting protocols
> 
>
> Key: CASSANDRA-16362
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16362
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tool/bulk load
>Reporter: Erik Merkle
>Assignee: Jon Meredith
>Priority: Normal
> Fix For: 4.0-beta5
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Trying to use sstableloader from the latest trunk produced the following 
> Exception:
> {quote}
> Exception in thread "main" java.lang.RuntimeException: Could not create SSL 
> Context.
>   at 
> org.apache.cassandra.tools.BulkLoader.buildSSLOptions(BulkLoader.java:261)
>   at org.apache.cassandra.tools.BulkLoader.load(BulkLoader.java:64)
>   at org.apache.cassandra.tools.BulkLoader.main(BulkLoader.java:49)
> Caused by: java.io.IOException: Error creating/initializing the SSL Context
>   at 
> org.apache.cassandra.security.SSLFactory.createSSLContext(SSLFactory.java:184)
>   at 
> org.apache.cassandra.tools.BulkLoader.buildSSLOptions(BulkLoader.java:257)
>   ... 2 more
> Caused by: java.lang.IllegalStateException: SSLContext is not initialized
>   at 
> sun.security.ssl.SSLContextImpl.engineGetSocketFactory(SSLContextImpl.java:208)
>   at javax.net.ssl.SSLContextSpi.getDefaultSocket(SSLContextSpi.java:158)
>   at 
> javax.net.ssl.SSLContextSpi.engineGetDefaultSSLParameters(SSLContextSpi.java:184)
>   at javax.net.ssl.SSLContext.getDefaultSSLParameters(SSLContext.java:435)
>   at 
> org.apache.cassandra.security.SSLFactory.createSSLContext(SSLFactory.java:178)
>   ... 3 more
> {quote}
> I believe this is because of a change to SSLFactory for CASSANDRA-13325 here:
> [https://github.com/apache/cassandra/commit/919a8964a83511d96766c3e53ba603e77bca626c#diff-0d569398cfd58566fc56bfb80c971a72afe3f392addc2df731a0b44baf29019eR177-R178]
>  
> I think the solution is to call {{ctx.init()}} before trying to call 
> {{ctx.getDefaultSSLParameters()}}, essentially swapping the two lines in the 
> link above.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16406) Debug logging affects repair performance

2021-01-25 Thread Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-16406:
-
Fix Version/s: 4.0-rc

> Debug logging affects repair performance
> 
>
> Key: CASSANDRA-16406
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16406
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Alexander Dejanovski
>Assignee: Alexander Dejanovski
>Priority: Normal
> Fix For: 4.0-rc
>
> Attachments: with_debug_logging.png, without_debug_logging.png
>
>
> While working on the repair quality testing in CASSANDRA-16245, it appeared 
> that the node coordinating repairs on a 20GB per node dataset was generating 
> more than 2G of log with a total duration for the incremental repair 
> scenarios of ~2h40m: 
> [https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/37/workflows/6a7a41c8-0fca-4080-b37e-3b38998b3fab/jobs/49/steps]
>  !with_debug_logging.png!  
>  The logs showed a lot of messages from the MerkleTree class at high pace:
> {noformat}
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) 
> Fully inconsistent range [# (-6738651767434294419,-6738622715859497972] depth=11>, # (-6738622715859497972,-6738593664284701525] depth=11>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (9) 
> Inconsistent digest on right sub-range # (-6738593664284701525,-6738535561135108630] depth=10>: [# -6738564612709905078 
> hash=[b8efd3d684474276f316b1bc9f23b463cda4f8d620a4b5cf4d2c7dbb101bbe85] 
> children=[# [fff431a30da07558c5106897da9f3fc3bcc598fb59e67c6959f4e70b6d3a7722]> 
> # [471be27589e7372e3606d92b45bc8ba07161602d7942c9a614d89ab07d21c9a7]>]>, 
> # hash=[95334327a0b50b7d3c51d6588a5f3d57701e0b978f6ab28e85cda3cb5a094eb5] 
> children=[# [0d2c5c21f6b04b098f71f43d0f3f3a5a823b372eb850261774847c456629bedd]> 
> # [981f1f0656054074b32022658560070df2253cb9373a9499f149df8e3c20f068]>]>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:262 - (10) 
> Hashing sub-ranges [# depth=11>, #] 
> for # divided 
> by midpoint -6738564612709905078
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:272 - (10) 
> Inconsistent digest on left sub-range # (-6738593664284701525,-6738564612709905078] depth=11>: [# [fff431a30da07558c5106897da9f3fc3bcc598fb59e67c6959f4e70b6d3a7722]>, 
> # [0d2c5c21f6b04b098f71f43d0f3f3a5a823b372eb850261774847c456629bedd]>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (10) 
> Inconsistent digest on right sub-range # (-6738564612709905078,-6738535561135108630] depth=11>: [# [471be27589e7372e3606d92b45bc8ba07161602d7942c9a614d89ab07d21c9a7]>, 
> # [981f1f0656054074b32022658560070df2253cb9373a9499f149df8e3c20f068]>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) 
> Fully inconsistent range [# (-6738593664284701525,-6738564612709905078] depth=11>, # (-6738564612709905078,-6738535561135108630] depth=11>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (9) 
> Fully inconsistent range [# (-6738651767434294419,-6738593664284701525] depth=10>, # (-6738593664284701525,-6738535561135108630] depth=10>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (8) 
> Inconsistent digest on right sub-range # (-6738535561135108630,-6738419354835922841] depth=9>: [# -6738477457985515736 
> hash=[806ede1a35783bf5fafe8b8ccefe4d3ff48e8f0e1314f8a9ce4b23f13fed4bf9] 
> children=[# hash=[e6d133afbd8041266f8a1cfe456ff07c9e7debe8ff54279b579b252f2af78b6b] 
> children=[#]> # hash=[66bfedb588f87ad3957497728b91bd436af364e6ec40df3299d006de151ac092] 
> children=[#]>]>, # hash=[bf128431ddf7ad72417ce853f90abd0bc7a2a38088d5fbec864c02c516ed6458] 
> children=[# hash=[12a71b62ceb096369c4797f56edd91a16dd9803348b4f4cfe1454027d7dae3d2] 
> children=[#]> # hash=[adb59f5313473b44dd3b7fa697d72c7b23b3c0610f23670942e2c137878a] 
> children=[#]>]>]{noformat}
> When disabling debug logging, the duration dropped to ~2h05m with decent log 
> sizes:
> [https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/38/workflows/c91e6ceb-438b-4f00-b38a-d670f9afb4c3/jobs/51]
> !without_debug_logging.png!
> There's apparently too much logging for each inconsistency found in the 
> Merkle tree comparisons and we should move this to TRACE level if we still 
> want to allow debug logging to be turned on by default.
> I'll prepare a patch for the MerkleTree class and run the repair testing 
> scenarios again to verify their duration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

[jira] [Commented] (CASSANDRA-16406) Debug logging affects repair performance

2021-01-25 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17271317#comment-17271317
 ] 

Alexander Dejanovski commented on CASSANDRA-16406:
--

[~spod], it seems like you added most of these debug log statements.

Are you ok with me moving them to TRACE level?
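
For reference, this is the kind of change being considered, assuming the SLF4J logger Cassandra already uses; the class below is only a sketch, not the actual MerkleTree code:
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MerkleTreeLoggingSketch
{
    private static final Logger logger = LoggerFactory.getLogger(MerkleTreeLoggingSketch.class);

    // Demote the per-inconsistency message from DEBUG to TRACE and guard it so that
    // nothing is formatted unless TRACE is explicitly enabled.
    static void reportFullyInconsistent(Object left, Object right)
    {
        if (logger.isTraceEnabled())
            logger.trace("Fully inconsistent range [{}, {}]", left, right);
    }
}
{code}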

> Debug logging affects repair performance
> 
>
> Key: CASSANDRA-16406
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16406
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Alexander Dejanovski
>Assignee: Alexander Dejanovski
>Priority: Normal
> Attachments: with_debug_logging.png, without_debug_logging.png
>
>
> While working on the repair quality testing in CASSANDRA-16245, it appeared 
> that the node coordinating repairs on a 20GB per node dataset was generating 
> more than 2G of log with a total duration for the incremental repair 
> scenarios of ~2h40m: 
> [https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/37/workflows/6a7a41c8-0fca-4080-b37e-3b38998b3fab/jobs/49/steps]
>  !with_debug_logging.png!  
>  The logs showed a lot of messages from the MerkleTree class at high pace:
> {noformat}
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) 
> Fully inconsistent range [# (-6738651767434294419,-6738622715859497972] depth=11>, # (-6738622715859497972,-6738593664284701525] depth=11>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (9) 
> Inconsistent digest on right sub-range # (-6738593664284701525,-6738535561135108630] depth=10>: [# -6738564612709905078 
> hash=[b8efd3d684474276f316b1bc9f23b463cda4f8d620a4b5cf4d2c7dbb101bbe85] 
> children=[# [fff431a30da07558c5106897da9f3fc3bcc598fb59e67c6959f4e70b6d3a7722]> 
> # [471be27589e7372e3606d92b45bc8ba07161602d7942c9a614d89ab07d21c9a7]>]>, 
> # hash=[95334327a0b50b7d3c51d6588a5f3d57701e0b978f6ab28e85cda3cb5a094eb5] 
> children=[# [0d2c5c21f6b04b098f71f43d0f3f3a5a823b372eb850261774847c456629bedd]> 
> # [981f1f0656054074b32022658560070df2253cb9373a9499f149df8e3c20f068]>]>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:262 - (10) 
> Hashing sub-ranges [# depth=11>, #] 
> for # divided 
> by midpoint -6738564612709905078
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:272 - (10) 
> Inconsistent digest on left sub-range # (-6738593664284701525,-6738564612709905078] depth=11>: [# [fff431a30da07558c5106897da9f3fc3bcc598fb59e67c6959f4e70b6d3a7722]>, 
> # [0d2c5c21f6b04b098f71f43d0f3f3a5a823b372eb850261774847c456629bedd]>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (10) 
> Inconsistent digest on right sub-range # (-6738564612709905078,-6738535561135108630] depth=11>: [# [471be27589e7372e3606d92b45bc8ba07161602d7942c9a614d89ab07d21c9a7]>, 
> # [981f1f0656054074b32022658560070df2253cb9373a9499f149df8e3c20f068]>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) 
> Fully inconsistent range [# (-6738593664284701525,-6738564612709905078] depth=11>, # (-6738564612709905078,-6738535561135108630] depth=11>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (9) 
> Fully inconsistent range [# (-6738651767434294419,-6738593664284701525] depth=10>, # (-6738593664284701525,-6738535561135108630] depth=10>]
> DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (8) 
> Inconsistent digest on right sub-range # (-6738535561135108630,-6738419354835922841] depth=9>: [# -6738477457985515736 
> hash=[806ede1a35783bf5fafe8b8ccefe4d3ff48e8f0e1314f8a9ce4b23f13fed4bf9] 
> children=[# hash=[e6d133afbd8041266f8a1cfe456ff07c9e7debe8ff54279b579b252f2af78b6b] 
> children=[#]> # hash=[66bfedb588f87ad3957497728b91bd436af364e6ec40df3299d006de151ac092] 
> children=[#]>]>, # hash=[bf128431ddf7ad72417ce853f90abd0bc7a2a38088d5fbec864c02c516ed6458] 
> children=[# hash=[12a71b62ceb096369c4797f56edd91a16dd9803348b4f4cfe1454027d7dae3d2] 
> children=[#]> # hash=[adb59f5313473b44dd3b7fa697d72c7b23b3c0610f23670942e2c137878a] 
> children=[#]>]>]{noformat}
> When disabling debug logging, the duration dropped to ~2h05m with decent log 
> sizes:
> [https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/38/workflows/c91e6ceb-438b-4f00-b38a-d670f9afb4c3/jobs/51]
> !without_debug_logging.png!
> There's apparently too much logging for each inconsistency found in the 
> Merkle tree comparisons and we should move this to TRACE level if we still 
> want to allow debug logging to be turned on by default.
> I'll prepare a patch for the MerkleTree class and run the repair testing 
> scenarios again to verify their duration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

[jira] [Updated] (CASSANDRA-16406) Debug logging affects repair performance

2021-01-25 Thread Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-16406:
-
Description: 
While working on the repair quality testing in CASSANDRA-16245, it appeared 
that the node coordinating repairs on a 20GB per node dataset was generating 
more than 2G of log with a total duration for the incremental repair scenarios 
of ~2h40m: 
[https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/37/workflows/6a7a41c8-0fca-4080-b37e-3b38998b3fab/jobs/49/steps]
 !with_debug_logging.png!  
 The logs showed a lot of messages from the MerkleTree class at high pace:
{noformat}
DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) 
Fully inconsistent range [#, #]
DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (9) 
Inconsistent digest on right sub-range #: [# 
#]>, 
# 
#]>]
DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:262 - (10) 
Hashing sub-ranges [#, #] 
for # divided 
by midpoint -6738564612709905078
DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:272 - (10) 
Inconsistent digest on left sub-range #: [#, 
#]
DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (10) 
Inconsistent digest on right sub-range #: [#, 
#]
DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) 
Fully inconsistent range [#, #]
DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (9) Fully 
inconsistent range [#, #]
DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (8) 
Inconsistent digest on right sub-range #: [# #]>, # #]>]{noformat}
When disabling debug logging, the duration dropped to ~2h05m with decent log 
sizes:

[https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/38/workflows/c91e6ceb-438b-4f00-b38a-d670f9afb4c3/jobs/51]

!without_debug_logging.png!

There's apparently too much logging for each inconsistency found in the Merkle 
tree comparisons and we should move this to TRACE level if we still want to 
allow debug logging to be turned on by default.

I'll prepare a patch for the MerkleTree class and run the repair testing 
scenarios again to verify their duration.

  was:
While working on the repair quality testing in CASSANDRA-16245, it appeared 
that the node coordinating repairs on a 20GB per node dataset was generating 
more than 2G of log with a total duration for the incremental repair scenarios 
of ~2h40m: 
[https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/37/workflows/6a7a41c8-0fca-4080-b37e-3b38998b3fab/jobs/49/steps]
!with_debug_logging.png!  
The logs showed a lot of messages from the MerkleTree class at high pace:
{noformat}
DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) 
Fully inconsistent range [#, #]
DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (9) 
Inconsistent digest on right sub-range #: [# 
#]>, 
# 
#]>]
DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:262 - (10) 
Hashing sub-ranges [#, #] 
for # divided 
by midpoint -6738564612709905078
DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:272 - (10) 
Inconsistent digest on left sub-range #: [#, 
#]
DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (10) 
Inconsistent digest on right sub-range #: [#, 
#]
DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) 
Fully inconsistent range [#, #]
DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (9) Fully 
inconsistent range [#, #]
DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (8) 
Inconsistent digest on right sub-range #: [# #]>, # #]>]{noformat}
When disabling debug logging, the duration dropped to ~2h05m with decent log 
sizes:

[https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/38/workflows/c91e6ceb-438b-4f00-b38a-d670f9afb4c3/jobs/51]

!without_debug_logging.png!

There's apparently too much logging for each inconsistency found in the Merkle 
tree comparisons and we should move this to TRACE level if we still want to 
allow debug logging to be turned on by default.

I'll prepare a patch for the MerkleTree class and run the repair testing 
scenarios again to verify their duration.


> Debug logging affects repair performance
> 
>
> Key: CASSANDRA-16406
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16406
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Alexander Dejanovski
>Assignee: Alexander Dejanovski
>Priority: Normal
> Attachments: with_debug_logging.png, without_debug_logging.png
>
>
> While working on the repair quality testing in CASSANDRA-16245, it appeared 
> that the node coordinating repairs on a 20GB 

[jira] [Created] (CASSANDRA-16406) Debug logging affects repair performance

2021-01-25 Thread Alexander Dejanovski (Jira)
Alexander Dejanovski created CASSANDRA-16406:


 Summary: Debug logging affects repair performance
 Key: CASSANDRA-16406
 URL: https://issues.apache.org/jira/browse/CASSANDRA-16406
 Project: Cassandra
  Issue Type: Bug
  Components: Consistency/Repair
Reporter: Alexander Dejanovski
Assignee: Alexander Dejanovski
 Attachments: with_debug_logging.png, without_debug_logging.png

While working on the repair quality testing in CASSANDRA-16245, it appeared 
that the node coordinating repairs on a 20GB per node dataset was generating 
more than 2G of log with a total duration for the incremental repair scenarios 
of ~2h40m: 
[https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/37/workflows/6a7a41c8-0fca-4080-b37e-3b38998b3fab/jobs/49/steps]
!with_debug_logging.png!  
The logs showed a lot of messages from the MerkleTree class at high pace:
{noformat}
DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) 
Fully inconsistent range [#, #]
DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (9) 
Inconsistent digest on right sub-range #: [# 
#]>, 
# 
#]>]
DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:262 - (10) 
Hashing sub-ranges [#, #] 
for # divided 
by midpoint -6738564612709905078
DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:272 - (10) 
Inconsistent digest on left sub-range #: [#, 
#]
DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (10) 
Inconsistent digest on right sub-range #: [#, 
#]
DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) 
Fully inconsistent range [#, #]
DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (9) Fully 
inconsistent range [#, #]
DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (8) 
Inconsistent digest on right sub-range #: [# #]>, # #]>]{noformat}
When disabling debug logging, the duration dropped to ~2h05m with decent log 
sizes:

[https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/38/workflows/c91e6ceb-438b-4f00-b38a-d670f9afb4c3/jobs/51]

!without_debug_logging.png!

There's apparently too much logging for each inconsistency found in the Merkle 
tree comparisons and we should move this to TRACE level if we still want to 
allow debug logging to be turned on by default.

I'll prepare a patch for the MerkleTree class and run the repair testing 
scenarios again to verify their duration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16245) Implement repair quality test scenarios

2021-01-22 Thread Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-16245:
-
Description: 
Implement the following test scenarios in a new test suite for repair 
integration testing with significant load:

Generate/restore a workload of ~100GB per node. Medusa should be considered to 
create the initial backup which could then be restored from an S3 bucket to 
speed up node population.
 Data should on purpose require repair and be generated accordingly.

Perform repairs for a 3 nodes cluster with 4 cores each and 16GB-32GB RAM 
(m5d.xlarge instances would be the most cost efficient type).
 Repaired keyspaces will use RF=3 or RF=2 in some cases (the latter is for 
subranges with different sets of replicas).
||Mode||Version||Settings||Checks||
|Full repair|trunk|Sequential + All token ranges|"No anticompaction 
(repairedAt==0)
 Out of sync ranges > 0
 Subsequent run must show no out of sync range"|
|Full repair|trunk|Parallel + Primary range|"No anticompaction (repairedAt==0)
 Out of sync ranges > 0
 Subsequent run must show no out of sync range"|
|Full repair|trunk|Force terminate repair shortly after it was triggered|Repair 
threads must be cleaned up|
|Subrange repair|trunk|Sequential + single token range|"No anticompaction 
(repairedAt==0)
 Out of sync ranges > 0
 Subsequent run must show no out of sync range"|
|Subrange repair|trunk|Parallel + 10 token ranges which have the same 
replicas|"No anticompaction (repairedAt == 0)
 Out of sync ranges > 0
 Subsequent run must show no out of sync range
A single repair session will handle all subranges at once"|
|Subrange repair|trunk|Parallel + 10 token ranges which have different 
replicas|"No anticompaction (repairedAt==0)
 Out of sync ranges > 0
 Subsequent run must show no out of sync range
More than one repair session is triggered to process all subranges"|
|Incremental repair|trunk|"Parallel (mandatory)
 No compaction during repair"|"Anticompaction status (repairedAt != 0) on all 
SSTables
 No pending repair on SSTables after completion (could require to wait a bit as 
this will happen asynchronously)
 Out of sync ranges > 0 + Subsequent run must show no out of sync range"|
|Incremental repair|trunk|"Parallel (mandatory)
 Major compaction triggered during repair"|"Anticompaction status (repairedAt 
!= 0) on all SSTables
 No pending repair on SSTables after completion (could require to wait a bit as 
this will happen asynchronously)
 Out of sync ranges > 0 + Subsequent run must show no out of sync range"|
|Incremental repair|trunk|Force terminate repair shortly after it was 
triggered.|Repair threads must be cleaned up|

  was:
Implement the following test scenarios in a new test suite for repair 
integration testing with significant load:

Generate/restore a workload of ~100GB per node. Medusa should be considered to 
create the initial backup which could then be restored from an S3 bucket to 
speed up node population.
 Data should on purpose require repair and be generated accordingly.

Perform repairs for a 3 nodes cluster with 4 cores each and 16GB-32GB RAM 
(m5d.xlarge instances would be the most cost efficient type).
 Repaired keyspaces will use RF=3 or RF=2 in some cases (the latter is for 
subranges with different sets of replicas).
||Mode||Version||Settings||Checks||
|Full repair|trunk|Sequential + All token ranges|"No anticompaction 
(repairedAt==0)
 Out of sync ranges > 0
 Subsequent run must show no out of sync range"|
|Full repair|trunk|Parallel + Primary range|"No anticompaction (repairedAt==0)
 Out of sync ranges > 0
 Subsequent run must show no out of sync range"|
|Full repair|trunk|Force terminate repair shortly after it was triggered|Repair 
threads must be cleaned up|
|Subrange repair|trunk|Sequential + single token range|"No anticompaction 
(repairedAt==0)
 Out of sync ranges > 0
 Subsequent run must show no out of sync range"|
|Subrange repair|trunk|Parallel + 10 token ranges which have the same 
replicas|"No anticompaction (repairedAt == 0)
 Out of sync ranges > 0
 Subsequent run must show no out of sync range
A single repair session will handle all subranges at once"|
|Subrange repair|trunk|Parallel + 10 token ranges which have different 
replicas|"No anticompaction (repairedAt==0)
 Out of sync ranges > 0
 Subsequent run must show no out of sync range
More than one repair session is triggered to process all subranges"|
|Subrange repair|trunk|"Single token range.
 Force terminate repair shortly after it was triggered."|Repair threads must be 
cleaned up|
|Incremental repair|trunk|"Parallel (mandatory)
 No compaction during repair"|"Anticompaction status (repairedAt != 0) on all 
SSTables
 No pending repair on SSTables after completion (could require to wait a bit as 
this will happen asynchronously)
 Out of sync ranges > 0 + Subsequent run must show no out of sync range"|

[jira] [Commented] (CASSANDRA-16244) Create a jvm upgrade dtest for mixed versions repairs

2021-01-21 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269392#comment-17269392
 ] 

Alexander Dejanovski commented on CASSANDRA-16244:
--

Thanks for picking this up [~adelapena]! :)

> Create a jvm upgrade dtest for mixed versions repairs
> -
>
> Key: CASSANDRA-16244
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16244
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/dtest/java
>Reporter: Alexander Dejanovski
>Assignee: Andres de la Peña
>Priority: Normal
> Fix For: 4.0-rc
>
>
> Repair during upgrades should fail on mixed version clusters.
> We'd need an in-jvm upgrade dtest to check that repair indeed fails as 
> expected with mixed current version+previous major version clusters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16362) SSLFactory should initialize SSLContext before setting protocols

2021-01-20 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268819#comment-17268819
 ] 

Alexander Dejanovski commented on CASSANDRA-16362:
--

It's our fault, actually. Please use the following branch of Medusa to be able 
to use forks as the Cassandra base for the ccm clusters: 
https://github.com/thelastpickle/cassandra-medusa/tree/alex/CASSANDRA-16362

The master branch will only accept {{github:apache/...}} versions.

> SSLFactory should initialize SSLContext before setting protocols
> 
>
> Key: CASSANDRA-16362
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16362
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tool/bulk load
>Reporter: Erik Merkle
>Assignee: Jon Meredith
>Priority: Normal
> Fix For: 4.0-beta5
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Trying to use sstableloader from the latest trunk produced the following 
> Exception:
> {quote}
> Exception in thread "main" java.lang.RuntimeException: Could not create SSL 
> Context.
>   at 
> org.apache.cassandra.tools.BulkLoader.buildSSLOptions(BulkLoader.java:261)
>   at org.apache.cassandra.tools.BulkLoader.load(BulkLoader.java:64)
>   at org.apache.cassandra.tools.BulkLoader.main(BulkLoader.java:49)
> Caused by: java.io.IOException: Error creating/initializing the SSL Context
>   at 
> org.apache.cassandra.security.SSLFactory.createSSLContext(SSLFactory.java:184)
>   at 
> org.apache.cassandra.tools.BulkLoader.buildSSLOptions(BulkLoader.java:257)
>   ... 2 more
> Caused by: java.lang.IllegalStateException: SSLContext is not initialized
>   at 
> sun.security.ssl.SSLContextImpl.engineGetSocketFactory(SSLContextImpl.java:208)
>   at javax.net.ssl.SSLContextSpi.getDefaultSocket(SSLContextSpi.java:158)
>   at 
> javax.net.ssl.SSLContextSpi.engineGetDefaultSSLParameters(SSLContextSpi.java:184)
>   at javax.net.ssl.SSLContext.getDefaultSSLParameters(SSLContext.java:435)
>   at 
> org.apache.cassandra.security.SSLFactory.createSSLContext(SSLFactory.java:178)
>   ... 3 more
> {quote}
> I believe this is because of a change to SSLFactory for CASSANDRA-13325 here:
> [https://github.com/apache/cassandra/commit/919a8964a83511d96766c3e53ba603e77bca626c#diff-0d569398cfd58566fc56bfb80c971a72afe3f392addc2df731a0b44baf29019eR177-R178]
>  
> I think the solution is to call {{ctx.init()}} before trying to call 
> {{ctx.getDefaultSSLParameters()}}, essentially swapping the two lines in the 
> link above.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16362) SSLFactory should initialize SSLContext before setting protocols

2021-01-20 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268732#comment-17268732
 ] 

Alexander Dejanovski commented on CASSANDRA-16362:
--

Hi [~jmeredithco], 

Very sorry for not responding earlier, I'm heads down on 4.0 repair quality 
testing at the moment.
A colleague of mine is working on giving you steps to reproduce the issue with 
ccm and will comment here soon with the instructions.
For the Medusa integration tests, there was an issue with the sstableloader 
test (scenario 11), which was fixed by CASSANDRA-16280.
I managed to get scenario 11 passing with beta3:

{code:java}
(py36) adejanovski@mac-alex-2 cassandra-medusa % ./run_integration_tests.sh -t 
11 --cassandra-version=4.0-beta3
...
...
  @11 @local
  Scenario Outline: Perform a backup, and restore it using the sstableloader -- 
@1.1 Local storage  # integration/features/integration_tests.feature:450
Given I have a fresh ccm cluster "with_client_encryption" running named 
"scenario11"# features/steps/integration_steps.py:125
Given I have a fresh ccm cluster "with_client_encryption" running named 
"scenario11"# features/steps/integration_steps.py:125 22.497s
Given I am using "local" as storage provider in ccm cluster 
"with_client_encryption"# features/steps/integration_steps.py:235 
0.052s
When I create the "test" table in keyspace "medusa" 
# features/steps/integration_steps.py:511 0.122s
When I load 100 rows in the "medusa.test" table 
# features/steps/integration_steps.py:534 0.192s
When I run a "ccm node1 nodetool flush" command 
# features/steps/integration_steps.py:542 1.508s
When I load 100 rows in the "medusa.test" table 
# features/steps/integration_steps.py:534 0.160s
When I run a "ccm node1 nodetool flush" command 
# features/steps/integration_steps.py:542 1.445s
When I perform a backup in "full" mode of the node named "first_backup" 
# features/steps/integration_steps.py:547 3.208s
Then I can see the backup named "first_backup" when I list the backups  
# features/steps/integration_steps.py:591 0.014s
Then I can verify the backup named "first_backup" successfully  
# features/steps/integration_steps.py:655 0.029s
When I load 100 rows in the "medusa.test" table 
# features/steps/integration_steps.py:534 0.135s
When I run a "ccm node1 nodetool flush" command 
# features/steps/integration_steps.py:542 1.373s
Then I have 300 rows in the "medusa.test" table in ccm cluster 
"with_client_encryption" # features/steps/integration_steps.py:766 
0.119s
When I truncate the "medusa.test" table in ccm cluster 
"with_client_encryption" # 
features/steps/integration_steps.py:1040 0.167s
When I restore the backup named "first_backup" with the sstableloader   
# features/steps/integration_steps.py:734 20.214s
Then I have 200 rows in the "medusa.test" table in ccm cluster 
"with_client_encryption" # features/steps/integration_steps.py:766 
0.079s
...
...
1 feature passed, 0 failed, 0 skipped
1 scenario passed, 0 failed, 59 skipped
16 steps passed, 0 failed, 1039 skipped, 0 undefined
Took 0m51.312s
{code}

But it fails with beta4 due to the issue reported in this very ticket: 

{noformat}
./run_integration_tests.sh -t 11 --cassandra-version=4.0-beta4
...
...
  @11 @local
  Scenario Outline: Perform a backup, and restore it using the sstableloader -- 
@1.1 Local storage  # integration/features/integration_tests.feature:450
Given I have a fresh ccm cluster "with_client_encryption" running named 
"scenario11"# features/steps/integration_steps.py:125
Given I have a fresh ccm cluster "with_client_encryption" running named 
"scenario11"# features/steps/integration_steps.py:125 22.836s
Given I am using "local" as storage provider in ccm cluster 
"with_client_encryption"# features/steps/integration_steps.py:235 
0.053s
When I create the "test" table in keyspace "medusa" 
# features/steps/integration_steps.py:511 0.113s
When I load 100 rows in the "medusa.test" table 
# features/steps/integration_steps.py:534 0.229s
When I run a "ccm node1 nodetool flush" command 
# features/steps/integration_steps.py:542 1.424s
When I load 100 rows in the "medusa.test" table

[jira] [Commented] (CASSANDRA-16362) SSLFactory should initialize SSLContext before setting protocols

2021-01-11 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262634#comment-17262634
 ] 

Alexander Dejanovski commented on CASSANDRA-16362:
--

[~jmeredithco], it's now failing earlier in the process, as I can't get the 
Python CQL driver to connect to Cassandra when encryption is turned on. I've 
tried both TLS 1.1 and TLS 1.2 with the same result...
Previously we failed later in the process: we could connect using the Python 
driver but failed when using sstableloader.

Any idea what could be preventing us from connecting with TLS 1.1 and 1.2 
when using the Python driver?
Our implementation follows what's described in [this driver documentation 
page|https://docs.datastax.com/en/developer/python-driver/3.24/security/#ssl-configuration-examples].



> SSLFactory should initialize SSLContext before setting protocols
> 
>
> Key: CASSANDRA-16362
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16362
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tool/bulk load
>Reporter: Erik Merkle
>Assignee: Jon Meredith
>Priority: Normal
> Fix For: 4.0-beta5
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Trying to use sstableloader from the latest trunk produced the following 
> Exception:
> {quote}
> Exception in thread "main" java.lang.RuntimeException: Could not create SSL 
> Context.
>   at 
> org.apache.cassandra.tools.BulkLoader.buildSSLOptions(BulkLoader.java:261)
>   at org.apache.cassandra.tools.BulkLoader.load(BulkLoader.java:64)
>   at org.apache.cassandra.tools.BulkLoader.main(BulkLoader.java:49)
> Caused by: java.io.IOException: Error creating/initializing the SSL Context
>   at 
> org.apache.cassandra.security.SSLFactory.createSSLContext(SSLFactory.java:184)
>   at 
> org.apache.cassandra.tools.BulkLoader.buildSSLOptions(BulkLoader.java:257)
>   ... 2 more
> Caused by: java.lang.IllegalStateException: SSLContext is not initialized
>   at 
> sun.security.ssl.SSLContextImpl.engineGetSocketFactory(SSLContextImpl.java:208)
>   at javax.net.ssl.SSLContextSpi.getDefaultSocket(SSLContextSpi.java:158)
>   at 
> javax.net.ssl.SSLContextSpi.engineGetDefaultSSLParameters(SSLContextSpi.java:184)
>   at javax.net.ssl.SSLContext.getDefaultSSLParameters(SSLContext.java:435)
>   at 
> org.apache.cassandra.security.SSLFactory.createSSLContext(SSLFactory.java:178)
>   ... 3 more
> {quote}
> I believe this is because of a change to SSLFactory for CASSANDRA-13325 here:
> [https://github.com/apache/cassandra/commit/919a8964a83511d96766c3e53ba603e77bca626c#diff-0d569398cfd58566fc56bfb80c971a72afe3f392addc2df731a0b44baf29019eR177-R178]
>  
> I think the solution is to call {{ctx.init()}} before trying to call 
> {{ctx.getDefaultSSLParameters()}}, essentially swapping the two lines in the 
> link above.
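
To illustrate the ordering: {{getDefaultSSLParameters()}} throws 
{{IllegalStateException}} ("SSLContext is not initialized") until {{init()}} 
has been called, so swapping the two calls resolves it. A minimal standalone 
sketch (illustration only, not the actual SSLFactory code):

{code:java}
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLParameters;

// Illustration only -- not the SSLFactory code itself.
public class SslContextOrdering
{
    public static void main(String[] args) throws Exception
    {
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, null, null);                            // 1. initialize the context
        SSLParameters params = ctx.getDefaultSSLParameters();  // 2. only then query parameters
        System.out.println(String.join(",", params.getProtocols()));
    }
}
{code}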






[jira] [Commented] (CASSANDRA-16362) SSLFactory should initialize SSLContext before setting protocols

2021-01-08 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17261431#comment-17261431
 ] 

Alexander Dejanovski commented on CASSANDRA-16362:
--

Sure thing, I'll run a test ASAP.

> SSLFactory should initialize SSLContext before setting protocols
> 
>
> Key: CASSANDRA-16362
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16362
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tool/bulk load
>Reporter: Erik Merkle
>Assignee: Jon Meredith
>Priority: Normal
> Fix For: 4.0-beta5
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Trying to use sstableloader from the latest trunk produced the following 
> Exception:
> {quote}
> Exception in thread "main" java.lang.RuntimeException: Could not create SSL 
> Context.
>   at 
> org.apache.cassandra.tools.BulkLoader.buildSSLOptions(BulkLoader.java:261)
>   at org.apache.cassandra.tools.BulkLoader.load(BulkLoader.java:64)
>   at org.apache.cassandra.tools.BulkLoader.main(BulkLoader.java:49)
> Caused by: java.io.IOException: Error creating/initializing the SSL Context
>   at 
> org.apache.cassandra.security.SSLFactory.createSSLContext(SSLFactory.java:184)
>   at 
> org.apache.cassandra.tools.BulkLoader.buildSSLOptions(BulkLoader.java:257)
>   ... 2 more
> Caused by: java.lang.IllegalStateException: SSLContext is not initialized
>   at 
> sun.security.ssl.SSLContextImpl.engineGetSocketFactory(SSLContextImpl.java:208)
>   at javax.net.ssl.SSLContextSpi.getDefaultSocket(SSLContextSpi.java:158)
>   at 
> javax.net.ssl.SSLContextSpi.engineGetDefaultSSLParameters(SSLContextSpi.java:184)
>   at javax.net.ssl.SSLContext.getDefaultSSLParameters(SSLContext.java:435)
>   at 
> org.apache.cassandra.security.SSLFactory.createSSLContext(SSLFactory.java:178)
>   ... 3 more
> {quote}
> I believe this is because of a change to SSLFactory for CASSANDRA-13325 here:
> [https://github.com/apache/cassandra/commit/919a8964a83511d96766c3e53ba603e77bca626c#diff-0d569398cfd58566fc56bfb80c971a72afe3f392addc2df731a0b44baf29019eR177-R178]
>  
> I think the solution is to call {{ctx.init()}} before trying to call 
> {{ctx.getDefaultSSLParameters()}}, essentially swapping the two lines in the 
> link above.






[jira] [Commented] (CASSANDRA-16362) SSLFactory should initialize SSLContext before setting protocols

2020-12-22 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253347#comment-17253347
 ] 

Alexander Dejanovski commented on CASSANDRA-16362:
--

Hi [~jmeredithco],

Thanks for issuing a patch.
I tested it with Medusa's integration tests and now get the following error:

{noformat}
WARN  09:57:44,993 Failed to initialize a channel. Closing: [id: 0x61e6eef5]
java.lang.IllegalArgumentException: TLSv1.3
at sun.security.ssl.ProtocolVersion.valueOf(ProtocolVersion.java:187)
at sun.security.ssl.ProtocolList.convert(ProtocolList.java:84)
at sun.security.ssl.ProtocolList.<init>(ProtocolList.java:52)
at 
sun.security.ssl.SSLEngineImpl.setEnabledProtocols(SSLEngineImpl.java:2081)
at 
org.apache.cassandra.tools.BulkLoader$1.newSSLEngine(BulkLoader.java:276)
at 
com.datastax.driver.core.RemoteEndpointAwareJdkSSLOptions.newSSLHandler(RemoteEndpointAwareJdkSSLOptions.java:62)
at 
com.datastax.driver.core.Connection$Initializer.initChannel(Connection.java:1700)
at 
com.datastax.driver.core.Connection$Initializer.initChannel(Connection.java:1644)
at 
com.datastax.shaded.netty.channel.ChannelInitializer.initChannel(ChannelInitializer.java:113)
at 
com.datastax.shaded.netty.channel.ChannelInitializer.handlerAdded(ChannelInitializer.java:105)
at 
com.datastax.shaded.netty.channel.DefaultChannelPipeline.callHandlerAdded0(DefaultChannelPipeline.java:593)
at 
com.datastax.shaded.netty.channel.DefaultChannelPipeline.access$000(DefaultChannelPipeline.java:44)
at 
com.datastax.shaded.netty.channel.DefaultChannelPipeline$PendingHandlerAddedTask.execute(DefaultChannelPipeline.java:1357)
at 
com.datastax.shaded.netty.channel.DefaultChannelPipeline.callHandlerAddedForAllHandlers(DefaultChannelPipeline.java:1092)
at 
com.datastax.shaded.netty.channel.DefaultChannelPipeline.invokeHandlerAddedIfNeeded(DefaultChannelPipeline.java:642)
at 
com.datastax.shaded.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:456)
at 
com.datastax.shaded.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:378)
at 
com.datastax.shaded.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:428)
at 
com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
at 
com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:464)
at 
com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at 
com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 
(com.datastax.driver.core.exceptions.TransportException: 
[localhost/127.0.0.1:9042] Cannot connect))
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried 
for query failed (tried: localhost/127.0.0.1:9042 
(com.datastax.driver.core.exceptions.TransportException: 
[localhost/127.0.0.1:9042] Cannot connect))
at 
com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:268)
at 
com.datastax.driver.core.ControlConnection.connect(ControlConnection.java:107)
at 
com.datastax.driver.core.Cluster$Manager.negotiateProtocolVersionAndConnect(Cluster.java:1813)
at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:1726)
at com.datastax.driver.core.Cluster.init(Cluster.java:214)
at com.datastax.driver.core.Cluster.connectAsync(Cluster.java:387)
at com.datastax.driver.core.Cluster.connectAsync(Cluster.java:366)
at com.datastax.driver.core.Cluster.connect(Cluster.java:311)
at 
org.apache.cassandra.utils.NativeSSTableLoaderClient.init(NativeSSTableLoaderClient.java:75)
at 
org.apache.cassandra.io.sstable.SSTableLoader.stream(SSTableLoader.java:183)
at org.apache.cassandra.tools.BulkLoader.load(BulkLoader.java:79)
at org.apache.cassandra.tools.BulkLoader.main(BulkLoader.java:51)
{noformat}
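
The failure looks like the JVM's SSL provider rejecting TLSv1.3 as an enabled 
protocol. A minimal sketch (illustration only, not code from the patch) that 
reproduces the same {{IllegalArgumentException}} on a JDK 8 build without 
TLS 1.3 support:

{code:java}
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;

// Illustration only: on a runtime whose provider does not know TLS 1.3
// (e.g. older JDK 8 builds), enabling "TLSv1.3" on an SSLEngine fails with
// java.lang.IllegalArgumentException: TLSv1.3, which is the error raised
// from BulkLoader$1.newSSLEngine() in the trace above.
public class Tls13Probe
{
    public static void main(String[] args) throws Exception
    {
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, null, null);
        SSLEngine engine = ctx.createSSLEngine();
        System.out.println("Supported: " + String.join(",", engine.getSupportedProtocols()));
        engine.setEnabledProtocols(new String[]{ "TLSv1.3" }); // throws if unsupported
    }
}
{code}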

Here's the sstableloader command that is being issued:

{noformat}
subprocess.CalledProcessError: Command 
'['/Users/adejanovski/.ccm/repository/githubCOLONjonmeredithSLASHC16362/bin/sstableloader',
 '-d', '127.0.0.1', '--conf-path', 
'/Users/adejanovski/.ccm/scenario11/node1/conf/cassandra.yaml', '--username', 
'cassandra', '--password', 'cassandra', '--no-progress', 
'/tmp/medusa-restore-97ec3e11-426a-4924-8bc0-379e99ff2205/system_distributed/repair_history-759fffad624b318180eefa9a52d1f627',
 '-ts', 
'/Users/adejanovski/projets/cassandra/thelastpickle/cassandra-medusa/tests/resources/local_with_ssl/generic-server-truststore.jks',
 '-tspw', 'truststorePass1', 

[jira] [Commented] (CASSANDRA-15580) 4.0 quality testing: Repair

2020-12-10 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247255#comment-17247255
 ] 

Alexander Dejanovski commented on CASSANDRA-15580:
--

Thanks for the feedback [~vinaykumarcse]!

We made some progress in CASSANDRA-16245 with an implementation of the test 
scenarios using Cucumber, with ccm to spin up a cluster.
We're in the process of wiring it up with tlp-cluster so it can run against an 
actual AWS cluster instead, for 100 GB per node density testing.

Hopefully we'll have a first fully running version next week.

> 4.0 quality testing: Repair
> ---
>
> Key: CASSANDRA-15580
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15580
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/dtest/python
>Reporter: Josh McKenzie
>Assignee: Alexander Dejanovski
>Priority: Normal
> Fix For: 4.0-rc
>
>
> Reference [doc from 
> NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#]
>  for context.
> *Shepherd: Alexander Dejanovski*
> We aim for 4.0 to have the first fully functioning incremental repair 
> solution (CASSANDRA-9143)! Furthermore we aim to verify that all types of 
> repair: (full range, sub range, incremental) function as expected as well as 
> ensuring community tools such as Reaper work. CASSANDRA-3200 adds an 
> experimental option to reduce the amount of data streamed during repair, we 
> should write more tests and see how it works with big nodes.






[jira] [Commented] (CASSANDRA-16245) Implement repair quality test scenarios

2020-12-10 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247249#comment-17247249
 ] 

Alexander Dejanovski commented on CASSANDRA-16245:
--

Hi [~zvo], 

Awesome stuff so far!

I've pushed a GitHub Actions workflow which spins up/tears down a 3-node 
cluster in AWS using m5ad.xlarge instances (4 vCPUs, 16GB RAM and 150GB of 
direct attached storage).
They provide a 140GB SSD drive which tlp-cluster mounts as 
{{/var/lib/cassandra}}.
Let's start with a dataset of 100GB per node for our testing, which should be 
good enough for now.

The test suite needs to be adjusted to target the "real" cluster instead of a 
ccm one, and tlp-cluster provides environment variables with each node's public 
IP in the {{env.sh}} file ({{source env.sh}} sets the variables along with the 
other tlp-cluster aliases).
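
Something along these lines should work for picking contact points (a rough 
sketch; the environment variable name is a placeholder, not necessarily what 
env.sh actually exports):

{code:java}
// Rough sketch with a placeholder variable name: read contact points from the
// environment when running against the AWS cluster, fall back to ccm's
// loopback address otherwise.
public class ContactPoints
{
    public static String[] resolve()
    {
        // Placeholder name for illustration -- substitute whatever env.sh exports.
        String ips = System.getenv("CLUSTER_CONTACT_POINTS");
        return (ips == null || ips.isEmpty())
               ? new String[]{ "127.0.0.1" } // ccm default
               : ips.split(",");
    }
}
{code}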

Could you rename the branch you're working on to {{CASSANDRA-16245}}?

Let me know if you have what you need to move this forward.

> Implement repair quality test scenarios
> ---
>
> Key: CASSANDRA-16245
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16245
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/dtest/java
>Reporter: Alexander Dejanovski
>Assignee: Radovan Zvoncek
>Priority: Normal
> Fix For: 4.0-rc
>
>
> Implement the following test scenarios in a new test suite for repair 
> integration testing with significant load:
> Generate/restore a workload of ~100GB per node. Medusa should be considered 
> to create the initial backup which could then be restored from an S3 bucket 
> to speed up node population.
>  Data should on purpose require repair and be generated accordingly.
> Perform repairs for a 3 nodes cluster with 4 cores each and 16GB-32GB RAM 
> (m5d.xlarge instances would be the most cost efficient type).
>  Repaired keyspaces will use RF=3 or RF=2 in some cases (the latter is for 
> subranges with different sets of replicas).
> ||Mode||Version||Settings||Checks||
> |Full repair|trunk|Sequential + All token ranges|"No anticompaction 
> (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range"|
> |Full repair|trunk|Parallel + Primary range|"No anticompaction (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range"|
> |Full repair|trunk|Force terminate repair shortly after it was 
> triggered|Repair threads must be cleaned up|
> |Subrange repair|trunk|Sequential + single token range|"No anticompaction 
> (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range"|
> |Subrange repair|trunk|Parallel + 10 token ranges which have the same 
> replicas|"No anticompaction (repairedAt == 0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range
> A single repair session will handle all subranges at once"|
> |Subrange repair|trunk|Parallel + 10 token ranges which have different 
> replicas|"No anticompaction (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range
> More than one repair session is triggered to process all subranges"|
> |Subrange repair|trunk|"Single token range.
>  Force terminate repair shortly after it was triggered."|Repair threads must 
> be cleaned up|
> |Incremental repair|trunk|"Parallel (mandatory)
>  No compaction during repair"|"Anticompaction status (repairedAt != 0) on all 
> SSTables
>  No pending repair on SSTables after completion (could require to wait a bit 
> as this will happen asynchronously)
>  Out of sync ranges > 0 + Subsequent run must show no out of sync range"|
> |Incremental repair|trunk|"Parallel (mandatory)
>  Major compaction triggered during repair"|"Anticompaction status (repairedAt 
> != 0) on all SSTables
>  No pending repair on SSTables after completion (could require to wait a bit 
> as this will happen asynchronously)
>  Out of sync ranges > 0 + Subsequent run must show no out of sync range"|
> |Incremental repair|trunk|Force terminate repair shortly after it was 
> triggered.|Repair threads must be cleaned up|






[jira] [Commented] (CASSANDRA-15580) 4.0 quality testing: Repair

2020-11-18 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17234403#comment-17234403
 ] 

Alexander Dejanovski commented on CASSANDRA-15580:
--

Sounds good [~marcuse], thanks for the notice.

Work started on CASSANDRA-16245 to implement the new test suite.
If anyone's interested in picking up CASSANDRA-16244, it would be greatly 
appreciated!

> 4.0 quality testing: Repair
> ---
>
> Key: CASSANDRA-15580
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15580
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/dtest/python
>Reporter: Josh McKenzie
>Assignee: Alexander Dejanovski
>Priority: Normal
> Fix For: 4.0-rc
>
>
> Reference [doc from 
> NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#]
>  for context.
> *Shepherd: Alexander Dejanovski*
> We aim for 4.0 to have the first fully functioning incremental repair 
> solution (CASSANDRA-9143)! Furthermore we aim to verify that all types of 
> repair: (full range, sub range, incremental) function as expected as well as 
> ensuring community tools such as Reaper work. CASSANDRA-3200 adds an 
> experimental option to reduce the amount of data streamed during repair, we 
> should write more tests and see how it works with big nodes.






[jira] [Commented] (CASSANDRA-16245) Implement repair quality test scenarios

2020-11-18 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17234388#comment-17234388
 ] 

Alexander Dejanovski commented on CASSANDRA-16245:
--

Dev repo was created here for anyone interested: 
https://github.com/riptano/cassandra-rtest

> Implement repair quality test scenarios
> ---
>
> Key: CASSANDRA-16245
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16245
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/dtest/java
>Reporter: Alexander Dejanovski
>Assignee: Radovan Zvoncek
>Priority: Normal
> Fix For: 4.0-rc
>
>
> Implement the following test scenarios in a new test suite for repair 
> integration testing with significant load:
> Generate/restore a workload of ~100GB per node. Medusa should be considered 
> to create the initial backup which could then be restored from an S3 bucket 
> to speed up node population.
>  Data should on purpose require repair and be generated accordingly.
> Perform repairs for a 3 nodes cluster with 4 cores each and 16GB-32GB RAM 
> (m5d.xlarge instances would be the most cost efficient type).
>  Repaired keyspaces will use RF=3 or RF=2 in some cases (the latter is for 
> subranges with different sets of replicas).
> ||Mode||Version||Settings||Checks||
> |Full repair|trunk|Sequential + All token ranges|"No anticompaction 
> (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range"|
> |Full repair|trunk|Parallel + Primary range|"No anticompaction (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range"|
> |Full repair|trunk|Force terminate repair shortly after it was 
> triggered|Repair threads must be cleaned up|
> |Subrange repair|trunk|Sequential + single token range|"No anticompaction 
> (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range"|
> |Subrange repair|trunk|Parallel + 10 token ranges which have the same 
> replicas|"No anticompaction (repairedAt == 0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range
> A single repair session will handle all subranges at once"|
> |Subrange repair|trunk|Parallel + 10 token ranges which have different 
> replicas|"No anticompaction (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range
> More than one repair session is triggered to process all subranges"|
> |Subrange repair|trunk|"Single token range.
>  Force terminate repair shortly after it was triggered."|Repair threads must 
> be cleaned up|
> |Incremental repair|trunk|"Parallel (mandatory)
>  No compaction during repair"|"Anticompaction status (repairedAt != 0) on all 
> SSTables
>  No pending repair on SSTables after completion (could require to wait a bit 
> as this will happen asynchronously)
>  Out of sync ranges > 0 + Subsequent run must show no out of sync range"|
> |Incremental repair|trunk|"Parallel (mandatory)
>  Major compaction triggered during repair"|"Anticompaction status (repairedAt 
> != 0) on all SSTables
>  No pending repair on SSTables after completion (could require to wait a bit 
> as this will happen asynchronously)
>  Out of sync ranges > 0 + Subsequent run must show no out of sync range"|
> |Incremental repair|trunk|Force terminate repair shortly after it was 
> triggered.|Repair threads must be cleaned up|






[jira] [Commented] (CASSANDRA-16245) Implement repair quality test scenarios

2020-11-18 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17234384#comment-17234384
 ] 

Alexander Dejanovski commented on CASSANDRA-16245:
--

[~zvo], I'll write up the Gherkin files with the test scenarios so that you can 
implement the test steps.
As agreed, we can work in a separate repo for the initial development and 
integrate the code into the Cassandra repo once we have something to show.
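
To give an idea of the shape this could take, here's a rough Cucumber-JVM step 
definition sketch for one of the checks (class and helper names are made up 
for illustration, not actual cassandra-rtest code):

{code:java}
import static org.junit.Assert.assertTrue;

import io.cucumber.java.en.Then;
import io.cucumber.java.en.When;

// Made-up class and helper names, for illustration only.
public class RepairSteps
{
    private final RepairTestCluster cluster = new RepairTestCluster(); // hypothetical helper

    @When("I run a full repair in sequential mode on all token ranges")
    public void runFullRepair()
    {
        cluster.nodetool("repair", "--full", "--sequential");
    }

    @Then("no SSTable has been anticompacted")
    public void noAnticompaction()
    {
        // Full repair must leave repairedAt == 0 on every SSTable.
        assertTrue(cluster.sstableRepairedAtValues().stream().allMatch(t -> t == 0L));
    }
}
{code}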

 

> Implement repair quality test scenarios
> ---
>
> Key: CASSANDRA-16245
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16245
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/dtest/java
>Reporter: Alexander Dejanovski
>Assignee: Radovan Zvoncek
>Priority: Normal
> Fix For: 4.0-rc
>
>
> Implement the following test scenarios in a new test suite for repair 
> integration testing with significant load:
> Generate/restore a workload of ~100GB per node. Medusa should be considered 
> to create the initial backup which could then be restored from an S3 bucket 
> to speed up node population.
>  Data should on purpose require repair and be generated accordingly.
> Perform repairs for a 3 nodes cluster with 4 cores each and 16GB-32GB RAM 
> (m5d.xlarge instances would be the most cost efficient type).
>  Repaired keyspaces will use RF=3 or RF=2 in some cases (the latter is for 
> subranges with different sets of replicas).
> ||Mode||Version||Settings||Checks||
> |Full repair|trunk|Sequential + All token ranges|"No anticompaction 
> (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range"|
> |Full repair|trunk|Parallel + Primary range|"No anticompaction (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range"|
> |Full repair|trunk|Force terminate repair shortly after it was 
> triggered|Repair threads must be cleaned up|
> |Subrange repair|trunk|Sequential + single token range|"No anticompaction 
> (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range"|
> |Subrange repair|trunk|Parallel + 10 token ranges which have the same 
> replicas|"No anticompaction (repairedAt == 0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range
> A single repair session will handle all subranges at once"|
> |Subrange repair|trunk|Parallel + 10 token ranges which have different 
> replicas|"No anticompaction (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range
> More than one repair session is triggered to process all subranges"|
> |Subrange repair|trunk|"Single token range.
>  Force terminate repair shortly after it was triggered."|Repair threads must 
> be cleaned up|
> |Incremental repair|trunk|"Parallel (mandatory)
>  No compaction during repair"|"Anticompaction status (repairedAt != 0) on all 
> SSTables
>  No pending repair on SSTables after completion (could require to wait a bit 
> as this will happen asynchronously)
>  Out of sync ranges > 0 + Subsequent run must show no out of sync range"|
> |Incremental repair|trunk|"Parallel (mandatory)
>  Major compaction triggered during repair"|"Anticompaction status (repairedAt 
> != 0) on all SSTables
>  No pending repair on SSTables after completion (could require to wait a bit 
> as this will happen asynchronously)
>  Out of sync ranges > 0 + Subsequent run must show no out of sync range"|
> |Incremental repair|trunk|Force terminate repair shortly after it was 
> triggered.|Repair threads must be cleaned up|






[jira] [Commented] (CASSANDRA-15584) 4.0 quality testing: Tooling - External Ecosystem

2020-11-18 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17234355#comment-17234355
 ] 

Alexander Dejanovski commented on CASSANDRA-15584:
--

CASSANDRA-16280 was committed to fix the sstableloader issues in Cassandra and 
the Medusa PR fixing the tests was merged.
We're done with Medusa (table updated).

> 4.0 quality testing: Tooling - External Ecosystem
> -
>
> Key: CASSANDRA-15584
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15584
> Project: Cassandra
>  Issue Type: Task
>  Components: Tool/external
>Reporter: Josh McKenzie
>Assignee: Benjamin Lerer
>Priority: Normal
> Fix For: 4.0-rc
>
>
> Reference [doc from 
> NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#]
>  for context.
> *Shepherd: Benjamin Lerer*
> Many users of Apache Cassandra employ open source tooling to automate 
> Cassandra configuration, runtime management, and repair scheduling. Prior to 
> release, we need to confirm that popular third-party tools function properly. 
> Current list of tools:
> || Name || Status || Contact ||
> | [Priam|http://netflix.github.io/Priam/] |{color:#00875A} *DONE WITH 
> ALPHA*{color} (need to be tested with beta) | [~sumanth.pasupuleti]| 
> | [sstabletools|https://github.com/instaclustr/cassandra-sstable-tools] | 
> *NOT STARTED* | [~stefan.miklosovic]| 
> | [cassandra-exporter|https://github.com/instaclustr/cassandra-exporter]| 
> *NOT STARTED* | [~stefan.miklosovic]|
> | [Instaclustr Cassandra 
> operator|https://github.com/instaclustr/cassandra-operator]|  
> {color:#00875A}*DONE*{color} | [~stefan.miklosovic]|
> | [Instaclustr Esop | 
> https://github.com/instaclustr/instaclustr-esop]|{color:#00875A}*DONE*{color} 
> | [~stefan.miklosovic]|
> | [Instaclustr Icarus | 
> https://github.com/instaclustr/instaclustr-icarus]|{color:#00875A}*DONE*{color}
>  | [~stefan.miklosovic]|
> | [Cassandra SSTable generator | 
> https://github.com/instaclustr/cassandra-sstable-generator]|{color:#00875A}*DONE*{color}|
>  [~stefan.miklosovic]|
> | [Cassandra TTL Remover | https://github.com/instaclustr/TTLRemover] | 
> {color:#00875A}*DONE*{color} |  [~stefan.miklosovic]|
> | [Cassandra Everywhere Strategy | 
> https://github.com/instaclustr/cassandra-everywhere-strategy] | 
> {color:#00875A}*DONE*{color} | [~stefan.miklosovic]|
> | [Cassandra LDAP Authenticator | 
> https://github.com/instaclustr/cassandra-ldap] | {color:#00875A}*DONE*{color} 
> | [~stefan.miklosovic]|
> | [Instaclustr Minotaur | 
> https://github.com/instaclustr/instaclustr-minotaur] | 
> {color:#00875A}*DONE*{color} | [~stefan.miklosovic]|
> | [Reaper|http://cassandra-reaper.io/]| {color:#00875A}*AUTOMATIC*{color} | 
> [~adejanovski]|
> | [Medusa|https://github.com/thelastpickle/cassandra-medusa]|  
> {color:#00875A}*DONE*{color}| [~adejanovski]|
> | [Casskop|https://orange-opensource.github.io/casskop/]| *NOT STARTED*| 
> Franck Dehay|
> | 
> [spark-cassandra-connector|https://github.com/datastax/spark-cassandra-connector]|
>  {color:#00875A}*DONE*{color}| [~jtgrabowski]|
> | [cass operator|https://github.com/datastax/cass-operator]| 
> {color:#00875A}*DONE*{color}| [~jimdickinson]|
> | [metric 
> collector|https://github.com/datastax/metric-collector-for-apache-cassandra]| 
> {color:#00875A}*DONE*{color}| [~tjake]|
> | [management 
> API|https://github.com/datastax/management-api-for-apache-cassandra]| 
> {color:#00875A}*DONE*{color}| [~tjake]|  
> Columns descriptions:
> * *Name*: Name and link to the tool official page
> * *Status*: {{NOT STARTED}}, {{IN PROGRESS}}, {{BLOCKED}} if you hit any 
> issue and have to wait for it to be solved, {{DONE}}, {{AUTOMATIC}} if 
> testing 4.0 is part of your CI process.
> * *Contact*: The person acting as the contact point for that tool. 






[jira] [Updated] (CASSANDRA-15584) 4.0 quality testing: Tooling - External Ecosystem

2020-11-18 Thread Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-15584:
-
Description: 
Reference [doc from 
NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#]
 for context.

*Shepherd: Benjamin Lerer*

Many users of Apache Cassandra employ open source tooling to automate Cassandra 
configuration, runtime management, and repair scheduling. Prior to release, we 
need to confirm that popular third-party tools function properly. 

Current list of tools:
|| Name || Status || Contact ||
| [Priam|http://netflix.github.io/Priam/] |{color:#00875A} *DONE WITH 
ALPHA*{color} (need to be tested with beta) | [~sumanth.pasupuleti]| 
| [sstabletools|https://github.com/instaclustr/cassandra-sstable-tools] | *NOT 
STARTED* | [~stefan.miklosovic]| 
| [cassandra-exporter|https://github.com/instaclustr/cassandra-exporter]| *NOT 
STARTED* | [~stefan.miklosovic]|
| [Instaclustr Cassandra 
operator|https://github.com/instaclustr/cassandra-operator]|  
{color:#00875A}*DONE*{color} | [~stefan.miklosovic]|
| [Instaclustr Esop | 
https://github.com/instaclustr/instaclustr-esop]|{color:#00875A}*DONE*{color} | 
[~stefan.miklosovic]|
| [Instaclustr Icarus | 
https://github.com/instaclustr/instaclustr-icarus]|{color:#00875A}*DONE*{color} 
| [~stefan.miklosovic]|
| [Cassandra SSTable generator | 
https://github.com/instaclustr/cassandra-sstable-generator]|{color:#00875A}*DONE*{color}|
 [~stefan.miklosovic]|
| [Cassandra TTL Remover | https://github.com/instaclustr/TTLRemover] | 
{color:#00875A}*DONE*{color} |  [~stefan.miklosovic]|
| [Cassandra Everywhere Strategy | 
https://github.com/instaclustr/cassandra-everywhere-strategy] | 
{color:#00875A}*DONE*{color} | [~stefan.miklosovic]|
| [Cassandra LDAP Authenticator | 
https://github.com/instaclustr/cassandra-ldap] | {color:#00875A}*DONE*{color} | 
[~stefan.miklosovic]|
| [Instaclustr Minotaur | https://github.com/instaclustr/instaclustr-minotaur] 
| {color:#00875A}*DONE*{color} | [~stefan.miklosovic]|
| [Reaper|http://cassandra-reaper.io/]| {color:#00875A}*AUTOMATIC*{color} | 
[~adejanovski]|
| [Medusa|https://github.com/thelastpickle/cassandra-medusa]|  
{color:#00875A}*DONE*{color}| [~adejanovski]|
| [Casskop|https://orange-opensource.github.io/casskop/]| *NOT STARTED*| Franck 
Dehay|
| 
[spark-cassandra-connector|https://github.com/datastax/spark-cassandra-connector]|
 {color:#00875A}*DONE*{color}| [~jtgrabowski]|
| [cass operator|https://github.com/datastax/cass-operator]| 
{color:#00875A}*DONE*{color}| [~jimdickinson]|
| [metric 
collector|https://github.com/datastax/metric-collector-for-apache-cassandra]| 
{color:#00875A}*DONE*{color}| [~tjake]|
| [management 
API|https://github.com/datastax/management-api-for-apache-cassandra]| 
{color:#00875A}*DONE*{color}| [~tjake]|  

Columns descriptions:
* *Name*: Name and link to the tool official page
* *Status*: {{NOT STARTED}}, {{IN PROGRESS}}, {{BLOCKED}} if you hit any issue 
and have to wait for it to be solved, {{DONE}}, {{AUTOMATIC}} if testing 4.0 is 
part of your CI process.
* *Contact*: The person acting as the contact point for that tool. 

  was:
Reference [doc from 
NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#]
 for context.

*Shepherd: Benjamin Lerer*

Many users of Apache Cassandra employ open source tooling to automate Cassandra 
configuration, runtime management, and repair scheduling. Prior to release, we 
need to confirm that popular third-party tools function properly. 

Current list of tools:
|| Name || Status || Contact ||
| [Priam|http://netflix.github.io/Priam/] |{color:#00875A} *DONE WITH 
ALPHA*{color} (need to be tested with beta) | [~sumanth.pasupuleti]| 
| [sstabletools|https://github.com/instaclustr/cassandra-sstable-tools] | *NOT 
STARTED* | [~stefan.miklosovic]| 
| [cassandra-exporter|https://github.com/instaclustr/cassandra-exporter]| *NOT 
STARTED* | [~stefan.miklosovic]|
| [Instaclustr Cassandra 
operator|https://github.com/instaclustr/cassandra-operator]|  
{color:#00875A}*DONE*{color} | [~stefan.miklosovic]|
| [Instaclustr Esop | 
https://github.com/instaclustr/instaclustr-esop]|{color:#00875A}*DONE*{color} | 
[~stefan.miklosovic]|
| [Instaclustr Icarus | 
https://github.com/instaclustr/instaclustr-icarus]|{color:#00875A}*DONE*{color} 
| [~stefan.miklosovic]|
| [Cassandra SSTable generator | 
https://github.com/instaclustr/cassandra-sstable-generator]|{color:#00875A}*DONE*{color}|
 [~stefan.miklosovic]|
| [Cassandra TTL Remover | https://github.com/instaclustr/TTLRemover] | 
{color:#00875A}*DONE*{color} |  [~stefan.miklosovic]|
| [Cassandra Everywhere Strategy | 
https://github.com/instaclustr/cassandra-everywhere-strategy] | 
{color:#00875A}*DONE*{color} | [~stefan.miklosovic]|
| [Cassandra LDAP Authenticator | 
https://github.com/instaclustr/cassandra-ldap] | 

[jira] [Commented] (CASSANDRA-15584) 4.0 quality testing: Tooling - External Ecosystem

2020-11-17 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233676#comment-17233676
 ] 

Alexander Dejanovski commented on CASSANDRA-15584:
--

I found two problems when investigating the failing tests with Medusa:
 * pre-4.0 Cassandra was apparently more permissive with missing ciphers when 
sstableloader was invoked. I've added the path to cassandra.yaml in the 
sstableloader call, which fixed the issue (PR pending merge).
 * [CASSANDRA-16144|https://issues.apache.org/jira/browse/CASSANDRA-16144] 
recently introduced a bug in parsing loader options with encryption args. I've 
created [CASSANDRA-16280|https://issues.apache.org/jira/browse/CASSANDRA-16280] 
to track the issue along with a fix.

I'll update the table in this ticket's description once trunk is fixed and the 
Medusa PR gets merged.

> 4.0 quality testing: Tooling - External Ecosystem
> -
>
> Key: CASSANDRA-15584
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15584
> Project: Cassandra
>  Issue Type: Task
>  Components: Tool/external
>Reporter: Josh McKenzie
>Assignee: Benjamin Lerer
>Priority: Normal
> Fix For: 4.0-rc
>
>
> Reference [doc from 
> NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#]
>  for context.
> *Shepherd: Benjamin Lerer*
> Many users of Apache Cassandra employ open source tooling to automate 
> Cassandra configuration, runtime management, and repair scheduling. Prior to 
> release, we need to confirm that popular third-party tools function properly. 
> Current list of tools:
> || Name || Status || Contact ||
> | [Priam|http://netflix.github.io/Priam/] |{color:#00875A} *DONE WITH 
> ALPHA*{color} (need to be tested with beta) | [~sumanth.pasupuleti]| 
> | [sstabletools|https://github.com/instaclustr/cassandra-sstable-tools] | 
> *NOT STARTED* | [~stefan.miklosovic]| 
> | [cassandra-exporter|https://github.com/instaclustr/cassandra-exporter]| 
> *NOT STARTED* | [~stefan.miklosovic]|
> | [Instaclustr Cassandra 
> operator|https://github.com/instaclustr/cassandra-operator]|  
> {color:#00875A}*DONE*{color} | [~stefan.miklosovic]|
> | [Instaclustr Esop | 
> https://github.com/instaclustr/instaclustr-esop]|{color:#00875A}*DONE*{color} 
> | [~stefan.miklosovic]|
> | [Instaclustr Icarus | 
> https://github.com/instaclustr/instaclustr-icarus]|{color:#00875A}*DONE*{color}
>  | [~stefan.miklosovic]|
> | [Cassandra SSTable generator | 
> https://github.com/instaclustr/cassandra-sstable-generator]|{color:#00875A}*DONE*{color}|
>  [~stefan.miklosovic]|
> | [Cassandra TTL Remover | https://github.com/instaclustr/TTLRemover] | 
> {color:#00875A}*DONE*{color} |  [~stefan.miklosovic]|
> | [Cassandra Everywhere Strategy | 
> https://github.com/instaclustr/cassandra-everywhere-strategy] | 
> {color:#00875A}*DONE*{color} | [~stefan.miklosovic]|
> | [Cassandra LDAP Authenticator | 
> https://github.com/instaclustr/cassandra-ldap] | {color:#00875A}*DONE*{color} 
> | [~stefan.miklosovic]|
> | [Instaclustr Minotaur | 
> https://github.com/instaclustr/instaclustr-minotaur] | 
> {color:#00875A}*DONE*{color} | [~stefan.miklosovic]|
> | [Reaper|http://cassandra-reaper.io/]| {color:#00875A}*AUTOMATIC*{color} | 
> [~adejanovski]|
> | [Medusa|https://github.com/thelastpickle/cassandra-medusa]| *IN PROGRESS*| 
> [~adejanovski]|
> | [Casskop|https://orange-opensource.github.io/casskop/]| *NOT STARTED*| 
> Franck Dehay|
> | 
> [spark-cassandra-connector|https://github.com/datastax/spark-cassandra-connector]|
>  {color:#00875A}*DONE*{color}| [~jtgrabowski]|
> | [cass operator|https://github.com/datastax/cass-operator]| 
> {color:#00875A}*DONE*{color}| [~jimdickinson]|
> | [metric 
> collector|https://github.com/datastax/metric-collector-for-apache-cassandra]| 
> {color:#00875A}*DONE*{color}| [~tjake]|
> | [management 
> API|https://github.com/datastax/management-api-for-apache-cassandra]| 
> {color:#00875A}*DONE*{color}| [~tjake]|  
> Columns descriptions:
> * *Name*: Name and link to the tool official page
> * *Status*: {{NOT STARTED}}, {{IN PROGRESS}}, {{BLOCKED}} if you hit any 
> issue and have to wait for it to be solved, {{DONE}}, {{AUTOMATIC}} if 
> testing 4.0 is part of your CI process.
> * *Contact*: The person acting as the contact point for that tool. 






[jira] [Updated] (CASSANDRA-16280) SSTableLoader will fail if encryption parameters are used due to CASSANDRA-16144

2020-11-17 Thread Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-16280:
-
Test and Documentation Plan: 
Regression test added under {{LoaderOptionsTest.testEncryptionSettings}}, 
invoking {{LoaderOptions.builder().parseArgs()}} with all the encryption 
options. 
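
For reference, a rough sketch of the shape of that test (flag names and the 
assertion are illustrative approximations, not the committed test code):

{code:java}
import static org.junit.Assert.assertEquals;

import java.nio.file.Files;

import org.apache.cassandra.tools.LoaderOptions;
import org.junit.Test;

public class LoaderOptionsEncryptionSketchTest
{
    // Illustrative approximation only -- not the committed test. The point is
    // that parsing a command line carrying client encryption options no longer
    // fails with "EncryptionOptions cannot be changed after configuration applied".
    @Test
    public void testEncryptionSettings() throws Exception
    {
        String sstableDir = Files.createTempDirectory("sstables").toString();
        String[] args = {
            "-d", "127.0.0.1",
            "-ts", "truststore.jks", "-tspw", "truststorePass1",
            "-ks", "keystore.jks", "-kspw", "keystorePass1",
            sstableDir
        };
        LoaderOptions options = LoaderOptions.builder().parseArgs(args).build();
        assertEquals("truststore.jks", options.clientEncOptions.truststore);
    }
}
{code}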

Failure with the current trunk:

{code:java}
test:
 [echo] Number of test runners: 3
[mkdir] Created dir: 
/Users/adejanovski/projets/cassandra/thelastpickle/cassandra/build/test/cassandra
[mkdir] Created dir: 
/Users/adejanovski/projets/cassandra/thelastpickle/cassandra/build/test/output
[junit-timeout] Testsuite: org.apache.cassandra.tools.LoaderOptionsTest
[junit-timeout] Testsuite: org.apache.cassandra.tools.LoaderOptionsTest Tests 
run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0,824 sec
[junit-timeout]
[junit-timeout] Testcase: 
testEncryptionSettings(org.apache.cassandra.tools.LoaderOptionsTest): Caused an 
ERROR
[junit-timeout] EncryptionOptions cannot be changed after configuration applied
[junit-timeout] java.lang.IllegalStateException: EncryptionOptions cannot be 
changed after configuration applied
[junit-timeout] at 
org.apache.cassandra.config.EncryptionOptions.ensureConfigNotApplied(EncryptionOptions.java:162)
[junit-timeout] at 
org.apache.cassandra.config.EncryptionOptions.applyConfig(EncryptionOptions.java:130)
[junit-timeout] at 
org.apache.cassandra.tools.LoaderOptions$Builder.parseArgs(LoaderOptions.java:478)
[junit-timeout] at 
org.apache.cassandra.tools.LoaderOptionsTest.testEncryptionSettings(LoaderOptionsTest.java:55)
[junit-timeout]
[junit-timeout]
[junit-timeout] Test org.apache.cassandra.tools.LoaderOptionsTest FAILED
{code}

The test passes with the patch:

{code:java}
test:
 [echo] Number of test runners: 3
[junit-timeout] Testsuite: org.apache.cassandra.tools.LoaderOptionsTest
[junit-timeout] Testsuite: org.apache.cassandra.tools.LoaderOptionsTest Tests 
run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0,5 sec

BUILD SUCCESSFUL
{code}


 Status: Patch Available  (was: In Progress)

Here's the patch:
 * [branch|https://github.com/thelastpickle/cassandra/tree/CASSANDRA-16280]
 * 
[commit|https://github.com/thelastpickle/cassandra/commit/dbce40a06d89c415cbe172e4726b6c4bb38fe4c9]

I'm waiting for the build to go through in CircleCI.

> SSTableLoader will fail if encryption parameters are used due to 
> CASSANDRA-16144
> 
>
> Key: CASSANDRA-16280
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16280
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tool/bulk load
>Reporter: Alexander Dejanovski
>Assignee: Alexander Dejanovski
>Priority: Normal
> Fix For: 4.0-beta
>
>
> CASSANDRA-16144 recently introduced [repeated calls 
> |https://github.com/apache/cassandra/compare/trunk...dcapwell:commit_remote_branch/CASSANDRA-16144-trunk-209E2350-3A50-457E-A466-F2661CD0D4D1#diff-b87acacbdc34464d327446f7a7e64718dbf843d70f5fbc9e5ddcd1bafca0f441R478]to
>  _clientEncOptions.applyConfig()_ for each encryption parameter passed to the 
> sstableloader command line.
> This consistently fails because _applyConfig()_ can be called only once due 
> to the _ensureConfigNotApplied()_ check at the beginning of the method.
> This call is not necessary since the _with...()_ methods will invoke 
> _applyConfig()_ each time:
> {code:java}
> public EncryptionOptions withTrustStore(String truststore)
> {
> return new EncryptionOptions(keystore, keystore_password, truststore, 
> truststore_password, cipher_suites,
> protocol, algorithm, store_type, 
> require_client_auth, require_endpoint_verification,
> enabled, optional).applyConfig();
> }
> {code}
> I'll build a patch for this with the appropriate unit test.






[jira] [Updated] (CASSANDRA-16280) SSTableLoader will fail if encryption parameters are used due to CASSANDRA-16144

2020-11-17 Thread Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-16280:
-
 Bug Category: Parent values: Availability(12983)Level 1 values: Process 
Crash(12992)
   Complexity: Normal
Discovered By: User Report
 Severity: Critical
   Status: Open  (was: Triage Needed)

> SSTableLoader will fail if encryption parameters are used due to 
> CASSANDRA-16144
> 
>
> Key: CASSANDRA-16280
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16280
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tool/bulk load
>Reporter: Alexander Dejanovski
>Assignee: Alexander Dejanovski
>Priority: Normal
> Fix For: 4.0-beta
>
>
> CASSANDRA-16144 recently introduced [repeated calls 
> |https://github.com/apache/cassandra/compare/trunk...dcapwell:commit_remote_branch/CASSANDRA-16144-trunk-209E2350-3A50-457E-A466-F2661CD0D4D1#diff-b87acacbdc34464d327446f7a7e64718dbf843d70f5fbc9e5ddcd1bafca0f441R478]to
>  _clientEncOptions.applyConfig()_ for each encryption parameter passed to the 
> sstableloader command line.
> This consistently fails because _applyConfig()_ can be called only once due 
> to the _ensureConfigNotApplied()_ check at the beginning of the method.
> This call is not necessary since the _with...()_ methods will invoke 
> _applyConfig()_ each time:
> {code:java}
> public EncryptionOptions withTrustStore(String truststore)
> {
> return new EncryptionOptions(keystore, keystore_password, truststore, 
> truststore_password, cipher_suites,
> protocol, algorithm, store_type, 
> require_client_auth, require_endpoint_verification,
> enabled, optional).applyConfig();
> }
> {code}
> I'll build a patch for this with the appropriate unit test.






[jira] [Updated] (CASSANDRA-16280) SSTableLoader will fail if encryption parameters are used due to CASSANDRA-16144

2020-11-17 Thread Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-16280:
-
Fix Version/s: 4.0-beta

> SSTableLoader will fail if encryption parameters are used due to 
> CASSANDRA-16144
> 
>
> Key: CASSANDRA-16280
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16280
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tool/bulk load
>Reporter: Alexander Dejanovski
>Assignee: Alexander Dejanovski
>Priority: Normal
> Fix For: 4.0-beta
>
>
> CASSANDRA-16144 recently introduced [repeated calls 
> |https://github.com/apache/cassandra/compare/trunk...dcapwell:commit_remote_branch/CASSANDRA-16144-trunk-209E2350-3A50-457E-A466-F2661CD0D4D1#diff-b87acacbdc34464d327446f7a7e64718dbf843d70f5fbc9e5ddcd1bafca0f441R478]to
>  _clientEncOptions.applyConfig()_ for each encryption parameter passed to the 
> sstableloader command line.
> This consistently fails because _applyConfig()_ can be called only once due 
> to the _ensureConfigNotApplied()_ check at the beginning of the method.
> This call is not necessary since the _with...()_ methods will invoke 
> _applyConfig()_ each time:
> {code:java}
> public EncryptionOptions withTrustStore(String truststore)
> {
> return new EncryptionOptions(keystore, keystore_password, truststore, 
> truststore_password, cipher_suites,
> protocol, algorithm, store_type, 
> require_client_auth, require_endpoint_verification,
> enabled, optional).applyConfig();
> }
> {code}
> I'll build a patch for this with the appropriate unit test.






[jira] [Updated] (CASSANDRA-16280) SSTableLoader will fail if encryption parameters are used due to CASSANDRA-16144

2020-11-17 Thread Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-16280:
-
Component/s: Tool/bulk load

> SSTableLoader will fail if encryption parameters are used due to 
> CASSANDRA-16144
> 
>
> Key: CASSANDRA-16280
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16280
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tool/bulk load
>Reporter: Alexander Dejanovski
>Assignee: Alexander Dejanovski
>Priority: Normal
>
> CASSANDRA-16144 recently introduced [repeated calls 
> |https://github.com/apache/cassandra/compare/trunk...dcapwell:commit_remote_branch/CASSANDRA-16144-trunk-209E2350-3A50-457E-A466-F2661CD0D4D1#diff-b87acacbdc34464d327446f7a7e64718dbf843d70f5fbc9e5ddcd1bafca0f441R478]to
>  _clientEncOptions.applyConfig()_ for each encryption parameter passed to the 
> sstableloader command line.
> This consistently fails because _applyConfig()_ can be called only once due 
> to the _ensureConfigNotApplied()_ check at the beginning of the method.
> This call is not necessary since the _with...()_ methods will invoke 
> _applyConfig()_ each time:
> {code:java}
> public EncryptionOptions withTrustStore(String truststore)
> {
> return new EncryptionOptions(keystore, keystore_password, truststore, 
> truststore_password, cipher_suites,
> protocol, algorithm, store_type, 
> require_client_auth, require_endpoint_verification,
> enabled, optional).applyConfig();
> }
> {code}
> I'll build a patch for this with the appropriate unit test.






[jira] [Created] (CASSANDRA-16280) SSTableLoader will fail if encryption parameters are used due to CASSANDRA-16144

2020-11-17 Thread Alexander Dejanovski (Jira)
Alexander Dejanovski created CASSANDRA-16280:


 Summary: SSTableLoader will fail if encryption parameters are used 
due to CASSANDRA-16144
 Key: CASSANDRA-16280
 URL: https://issues.apache.org/jira/browse/CASSANDRA-16280
 Project: Cassandra
  Issue Type: Bug
Reporter: Alexander Dejanovski
Assignee: Alexander Dejanovski


CASSANDRA-16144 recently introduced [repeated calls 
|https://github.com/apache/cassandra/compare/trunk...dcapwell:commit_remote_branch/CASSANDRA-16144-trunk-209E2350-3A50-457E-A466-F2661CD0D4D1#diff-b87acacbdc34464d327446f7a7e64718dbf843d70f5fbc9e5ddcd1bafca0f441R478]to
 _clientEncOptions.applyConfig()_ for each encryption parameter passed to the 
sstableloader command line.
This consistently fails because _applyConfig()_ can be called only once due to 
the _ensureConfigNotApplied()_ check at the beginning of the method.

This call is not necessary since the _with...()_ methods will invoke 
_applyConfig()_ each time:
{code:java}
public EncryptionOptions withTrustStore(String truststore)
{
return new EncryptionOptions(keystore, keystore_password, truststore, 
truststore_password, cipher_suites,
protocol, algorithm, store_type, 
require_client_auth, require_endpoint_verification,
enabled, optional).applyConfig();
}
{code}
I'll build a patch for this with the appropriate unit test.






[jira] [Assigned] (CASSANDRA-16245) Implement repair quality test scenarios

2020-11-05 Thread Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski reassigned CASSANDRA-16245:


Assignee: Radovan Zvoncek

> Implement repair quality test scenarios
> ---
>
> Key: CASSANDRA-16245
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16245
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/dtest/java
>Reporter: Alexander Dejanovski
>Assignee: Radovan Zvoncek
>Priority: Normal
> Fix For: 4.0-rc
>
>
> Implement the following test scenarios in a new test suite for repair 
> integration testing with significant load:
> Generate/restore a workload of ~100GB per node. Medusa should be considered 
> to create the initial backup which could then be restored from an S3 bucket 
> to speed up node population.
>  Data should on purpose require repair and be generated accordingly.
> Perform repairs for a 3 nodes cluster with 4 cores each and 16GB-32GB RAM 
> (m5d.xlarge instances would be the most cost efficient type).
>  Repaired keyspaces will use RF=3 or RF=2 in some cases (the latter is for 
> subranges with different sets of replicas).
> ||Mode||Version||Settings||Checks||
> |Full repair|trunk|Sequential + All token ranges|"No anticompaction 
> (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range"|
> |Full repair|trunk|Parallel + Primary range|"No anticompaction (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range"|
> |Full repair|trunk|Force terminate repair shortly after it was 
> triggered|Repair threads must be cleaned up|
> |Subrange repair|trunk|Sequential + single token range|"No anticompaction 
> (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range"|
> |Subrange repair|trunk|Parallel + 10 token ranges which have the same 
> replicas|"No anticompaction (repairedAt == 0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range
> A single repair session will handle all subranges at once"|
> |Subrange repair|trunk|Parallel + 10 token ranges which have different 
> replicas|"No anticompaction (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range
> More than one repair session is triggered to process all subranges"|
> |Subrange repair|trunk|"Single token range.
>  Force terminate repair shortly after it was triggered."|Repair threads must 
> be cleaned up|
> |Incremental repair|trunk|"Parallel (mandatory)
>  No compaction during repair"|"Anticompaction status (repairedAt != 0) on all 
> SSTables
>  No pending repair on SSTables after completion (could require to wait a bit 
> as this will happen asynchronously)
>  Out of sync ranges > 0 + Subsequent run must show no out of sync range"|
> |Incremental repair|trunk|"Parallel (mandatory)
>  Major compaction triggered during repair"|"Anticompaction status (repairedAt 
> != 0) on all SSTables
>  No pending repair on SSTables after completion (could require to wait a bit 
> as this will happen asynchronously)
>  Out of sync ranges > 0 + Subsequent run must show no out of sync range"|
> |Incremental repair|trunk|Force terminate repair shortly after it was 
> triggered.|Repair threads must be cleaned up|






[jira] [Updated] (CASSANDRA-16245) Implement repair quality test scenarios

2020-11-05 Thread Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-16245:
-
Fix Version/s: 4.0-rc

> Implement repair quality test scenarios
> ---
>
> Key: CASSANDRA-16245
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16245
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/dtest/java
>Reporter: Alexander Dejanovski
>Priority: Normal
> Fix For: 4.0-rc
>
>
> Implement the following test scenarios in a new test suite for repair 
> integration testing with significant load:
> Generate/restore a workload of ~100GB per node. Medusa should be considered 
> to create the initial backup which could then be restored from an S3 bucket 
> to speed up node population.
>  Data should on purpose require repair and be generated accordingly.
> Perform repairs for a 3 nodes cluster with 4 cores each and 16GB-32GB RAM 
> (m5d.xlarge instances would be the most cost efficient type).
>  Repaired keyspaces will use RF=3 or RF=2 in some cases (the latter is for 
> subranges with different sets of replicas).
> ||Mode||Version||Settings||Checks||
> |Full repair|trunk|Sequential + All token ranges|"No anticompaction 
> (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range"|
> |Full repair|trunk|Parallel + Primary range|"No anticompaction (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range"|
> |Full repair|trunk|Force terminate repair shortly after it was 
> triggered|Repair threads must be cleaned up|
> |Subrange repair|trunk|Sequential + single token range|"No anticompaction 
> (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range"|
> |Subrange repair|trunk|Parallel + 10 token ranges which have the same 
> replicas|"No anticompaction (repairedAt == 0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range
> A single repair session will handle all subranges at once"|
> |Subrange repair|trunk|Parallel + 10 token ranges which have different 
> replicas|"No anticompaction (repairedAt==0)
>  Out of sync ranges > 0
>  Subsequent run must show no out of sync range
> More than one repair session is triggered to process all subranges"|
> |Subrange repair|trunk|"Single token range.
>  Force terminate repair shortly after it was triggered."|Repair threads must 
> be cleaned up|
> |Incremental repair|trunk|"Parallel (mandatory)
>  No compaction during repair"|"Anticompaction status (repairedAt != 0) on all 
> SSTables
>  No pending repair on SSTables after completion (could require to wait a bit 
> as this will happen asynchronously)
>  Out of sync ranges > 0 + Subsequent run must show no out of sync range"|
> |Incremental repair|trunk|"Parallel (mandatory)
>  Major compaction triggered during repair"|"Anticompaction status (repairedAt 
> != 0) on all SSTables
>  No pending repair on SSTables after completion (could require to wait a bit 
> as this will happen asynchronously)
>  Out of sync ranges > 0 + Subsequent run must show no out of sync range"|
> |Incremental repair|trunk|Force terminate repair shortly after it was 
> triggered.|Repair threads must be cleaned up|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (CASSANDRA-16245) Implement repair quality test scenarios

2020-11-05 Thread Alexander Dejanovski (Jira)
Alexander Dejanovski created CASSANDRA-16245:


 Summary: Implement repair quality test scenarios
 Key: CASSANDRA-16245
 URL: https://issues.apache.org/jira/browse/CASSANDRA-16245
 Project: Cassandra
  Issue Type: Task
  Components: Test/dtest/java
Reporter: Alexander Dejanovski


Implement the following test scenarios in a new test suite for repair 
integration testing with significant load:

Generate/restore a workload of ~100GB per node. Medusa should be considered to 
create the initial backup which could then be restored from an S3 bucket to 
speed up node population.
 Data should on purpose require repair and be generated accordingly.

Perform repairs for a 3 nodes cluster with 4 cores each and 16GB-32GB RAM 
(m5d.xlarge instances would be the most cost efficient type).
 Repaired keyspaces will use RF=3 or RF=2 in some cases (the latter is for 
subranges with different sets of replicas).
||Mode||Version||Settings||Checks||
|Full repair|trunk|Sequential + All token ranges|"No anticompaction 
(repairedAt==0)
 Out of sync ranges > 0
 Subsequent run must show no out of sync range"|
|Full repair|trunk|Parallel + Primary range|"No anticompaction (repairedAt==0)
 Out of sync ranges > 0
 Subsequent run must show no out of sync range"|
|Full repair|trunk|Force terminate repair shortly after it was triggered|Repair 
threads must be cleaned up|
|Subrange repair|trunk|Sequential + single token range|"No anticompaction 
(repairedAt==0)
 Out of sync ranges > 0
 Subsequent run must show no out of sync range"|
|Subrange repair|trunk|Parallel + 10 token ranges which have the same 
replicas|"No anticompaction (repairedAt == 0)
 Out of sync ranges > 0
 Subsequent run must show no out of sync range
A single repair session will handle all subranges at once"|
|Subrange repair|trunk|Parallel + 10 token ranges which have different 
replicas|"No anticompaction (repairedAt==0)
 Out of sync ranges > 0
 Subsequent run must show no out of sync range
More than one repair session is triggered to process all subranges"|
|Subrange repair|trunk|"Single token range.
 Force terminate repair shortly after it was triggered."|Repair threads must be 
cleaned up|
|Incremental repair|trunk|"Parallel (mandatory)
 No compaction during repair"|"Anticompaction status (repairedAt != 0) on all 
SSTables
 No pending repair on SSTables after completion (could require to wait a bit as 
this will happen asynchronously)
 Out of sync ranges > 0 + Subsequent run must show no out of sync range"|
|Incremental repair|trunk|"Parallel (mandatory)
 Major compaction triggered during repair"|"Anticompaction status (repairedAt 
!= 0) on all SSTables
 No pending repair on SSTables after completion (could require to wait a bit as 
this will happen asynchronously)
 Out of sync ranges > 0 + Subsequent run must show no out of sync range"|
|Incremental repair|trunk|Force terminate repair shortly after it was 
triggered.|Repair threads must be cleaned up|
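
Several of the checks above boil down to inspecting the repairedAt value of each
SSTable. As a rough sketch only (the class name and the exec-based approach are
illustrative assumptions, not code from the test suite), a check could shell out
to the sstablemetadata tool shipped with Cassandra and parse its "Repaired at:"
line:

{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Illustrative helper only; names and approach are assumptions for this sketch.
public class RepairedAtCheck
{
    // Returns true when sstablemetadata reports "Repaired at: 0",
    // i.e. the SSTable was not anticompacted by an incremental repair.
    static boolean isUnrepaired(String sstablePath) throws Exception
    {
        Process process = new ProcessBuilder("sstablemetadata", sstablePath)
                              .redirectErrorStream(true)
                              .start();
        long repairedAt = -1;
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(process.getInputStream())))
        {
            String line;
            while ((line = reader.readLine()) != null)
            {
                if (line.trim().startsWith("Repaired at:"))
                {
                    // keep only the numeric value, ignoring any trailing decoration
                    String value = line.substring(line.indexOf(':') + 1).trim().split("\\s+")[0];
                    repairedAt = Long.parseLong(value);
                }
            }
        }
        process.waitFor();
        return repairedAt == 0L;
    }
}
{code}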



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16244) Create a jvm upgrade dtest for mixed versions repairs

2020-11-05 Thread Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-16244:
-
Fix Version/s: 4.0-rc

> Create a jvm upgrade dtest for mixed versions repairs
> -
>
> Key: CASSANDRA-16244
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16244
> Project: Cassandra
>  Issue Type: Task
>Reporter: Alexander Dejanovski
>Priority: Normal
> Fix For: 4.0-rc
>
>
> Repair during upgrades should fail on mixed version clusters.
> We'd need an in-jvm upgrade dtest to check that repair indeed fails as 
> expected with mixed current version+previous major version clusters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (CASSANDRA-16244) Create a jvm upgrade dtest for mixed versions repairs

2020-11-02 Thread Alexander Dejanovski (Jira)
Alexander Dejanovski created CASSANDRA-16244:


 Summary: Create a jvm upgrade dtest for mixed versions repairs
 Key: CASSANDRA-16244
 URL: https://issues.apache.org/jira/browse/CASSANDRA-16244
 Project: Cassandra
  Issue Type: Task
Reporter: Alexander Dejanovski


Repair during upgrades should fail on mixed version clusters.
We'd need an in-jvm upgrade dtest to check that repair indeed fails as expected 
with mixed current version+previous major version clusters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15580) 4.0 quality testing: Repair

2020-11-02 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17224834#comment-17224834
 ] 

Alexander Dejanovski commented on CASSANDRA-15580:
--

Thanks for the feedback [~marcuse] and [~jmckenzie]!
Good point about using in-jvm upgrade dtests for testing mixed version repairs 
(y). I'll create a subticket for this.
I'll add the flag from CASSANDRA-3200 to the test plan for sure, but I need to 
think a little bit about what to test precisely in this scenario.

We'd need to figure out the following things:
* How/where do we provision the nodes?
AFAIK we don't have a test suite such as the one we're planning to build here, 
which will require spinning up actual clusters on external VMs (no ccm).
Spinning up AWS instances is a low-friction path with a tool such as 
tlp-cluster (we'll need a sponsor for hosting the instances).
k8s is probably further down the path, but it could be good to have the 
community operator before we use it.
Are there any other obvious tools/ways to spin up multi-instance clusters?
* Which testing framework should we use?
I personally like Gherkin syntax based frameworks such as Cucumber (see the 
sketch after this list), but we'd need to get a feel for the community's 
appetite for introducing such a framework.
Otherwise we'd probably fall back to JUnit, but my take is that while it's 
really good for unit tests, it's not a great fit for integration tests.
Any input/opinion on the testing framework is appreciated.
* Where should that test suite live?
It would be good to store it directly in the Cassandra repo, but we could keep 
it in a side project as was done for dtests.
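
To make the Gherkin option more concrete, here is a minimal sketch of what
Cucumber-JVM step definitions could look like for one of the full repair
scenarios. Only the io.cucumber annotations and JUnit assertion are real; the
ClusterUnderTest harness, its methods, and the wiring are assumptions for the
example:

{code:java}
import io.cucumber.java.en.Given;
import io.cucumber.java.en.Then;
import io.cucumber.java.en.When;

import static org.junit.Assert.assertTrue;

public class FullRepairStepDefinitions
{
    /**
     * Hypothetical harness, not an existing API: a real implementation would wrap
     * provisioning, nodetool invocations and SSTable metadata inspection.
     */
    public interface ClusterUnderTest
    {
        void restoreBackup(int gigabytesPerNode);
        void nodetool(int node, String... args);
        java.util.List<String> sstablesWithNonZeroRepairedAt();
    }

    private final ClusterUnderTest cluster;

    // Assumed wiring: Cucumber's dependency injection (e.g. picocontainer) supplies this.
    public FullRepairStepDefinitions(ClusterUnderTest cluster)
    {
        this.cluster = cluster;
    }

    @Given("a 3 node cluster restored with {int}GB of data per node")
    public void restoreData(int gigabytesPerNode)
    {
        cluster.restoreBackup(gigabytesPerNode); // e.g. a Medusa restore from S3
    }

    @When("I run a sequential full repair on all token ranges")
    public void runFullRepair()
    {
        cluster.nodetool(1, "repair", "--full", "--sequential");
    }

    @Then("no SSTable is marked as repaired")
    public void assertNoAnticompaction()
    {
        // Full repair must not anticompact: repairedAt should stay at 0 on every SSTable.
        assertTrue(cluster.sstablesWithNonZeroRepairedAt().isEmpty());
    }
}
{code}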



> 4.0 quality testing: Repair
> ---
>
> Key: CASSANDRA-15580
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15580
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/dtest/python
>Reporter: Josh McKenzie
>Assignee: Alexander Dejanovski
>Priority: Normal
> Fix For: 4.0-rc
>
>
> Reference [doc from 
> NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#]
>  for context.
> *Shepherd: Alexander Dejanovski*
> We aim for 4.0 to have the first fully functioning incremental repair 
> solution (CASSANDRA-9143)! Furthermore we aim to verify that all types of 
> repair: (full range, sub range, incremental) function as expected as well as 
> ensuring community tools such as Reaper work. CASSANDRA-3200 adds an 
> experimental option to reduce the amount of data streamed during repair, we 
> should write more tests and see how it works with big nodes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16161) Validation Compactions causing Java GC pressure

2020-11-02 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17224671#comment-17224671
 ] 

Alexander Dejanovski commented on CASSANDRA-16161:
--

My initial preference while reading this ticket was to use different throttles 
as well. As mentioned by [~mck], validation compactions put a slightly 
different type of pressure on nodes and folks might want to unthrottle 
validation compactions (current behavior) and keep a throttle on compactions.
On the other hand, disks are now much faster than they were when compaction 
throttling was introduced, and heap pressure is mostly what we're protecting 
clusters from, I guess. At TLP we used to consider that beyond 45MB/s Cassandra 
couldn't keep up in sustained compaction throughput anyway because of heap 
pressure (it would only allow bursts). This will indeed change in 4.0 with all 
the improvements made to lower compaction heap pressure.
I also think that repair should be throttled in order to lighten its impact on 
clusters and force folks to investigate if it's not going fast enough, rather 
than harm clusters with defaults tuned to make it go fast.
Also, adding a new hidden configuration setting just for 3.0/3.x this close to 
4.0 going GA doesn't seem like the best thing to do.

TL;DR: +1 on using {{compaction_throughput_mb_per_sec}} to throttle validation 
compactions as well as standard compactions.
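
Assuming the patch indeed reuses the existing knob, the operational side would 
look like this (values are only examples; 0 keeps throttling disabled):

{noformat}
# cassandra.yaml: shared throttle, which would then also cover validation compactions
compaction_throughput_mb_per_sec: 45

# adjust at runtime without a restart (value in MB/s)
nodetool setcompactionthroughput 45
nodetool getcompactionthroughput
{noformat}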

> Validation Compactions causing Java GC pressure
> ---
>
> Key: CASSANDRA-16161
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16161
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Compaction, Local/Config, Tool/nodetool
>Reporter: Cameron Zemek
>Assignee: Stefan Miklosovic
>Priority: Normal
> Fix For: 3.11.x, 3.11.8
>
> Attachments: 16161.patch
>
>
> Validation Compactions are not rate limited which can cause Java GC pressure 
> and result in spikes in latency.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair

2020-10-28 Thread Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-15902:
-
Status: Ready to Commit  (was: Review In Progress)

> OOM because repair session thread not closed when terminating repair
> 
>
> Key: CASSANDRA-15902
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15902
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Swen Fuhrmann
>Assignee: Swen Fuhrmann
>Priority: Normal
> Fix For: 3.0.x, 3.11.x
>
> Attachments: heap-mem-histo.txt, repair-terminated.txt
>
>
> In our cluster, after a while some nodes slowly run out of memory. On 
> those nodes we observed that Cassandra Reaper terminates repairs with a JMX 
> call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} because 
> they reach the 30 min timeout.
> In the heap dump we see a lot of instances of 
> {{io.netty.util.concurrent.FastThreadLocalThread}} occupying most of the memory:
> {noformat}
> 119 instances of "io.netty.util.concurrent.FastThreadLocalThread", loaded by 
> "sun.misc.Launcher$AppClassLoader @ 0x51a80" occupy 8.445.684.480 (93,96 
> %) bytes. {noformat}
> In the thread dump we see a lot of repair threads:
> {noformat}
> grep "Repair#" threaddump.txt | wc -l
>   50 {noformat}
>  
> The repair jobs are waiting for the validation to finish:
> {noformat}
> "Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x12fc5000 
> nid=0x542a waiting on condition [0x7f81ee414000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x0007939bcfc8> (a 
> com.google.common.util.concurrent.AbstractFuture$Sync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> at 
> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
> at 
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
> at 
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
> at 
> com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1509)
> at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:160)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at 
> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
> at 
> org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$13/480490520.run(Unknown
>  Source)
> at java.lang.Thread.run(Thread.java:748) {noformat}
>  
> That's the line where the threads are stuck:
> {noformat}
> // Wait for validation to complete
> Futures.getUnchecked(validations); {noformat}
>  
> The call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} stops 
> the thread pool executor. It looks like futures which are in progress 
> will therefore never be completed, so the repair thread waits forever and 
> never finishes.
>  
> Environment:
> Cassandra version: 3.11.4 and 3.11.6
> Cassandra Reaper: 1.4.0
> JVM memory settings:
> {noformat}
> -Xms11771M -Xmx11771M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 
> -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat}
> on another cluster with same issue:
> {noformat}
> -Xms31744M -Xmx31744M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 
> -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat}
> Java Runtime:
> {noformat}
> openjdk version "1.8.0_212"
> OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b03)
> OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b03, mixed mode) 
> {noformat}
>  
> The same issue described in this comment: 
> https://issues.apache.org/jira/browse/CASSANDRA-14355?focusedCommentId=16992973=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16992973
> As suggested in the comments I created this new specific ticket.
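
The mechanism described above can be reduced to a tiny standalone sketch (a
simplified model under assumed conditions, not Cassandra code): a thread blocked
in Futures.getUnchecked() on a future that nothing will ever complete once the
repair machinery is torn down.

{code:java}
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.SettableFuture;

public class StuckRepairThreadSketch
{
    public static void main(String[] args) throws InterruptedException
    {
        // Stand-in for the validation future awaited by the repair job.
        SettableFuture<Object> validations = SettableFuture.create();

        Thread repairThread = new Thread(() -> Futures.getUnchecked(validations), "Repair#1:1");
        repairThread.setDaemon(true); // lets this sketch exit; the real threads are leaked
        repairThread.start();

        // Terminating the repair sessions shuts down the machinery that would have
        // completed the future, so it is never set and the waiting thread parks forever.
        repairThread.join(2000);
        System.out.println("repair thread still alive: " + repairThread.isAlive()); // true
    }
}
{code}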



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: 

[jira] [Updated] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair

2020-10-28 Thread Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-15902:
-
Status: Patch Available  (was: Review In Progress)

LGTM

> OOM because repair session thread not closed when terminating repair
> 
>
> Key: CASSANDRA-15902
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15902
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Swen Fuhrmann
>Assignee: Swen Fuhrmann
>Priority: Normal
> Fix For: 3.0.x, 3.11.x
>
> Attachments: heap-mem-histo.txt, repair-terminated.txt
>
>
> In our cluster, after a while some nodes slowly run out of memory. On 
> those nodes we observed that Cassandra Reaper terminates repairs with a JMX 
> call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} because 
> they reach the 30 min timeout.
> In the heap dump we see a lot of instances of 
> {{io.netty.util.concurrent.FastThreadLocalThread}} occupying most of the memory:
> {noformat}
> 119 instances of "io.netty.util.concurrent.FastThreadLocalThread", loaded by 
> "sun.misc.Launcher$AppClassLoader @ 0x51a80" occupy 8.445.684.480 (93,96 
> %) bytes. {noformat}
> In the thread dump we see a lot of repair threads:
> {noformat}
> grep "Repair#" threaddump.txt | wc -l
>   50 {noformat}
>  
> The repair jobs are waiting for the validation to finish:
> {noformat}
> "Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x12fc5000 
> nid=0x542a waiting on condition [0x7f81ee414000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x0007939bcfc8> (a 
> com.google.common.util.concurrent.AbstractFuture$Sync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> at 
> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
> at 
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
> at 
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
> at 
> com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1509)
> at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:160)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at 
> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
> at 
> org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$13/480490520.run(Unknown
>  Source)
> at java.lang.Thread.run(Thread.java:748) {noformat}
>  
> That's the line where the threads are stuck:
> {noformat}
> // Wait for validation to complete
> Futures.getUnchecked(validations); {noformat}
>  
> The call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} stops 
> the thread pool executor. It looks like futures which are in progress 
> will therefore never be completed, so the repair thread waits forever and 
> never finishes.
>  
> Environment:
> Cassandra version: 3.11.4 and 3.11.6
> Cassandra Reaper: 1.4.0
> JVM memory settings:
> {noformat}
> -Xms11771M -Xmx11771M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 
> -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat}
> on another cluster with same issue:
> {noformat}
> -Xms31744M -Xmx31744M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 
> -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat}
> Java Runtime:
> {noformat}
> openjdk version "1.8.0_212"
> OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b03)
> OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b03, mixed mode) 
> {noformat}
>  
> The same issue described in this comment: 
> https://issues.apache.org/jira/browse/CASSANDRA-14355?focusedCommentId=16992973=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16992973
> As suggested in the comments I created this new specific ticket.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: 

[jira] [Updated] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair

2020-10-28 Thread Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-15902:
-
Reviewers: Alexander Dejanovski, Alexander Dejanovski  (was: Alexander 
Dejanovski)
   Alexander Dejanovski, Alexander Dejanovski  (was: Alexander 
Dejanovski)
   Status: Review In Progress  (was: Patch Available)

> OOM because repair session thread not closed when terminating repair
> 
>
> Key: CASSANDRA-15902
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15902
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Swen Fuhrmann
>Assignee: Swen Fuhrmann
>Priority: Normal
> Fix For: 3.0.x, 3.11.x
>
> Attachments: heap-mem-histo.txt, repair-terminated.txt
>
>
> In our cluster, after a while some nodes slowly run out of memory. On 
> those nodes we observed that Cassandra Reaper terminates repairs with a JMX 
> call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} because 
> they reach the 30 min timeout.
> In the heap dump we see a lot of instances of 
> {{io.netty.util.concurrent.FastThreadLocalThread}} occupying most of the memory:
> {noformat}
> 119 instances of "io.netty.util.concurrent.FastThreadLocalThread", loaded by 
> "sun.misc.Launcher$AppClassLoader @ 0x51a80" occupy 8.445.684.480 (93,96 
> %) bytes. {noformat}
> In the thread dump we see a lot of repair threads:
> {noformat}
> grep "Repair#" threaddump.txt | wc -l
>   50 {noformat}
>  
> The repair jobs are waiting for the validation to finish:
> {noformat}
> "Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x12fc5000 
> nid=0x542a waiting on condition [0x7f81ee414000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x0007939bcfc8> (a 
> com.google.common.util.concurrent.AbstractFuture$Sync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> at 
> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
> at 
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
> at 
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
> at 
> com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1509)
> at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:160)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at 
> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
> at 
> org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$13/480490520.run(Unknown
>  Source)
> at java.lang.Thread.run(Thread.java:748) {noformat}
>  
> That's the line where the threads are stuck:
> {noformat}
> // Wait for validation to complete
> Futures.getUnchecked(validations); {noformat}
>  
> The call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} stops 
> the thread pool executor. It looks like futures which are in progress 
> will therefore never be completed, so the repair thread waits forever and 
> never finishes.
>  
> Environment:
> Cassandra version: 3.11.4 and 3.11.6
> Cassandra Reaper: 1.4.0
> JVM memory settings:
> {noformat}
> -Xms11771M -Xmx11771M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 
> -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat}
> on another cluster with same issue:
> {noformat}
> -Xms31744M -Xmx31744M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 
> -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat}
> Java Runtime:
> {noformat}
> openjdk version "1.8.0_212"
> OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b03)
> OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b03, mixed mode) 
> {noformat}
>  
> The same issue described in this comment: 
> https://issues.apache.org/jira/browse/CASSANDRA-14355?focusedCommentId=16992973=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16992973
> As suggested in the comments I created this new specific ticket.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair

2020-10-28 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17222163#comment-17222163
 ] 

Alexander Dejanovski commented on CASSANDRA-15902:
--

The code looks good to me.
Patch works as expected and changes to the non-testing code are minimal.
Unit tests for repairs are all passing.

> OOM because repair session thread not closed when terminating repair
> 
>
> Key: CASSANDRA-15902
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15902
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Swen Fuhrmann
>Assignee: Swen Fuhrmann
>Priority: Normal
> Fix For: 3.0.x, 3.11.x
>
> Attachments: heap-mem-histo.txt, repair-terminated.txt
>
>
> In our cluster, after a while some nodes slowly run out of memory. On 
> those nodes we observed that Cassandra Reaper terminates repairs with a JMX 
> call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} because 
> they reach the 30 min timeout.
> In the heap dump we see a lot of instances of 
> {{io.netty.util.concurrent.FastThreadLocalThread}} occupying most of the memory:
> {noformat}
> 119 instances of "io.netty.util.concurrent.FastThreadLocalThread", loaded by 
> "sun.misc.Launcher$AppClassLoader @ 0x51a80" occupy 8.445.684.480 (93,96 
> %) bytes. {noformat}
> In the thread dump we see a lot of repair threads:
> {noformat}
> grep "Repair#" threaddump.txt | wc -l
>   50 {noformat}
>  
> The repair jobs are waiting for the validation to finish:
> {noformat}
> "Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x12fc5000 
> nid=0x542a waiting on condition [0x7f81ee414000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x0007939bcfc8> (a 
> com.google.common.util.concurrent.AbstractFuture$Sync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> at 
> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
> at 
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
> at 
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
> at 
> com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1509)
> at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:160)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at 
> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
> at 
> org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$13/480490520.run(Unknown
>  Source)
> at java.lang.Thread.run(Thread.java:748) {noformat}
>  
> That's the line where the threads are stuck:
> {noformat}
> // Wait for validation to complete
> Futures.getUnchecked(validations); {noformat}
>  
> The call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} stops 
> the thread pool executor. It looks like futures which are in progress 
> will therefore never be completed, so the repair thread waits forever and 
> never finishes.
>  
> Environment:
> Cassandra version: 3.11.4 and 3.11.6
> Cassandra Reaper: 1.4.0
> JVM memory settings:
> {noformat}
> -Xms11771M -Xmx11771M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 
> -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat}
> on another cluster with same issue:
> {noformat}
> -Xms31744M -Xmx31744M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 
> -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat}
> Java Runtime:
> {noformat}
> openjdk version "1.8.0_212"
> OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b03)
> OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b03, mixed mode) 
> {noformat}
>  
> The same issue described in this comment: 
> https://issues.apache.org/jira/browse/CASSANDRA-14355?focusedCommentId=16992973=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16992973
> As suggested in the comments I created this new specific ticket.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CASSANDRA-15580) 4.0 quality testing: Repair

2020-10-13 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212949#comment-17212949
 ] 

Alexander Dejanovski commented on CASSANDRA-15580:
--

Here's a test plan proposal: 

Generate/restore a workload of ~100GB to 200GB per node.
Some SSTables will have to be deleted (in a random fashion?) to make repair go 
through streaming sessions.
Perform repairs on a 3-node cluster with 4 cores each and 16GB RAM.
Repaired keyspaces will use RF=3 or RF=2 in some cases (the latter is for 
subranges with different sets of replicas).

||Mode||Version||Settings||Checks||
|Full repair|trunk|Sequential + All token ranges|"No anticompaction (repairedAt==0)
Out of sync ranges > 0
Subsequent run must show no out of sync range"|
|Full repair|trunk|Parallel + Primary range|"No anticompaction (repairedAt==0)
Out of sync ranges > 0
Subsequent run must show no out of sync range"|
|Full repair|trunk|Force terminate repair shortly after it was triggered|Repair 
threads must be cleaned up|
|Full repair|Mixed trunk + latest 3.11.x|Sequential + All token ranges|Repair 
should fail|
|Subrange repair|trunk|Sequential + single token range|"No anticompaction 
(repairedAt==0)
Out of sync ranges > 0
Subsequent run must show no out of sync range"|
|Subrange repair|trunk|Parallel + 10 token ranges which have the same 
replicas|"No anticompaction (repairedAt == 0)
Out of sync ranges > 0
Subsequent run must show no out of sync range + Check that repair sessions are 
cleaned up after a force terminate"|
|Subrange repair|trunk|Parallel + 10 token ranges which have different 
replicas|"No anticompaction (repairedAt==0)
Out of sync ranges > 0
Subsequent run must show no out of sync range + Check that repair sessions are 
cleaned up after a force terminate"|
|Subrange repair|trunk|"Single token range.
Force terminate repair shortly after it was triggered."|Repair threads must be 
cleaned up|
|Subrange repair|Mixed trunk + latest 3.11.x|Sequential + single token 
range|Repair should fail|
|Incremental repair|trunk|"Parallel (mandatory)
No compaction during repair"|"Anticompaction status (repairedAt != 0) on all 
SSTables
No pending repair on SSTables after completion
Out of sync ranges > 0 + Subsequent run must show no out of sync range"|
|Incremental repair|trunk|"Parallel (mandatory)
Major compaction triggered during repair"|"Anticompaction status (repairedAt != 
0) on all SSTables
No pending repair on SSTables after completion
Out of sync ranges > 0 + Subsequent run must show no out of sync range"|
|Incremental repair|trunk|Force terminate repair shortly after it was 
triggered.|Repair threads must be cleaned up|
|Incremental repair|Mixed trunk + latest 3.11.x|Parallel|Repair should fail|

I'm not sure about fuzz testing repair though. It's not a resilient process and 
isn't designed as such. Resiliency is obtained through third-party tools that 
will reschedule failed repairs. If a node is down or goes down while it should 
be part of a repair session, the repair session will simply fail AFAIK.

The mixed version tests could be challenging to set up as we probably don't 
want to pin a specific version as being the "previous" one.
Should this test be performed consistently between trunk and the previous major 
version? On a major version bump (when trunk moves to 5.0), I'd expect the test 
to pass as repair will probably work for a bit, unless there's a check on 
version numbers during repair/streaming?
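
On the "Out of sync ranges > 0" checks in the table above: a crude way to assert 
them is to scan the repair logs for the "range(s) out of sync" messages emitted 
during sync. This is a sketch only; the exact log wording and file location are 
assumptions that would need to be verified against the version under test.

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class OutOfSyncLogCheck
{
    // Assumed log format, e.g. "... have 12 range(s) out of sync for ks.table"
    private static final Pattern OUT_OF_SYNC =
        Pattern.compile("have (\\d+) range\\(s\\) out of sync");

    static long countOutOfSyncRanges(Path systemLog) throws IOException
    {
        try (Stream<String> lines = Files.lines(systemLog))
        {
            return lines.map(OUT_OF_SYNC::matcher)
                        .filter(Matcher::find)
                        .mapToLong(m -> Long.parseLong(m.group(1)))
                        .sum();
        }
    }

    public static void main(String[] args) throws IOException
    {
        long ranges = countOutOfSyncRanges(Paths.get(args[0])); // path to system.log
        System.out.println("out of sync ranges reported: " + ranges);
        // First run: expect ranges > 0. A second repair right after should report 0.
    }
}
{code}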

> 4.0 quality testing: Repair
> ---
>
> Key: CASSANDRA-15580
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15580
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/dtest/python
>Reporter: Josh McKenzie
>Assignee: Alexander Dejanovski
>Priority: Normal
> Fix For: 4.0-rc
>
>
> Reference [doc from 
> NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#]
>  for context.
> *Shepherd: Alexander Dejanovski*
> We aim for 4.0 to have the first fully functioning incremental repair 
> solution (CASSANDRA-9143)! Furthermore we aim to verify that all types of 
> repair: (full range, sub range, incremental) function as expected as well as 
> ensuring community tools 

[jira] [Updated] (CASSANDRA-15580) 4.0 quality testing: Repair

2020-10-12 Thread Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-15580:
-
Description: 
Reference [doc from 
NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#]
 for context.

*Shepherd: Alexander Dejanovski*

We aim for 4.0 to have the first fully functioning incremental repair solution 
(CASSANDRA-9143)! Furthermore we aim to verify that all types of repair: (full 
range, sub range, incremental) function as expected as well as ensuring 
community tools such as Reaper work. CASSANDRA-3200 adds an experimental option 
to reduce the amount of data streamed during repair, we should write more tests 
and see how it works with big nodes.

  was:
Reference [doc from 
NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#]
 for context.

*Shepherd: None*

We aim for 4.0 to have the first fully functioning incremental repair solution 
(CASSANDRA-9143)! Furthermore we aim to verify that all types of repair: (full 
range, sub range, incremental) function as expected as well as ensuring 
community tools such as Reaper work. CASSANDRA-3200 adds an experimental option 
to reduce the amount of data streamed during repair, we should write more tests 
and see how it works with big nodes.


> 4.0 quality testing: Repair
> ---
>
> Key: CASSANDRA-15580
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15580
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/dtest/python
>Reporter: Josh McKenzie
>Assignee: Alexander Dejanovski
>Priority: Normal
> Fix For: 4.0-rc
>
>
> Reference [doc from 
> NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#]
>  for context.
> *Shepherd: Alexander Dejanovski*
> We aim for 4.0 to have the first fully functioning incremental repair 
> solution (CASSANDRA-9143)! Furthermore we aim to verify that all types of 
> repair: (full range, sub range, incremental) function as expected as well as 
> ensuring community tools such as Reaper work. CASSANDRA-3200 adds an 
> experimental option to reduce the amount of data streamed during repair, we 
> should write more tests and see how it works with big nodes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15580) 4.0 quality testing: Repair

2020-10-10 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211599#comment-17211599
 ] 

Alexander Dejanovski commented on CASSANDRA-15580:
--

I'll shepherd this ticket and start designing the test scenarios.

[~jmckenzie], regarding the use of Fallout, isn't that conversation supposed to 
take place in CASSANDRA-15585? If we go down that path (and Fallout looks like 
a great tool for the job), it means the reviewers and contributors need to get 
up to speed with it before anything can happen. Time-wise it could work against 
fast completion.
My understanding also is that the OSS version of Fallout works with k8s 
exclusively, which would require k8s clusters to be available from CI (just 
mentioning it as I'm not sure that's something we have yet).

Also, where should these tests live in the project? The natural fit for them 
would be dtests (which would mean using ccm), but running tests with big nodes 
could be challenging in this environment.
Was there a plan to create a new repo or just a new set of dtests?

I'd love for us to test repair on nodes with >= 100GB, but generating the data 
could take quite some time. Using backups could make that part faster if we can 
get an S3 bucket (or similar) to store the data on.

[~mck], you've been spending quite some time on the project CI lately, so your 
input on what can/cannot be done there would be much appreciated.

[~marcuse] [~vinaychella], are you still willing to review the deliverables 
here? What's your take on tooling and where the tests should live?

> 4.0 quality testing: Repair
> ---
>
> Key: CASSANDRA-15580
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15580
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/dtest/python
>Reporter: Josh McKenzie
>Assignee: Alexander Dejanovski
>Priority: Normal
> Fix For: 4.0-rc
>
>
> Reference [doc from 
> NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#]
>  for context.
> *Shepherd: None*
> We aim for 4.0 to have the first fully functioning incremental repair 
> solution (CASSANDRA-9143)! Furthermore we aim to verify that all types of 
> repair: (full range, sub range, incremental) function as expected as well as 
> ensuring community tools such as Reaper work. CASSANDRA-3200 adds an 
> experimental option to reduce the amount of data streamed during repair, we 
> should write more tests and see how it works with big nodes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-15580) 4.0 quality testing: Repair

2020-10-10 Thread Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-15580:
-
Fix Version/s: (was: 4.0-beta)
   4.0-rc

> 4.0 quality testing: Repair
> ---
>
> Key: CASSANDRA-15580
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15580
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/dtest/python
>Reporter: Josh McKenzie
>Assignee: Alexander Dejanovski
>Priority: Normal
> Fix For: 4.0-rc
>
>
> Reference [doc from 
> NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#]
>  for context.
> *Shepherd: None*
> We aim for 4.0 to have the first fully functioning incremental repair 
> solution (CASSANDRA-9143)! Furthermore we aim to verify that all types of 
> repair: (full range, sub range, incremental) function as expected as well as 
> ensuring community tools such as Reaper work. CASSANDRA-3200 adds an 
> experimental option to reduce the amount of data streamed during repair, we 
> should write more tests and see how it works with big nodes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Assigned] (CASSANDRA-15580) 4.0 quality testing: Repair

2020-10-10 Thread Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski reassigned CASSANDRA-15580:


Assignee: Alexander Dejanovski

> 4.0 quality testing: Repair
> ---
>
> Key: CASSANDRA-15580
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15580
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/dtest/python
>Reporter: Josh McKenzie
>Assignee: Alexander Dejanovski
>Priority: Normal
> Fix For: 4.0-beta
>
>
> Reference [doc from 
> NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#]
>  for context.
> *Shepherd: None*
> We aim for 4.0 to have the first fully functioning incremental repair 
> solution (CASSANDRA-9143)! Furthermore we aim to verify that all types of 
> repair: (full range, sub range, incremental) function as expected as well as 
> ensuring community tools such as Reaper work. CASSANDRA-3200 adds an 
> experimental option to reduce the amount of data streamed during repair, we 
> should write more tests and see how it works with big nodes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair

2020-10-09 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210644#comment-17210644
 ] 

Alexander Dejanovski commented on CASSANDRA-15902:
--

3.0 works as expected as well once patched.
I'll proceed with code review now.

> OOM because repair session thread not closed when terminating repair
> 
>
> Key: CASSANDRA-15902
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15902
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Swen Fuhrmann
>Assignee: Swen Fuhrmann
>Priority: Normal
> Fix For: 3.0.x, 3.11.x
>
> Attachments: heap-mem-histo.txt, repair-terminated.txt
>
>
> In our cluster, after a while some nodes slowly run out of memory. On 
> those nodes we observed that Cassandra Reaper terminates repairs with a JMX 
> call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} because 
> they reach the 30 min timeout.
> In the heap dump we see a lot of instances of 
> {{io.netty.util.concurrent.FastThreadLocalThread}} occupying most of the memory:
> {noformat}
> 119 instances of "io.netty.util.concurrent.FastThreadLocalThread", loaded by 
> "sun.misc.Launcher$AppClassLoader @ 0x51a80" occupy 8.445.684.480 (93,96 
> %) bytes. {noformat}
> In the thread dump we see a lot of repair threads:
> {noformat}
> grep "Repair#" threaddump.txt | wc -l
>   50 {noformat}
>  
> The repair jobs are waiting for the validation to finish:
> {noformat}
> "Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x12fc5000 
> nid=0x542a waiting on condition [0x7f81ee414000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x0007939bcfc8> (a 
> com.google.common.util.concurrent.AbstractFuture$Sync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> at 
> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
> at 
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
> at 
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
> at 
> com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1509)
> at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:160)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at 
> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
> at 
> org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$13/480490520.run(Unknown
>  Source)
> at java.lang.Thread.run(Thread.java:748) {noformat}
>  
> That's the line where the threads are stuck:
> {noformat}
> // Wait for validation to complete
> Futures.getUnchecked(validations); {noformat}
>  
> The call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} stops 
> the thread pool executor. It looks like futures which are in progress 
> will therefore never be completed, so the repair thread waits forever and 
> never finishes.
>  
> Environment:
> Cassandra version: 3.11.4 and 3.11.6
> Cassandra Reaper: 1.4.0
> JVM memory settings:
> {noformat}
> -Xms11771M -Xmx11771M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 
> -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat}
> on another cluster with same issue:
> {noformat}
> -Xms31744M -Xmx31744M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 
> -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat}
> Java Runtime:
> {noformat}
> openjdk version "1.8.0_212"
> OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b03)
> OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b03, mixed mode) 
> {noformat}
>  
> The same issue described in this comment: 
> https://issues.apache.org/jira/browse/CASSANDRA-14355?focusedCommentId=16992973=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16992973
> As suggested in the comments I created this new specific ticket.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: 

[jira] [Updated] (CASSANDRA-15584) 4.0 quality testing: Tooling - External Ecosystem

2020-10-08 Thread Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-15584:
-
Description: 
Reference [doc from 
NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#]
 for context.

*Shepherd: Benjamin Lerer*

Many users of Apache Cassandra employ open source tooling to automate Cassandra 
configuration, runtime management, and repair scheduling. Prior to release, we 
need to confirm that popular third-party tools function properly. 

Current list of tools:
|| Name || Status || Contact ||
| [Priam|http://netflix.github.io/Priam/] |{color:#00875A} *DONE WITH 
ALPHA*{color} (need to be tested with beta) | [~sumanth.pasupuleti]| 
| [sstabletools|https://github.com/instaclustr/cassandra-sstable-tools] | *NOT 
STARTED* | [~stefan.miklosovic]| 
| [cassandra-exporter|https://github.com/instaclustr/cassandra-exporter]| *NOT 
STARTED* | [~stefan.miklosovic]|
| [Instaclustr Cassandra 
operator|https://github.com/instaclustr/cassandra-operator]|  
{color:#00875A}*DONE*{color} | [~stefan.miklosovic]|
| [Instaclustr Cassandra Backup Restore | 
https://github.com/instaclustr/cassandra-backup]|{color:#00875A}*DONE*{color} | 
[~stefan.miklosovic]|
| [Instaclustr Cassandra Sidecar | 
https://github.com/instaclustr/cassandra-sidecar]|{color:#00875A}*DONE*{color} 
| [~stefan.miklosovic]|
| [Cassandra SSTable generator | 
https://github.com/instaclustr/cassandra-sstable-generator]|{color:#00875A}*DONE*{color}|
 [~stefan.miklosovic]|
| [Cassandra TTL Remover | https://github.com/instaclustr/TTLRemover] | 
{color:#00875A}*DONE*{color} |  [~stefan.miklosovic]|
| [Cassandra Everywhere Strategy | 
https://github.com/instaclustr/cassandra-everywhere-strategy] | 
{color:#00875A}*DONE*{color} | [~stefan.miklosovic]|
| [Reaper|http://cassandra-reaper.io/]| {color:#00875A}*AUTOMATIC*{color} | 
[~adejanovski]|
| [Medusa|https://github.com/thelastpickle/cassandra-medusa]| *IN PROGRESS*| 
[~adejanovski]|
| [Casskop|https://orange-opensource.github.io/casskop/]| *NOT STARTED*| Franck 
Dehay|
| 
[spark-cassandra-connector|https://github.com/datastax/spark-cassandra-connector]|
 {color:#00875A}*DONE*{color}| [~jtgrabowski]|
| [cass operator|https://github.com/datastax/cass-operator]| 
{color:#00875A}*DONE*{color}| [~jimdickinson]|
| [metric 
collector|https://github.com/datastax/metric-collector-for-apache-cassandra]| 
{color:#00875A}*DONE*{color}| [~tjake]|
| [management 
API|https://github.com/datastax/management-api-for-apache-cassandra]| 
{color:#00875A}*DONE*{color}| [~tjake]|  

Columns descriptions:
* *Name*: Name and link to the tool official page
* *Status*: {{NOT STARTED}}, {{IN PROGRESS}}, {{BLOCKED}} if you hit any issue 
and have to wait for it to be solved, {{DONE}}, {{AUTOMATIC}} if testing 4.0 is 
part of your CI process.
* *Contact*: The person acting as the contact point for that tool. 

  was:
Reference [doc from 
NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#]
 for context.

*Shepherd: Benjamin Lerer*

Many users of Apache Cassandra employ open source tooling to automate Cassandra 
configuration, runtime management, and repair scheduling. Prior to release, we 
need to confirm that popular third-party tools function properly. 

Current list of tools:
|| Name || Status || Contact ||
| [Priam|http://netflix.github.io/Priam/] |{color:#00875A} *DONE WITH 
ALPHA*{color} (need to be tested with beta) | [~sumanth.pasupuleti]| 
| [sstabletools|https://github.com/instaclustr/cassandra-sstable-tools] | *NOT 
STARTED* | [~stefan.miklosovic]| 
| [cassandra-exporter|https://github.com/instaclustr/cassandra-exporter]| *NOT 
STARTED* | [~stefan.miklosovic]|
| [Instaclustr Cassandra 
operator|https://github.com/instaclustr/cassandra-operator]|  
{color:#00875A}*DONE*{color} | [~stefan.miklosovic]|
| [Instaclustr Cassandra Backup Restore | 
https://github.com/instaclustr/cassandra-backup]|{color:#00875A}*DONE*{color} | 
[~stefan.miklosovic]|
| [Instaclustr Cassandra Sidecar | 
https://github.com/instaclustr/cassandra-sidecar]|{color:#00875A}*DONE*{color} 
| [~stefan.miklosovic]|
| [Cassandra SSTable generator | 
https://github.com/instaclustr/cassandra-sstable-generator]|{color:#00875A}*DONE*{color}|
 [~stefan.miklosovic]|
| [Cassandra TTL Remover | https://github.com/instaclustr/TTLRemover] | 
{color:#00875A}*DONE*{color} |  [~stefan.miklosovic]|
| [Cassandra Everywhere Strategy | 
https://github.com/instaclustr/cassandra-everywhere-strategy] | 
{color:#00875A}*DONE*{color} | [~stefan.miklosovic]|
| [Reaper|http://cassandra-reaper.io/]| {color:#00875A}*AUTOMATIC*{color} | 
[~adejanovski]|
| [Medusa|https://github.com/thelastpickle/cassandra-medusa]| *NOT STARTED*| 
[~adejanovski]|
| [Casskop|https://orange-opensource.github.io/casskop/]| *NOT STARTED*| Franck 
Dehay|
| 

[jira] [Commented] (CASSANDRA-15584) 4.0 quality testing: Tooling - External Ecosystem

2020-10-08 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210106#comment-17210106
 ] 

Alexander Dejanovski commented on CASSANDRA-15584:
--

[~blerer],

I've added integration tests for Medusa against trunk, which are now breaking 
when the sstableloader is used on a cluster with client-to-server encryption: 
https://github.com/thelastpickle/cassandra-medusa/actions/runs/291725292

I need to investigate this issue more closely and maybe open a JIRA if there's 
indeed a problem in the sstableloader.

> 4.0 quality testing: Tooling - External Ecosystem
> -
>
> Key: CASSANDRA-15584
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15584
> Project: Cassandra
>  Issue Type: Task
>  Components: Tool/external
>Reporter: Josh McKenzie
>Assignee: Benjamin Lerer
>Priority: Normal
> Fix For: 4.0-rc
>
>
> Reference [doc from 
> NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#]
>  for context.
> *Shepherd: Benjamin Lerer*
> Many users of Apache Cassandra employ open source tooling to automate 
> Cassandra configuration, runtime management, and repair scheduling. Prior to 
> release, we need to confirm that popular third-party tools function properly. 
> Current list of tools:
> || Name || Status || Contact ||
> | [Priam|http://netflix.github.io/Priam/] |{color:#00875A} *DONE WITH 
> ALPHA*{color} (need to be tested with beta) | [~sumanth.pasupuleti]| 
> | [sstabletools|https://github.com/instaclustr/cassandra-sstable-tools] | 
> *NOT STARTED* | [~stefan.miklosovic]| 
> | [cassandra-exporter|https://github.com/instaclustr/cassandra-exporter]| 
> *NOT STARTED* | [~stefan.miklosovic]|
> | [Instaclustr Cassandra 
> operator|https://github.com/instaclustr/cassandra-operator]|  
> {color:#00875A}*DONE*{color} | [~stefan.miklosovic]|
> | [Instaclustr Cassandra Backup Restore | 
> https://github.com/instaclustr/cassandra-backup]|{color:#00875A}*DONE*{color} 
> | [~stefan.miklosovic]|
> | [Instaclustr Cassandra Sidecar | 
> https://github.com/instaclustr/cassandra-sidecar]|{color:#00875A}*DONE*{color}
>  | [~stefan.miklosovic]|
> | [Cassandra SSTable generator | 
> https://github.com/instaclustr/cassandra-sstable-generator]|{color:#00875A}*DONE*{color}|
>  [~stefan.miklosovic]|
> | [Cassandra TTL Remover | https://github.com/instaclustr/TTLRemover] | 
> {color:#00875A}*DONE*{color} |  [~stefan.miklosovic]|
> | [Cassandra Everywhere Strategy | 
> https://github.com/instaclustr/cassandra-everywhere-strategy] | 
> {color:#00875A}*DONE*{color} | [~stefan.miklosovic]|
> | [Reaper|http://cassandra-reaper.io/]| {color:#00875A}*AUTOMATIC*{color} | 
> [~adejanovski]|
> | [Medusa|https://github.com/thelastpickle/cassandra-medusa]| *NOT STARTED*| 
> [~adejanovski]|
> | [Casskop|https://orange-opensource.github.io/casskop/]| *NOT STARTED*| 
> Franck Dehay|
> | 
> [spark-cassandra-connector|https://github.com/datastax/spark-cassandra-connector]|
>  {color:#00875A}*DONE*{color}| [~jtgrabowski]|
> | [cass operator|https://github.com/datastax/cass-operator]| 
> {color:#00875A}*DONE*{color}| [~jimdickinson]|
> | [metric 
> collector|https://github.com/datastax/metric-collector-for-apache-cassandra]| 
> {color:#00875A}*DONE*{color}| [~tjake]|
> | [managment 
> API|https://github.com/datastax/management-api-for-apache-cassandra]| 
> {color:#00875A}*DONE*{color}| [~tjake]|  
> Columns descriptions:
> * *Name*: Name and link to the tool official page
> * *Status*: {{NOT STARTED}}, {{IN PROGRESS}}, {{BLOCKED}} if you hit any 
> issue and have to wait for it to be solved, {{DONE}}, {{AUTOMATIC}} if 
> testing 4.0 is part of your CI process.
> * *Contact*: The person acting as the contact point for that tool. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair

2020-10-02 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17206250#comment-17206250
 ] 

Alexander Dejanovski commented on CASSANDRA-15902:
--

So far, so good.
I've reproduced the issue in 3.11 using a low timeout in Reaper, and repair 
session threads started to pile up indefinitely:

{code:java}
% x_all "sudo su -s /bin/bash -c \"jstack \$(ps -ef |grep CassandraDaemon |grep 
-v grep| cut -d' ' -f3) |grep 'Repair#'\" cassandra"
"Repair#11:1" #2193 daemon prio=5 os_prio=0 tid=0x7fe15b19f530 nid=0x74d8 
waiting on condition [0x7fe145968000]
"Repair#10:1" #2154 daemon prio=5 os_prio=0 tid=0x7fe16d7eceb0 nid=0x7471 
waiting on condition [0x7fe12bf12000]
"Repair#8:1" #2116 daemon prio=5 os_prio=0 tid=0x7fe150316b40 nid=0x73f1 
waiting on condition [0x7fe12ce09000]
"Repair#7:1" #2084 daemon prio=5 os_prio=0 tid=0x7fe150162f80 nid=0x73a9 
waiting on condition [0x7fe137894000]
"Repair#3:1" #1704 daemon prio=5 os_prio=0 tid=0x7fe10f1b98d0 nid=0x6b9a 
waiting on condition [0x7fe1428fc000]
"Repair#14:1" #1778 daemon prio=5 os_prio=0 tid=0x565030775bb0 nid=0x6d58 
waiting on condition [0x7f8d08659000]
"Repair#9:1" #1573 daemon prio=5 os_prio=0 tid=0x7f8d28770af0 nid=0x6b88 
waiting on condition [0x7f8d1ff39000]
"Repair#2:1" #1397 daemon prio=5 os_prio=0 tid=0x7f8d2815eb70 nid=0x6851 
waiting on condition [0x7f8d1f9a]
"Repair#1:1" #1375 daemon prio=5 os_prio=0 tid=0x7f8c67dcee40 nid=0x66a8 
waiting on condition [0x7f8d1cc6f000]
"Repair#1:1" #2412 daemon prio=5 os_prio=0 tid=0x7fc61d2a38f0 nid=0x6ed9 
waiting on condition [0x7fc60736d000]
{code}

Then I built the patched version and waited again for repairs to time out for a 
little while.
I never got more than one repair thread:

{code:java}
% x_all "sudo su -s /bin/bash -c \"jstack \$(ps -ef |grep CassandraDaemon |grep 
-v grep| cut -d' ' -f2) |grep 'Repair#'\" cassandra"
"Repair#21:1" #682 daemon prio=5 os_prio=0 tid=0x7f249854cc10 nid=0x7ced 
waiting on condition [0x7f246f779000]
{code}
 
I'm currently checking that repairs still go through as expected with a regular 
timeout; that check is still running. 
Once that's done, I'll check again against 3.0 and then perform a code review.
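
For reference, a rough sketch of the per-node check above, counting leaked "Repair#" threads in a jstack dump; the SSH aliases, sudo setup and PID discovery are assumptions about my test environment (the x_all wrapper used above is site-specific):

{code:python}
# Rough sketch of the check above: count "Repair#" threads in a jstack dump on
# each node. The SSH aliases, sudo setup and PID discovery are assumptions.
import subprocess

NODES = ["node1", "node2", "node3"]  # hypothetical SSH aliases

def repair_thread_count(host: str) -> int:
    # Find the CassandraDaemon PID, then dump its threads as the cassandra user.
    pid = subprocess.check_output(
        ["ssh", host, "pgrep", "-f", "CassandraDaemon"], text=True
    ).split()[0]
    jstack = subprocess.check_output(
        ["ssh", host, "sudo", "-u", "cassandra", "jstack", pid], text=True
    )
    return sum(1 for line in jstack.splitlines() if "Repair#" in line)

for node in NODES:
    # With the patch applied this should stay at 0 or 1 even after repeated
    # forced terminations; without it, the count keeps growing.
    print(f"{node}: {repair_thread_count(node)} repair thread(s)")
{code}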

> OOM because repair session thread not closed when terminating repair
> 
>
> Key: CASSANDRA-15902
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15902
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Swen Fuhrmann
>Assignee: Swen Fuhrmann
>Priority: Normal
> Fix For: 3.0.x, 3.11.x
>
> Attachments: heap-mem-histo.txt, repair-terminated.txt
>
>
> In our cluster, after a while some nodes slowly run out of memory. On 
> those nodes we observed that Cassandra Reaper terminates repairs with a JMX 
> call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} because 
> the timeout of 30 min is reached.
> In the memory heap dump we see a lot of instances of 
> {{io.netty.util.concurrent.FastThreadLocalThread}} occupying most of the memory:
> {noformat}
> 119 instances of "io.netty.util.concurrent.FastThreadLocalThread", loaded by 
> "sun.misc.Launcher$AppClassLoader @ 0x51a80" occupy 8.445.684.480 (93,96 
> %) bytes. {noformat}
> In the thread dump we see a lot of repair threads:
> {noformat}
> grep "Repair#" threaddump.txt | wc -l
>   50 {noformat}
>  
> The repair jobs are waiting for the validation to finish:
> {noformat}
> "Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x12fc5000 
> nid=0x542a waiting on condition [0x7f81ee414000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x0007939bcfc8> (a 
> com.google.common.util.concurrent.AbstractFuture$Sync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> at 
> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
> at 
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
> at 
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
> at 
> com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1509)
> at 

[jira] [Updated] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair

2020-09-28 Thread Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-15902:
-
Reviewers: Alexander Dejanovski, Alexander Dejanovski  (was: Alexander 
Dejanovski)
   Alexander Dejanovski, Alexander Dejanovski
   Status: Review In Progress  (was: Patch Available)

Starting testing and review.

> OOM because repair session thread not closed when terminating repair
> 
>
> Key: CASSANDRA-15902
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15902
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Swen Fuhrmann
>Assignee: Swen Fuhrmann
>Priority: Normal
> Fix For: 3.0.x, 3.11.x
>
> Attachments: heap-mem-histo.txt, repair-terminated.txt
>
>
> In our cluster, after a while some nodes slowly run out of memory. On 
> those nodes we observed that Cassandra Reaper terminates repairs with a JMX 
> call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} because 
> the timeout of 30 min is reached.
> In the memory heap dump we see a lot of instances of 
> {{io.netty.util.concurrent.FastThreadLocalThread}} occupying most of the memory:
> {noformat}
> 119 instances of "io.netty.util.concurrent.FastThreadLocalThread", loaded by 
> "sun.misc.Launcher$AppClassLoader @ 0x51a80" occupy 8.445.684.480 (93,96 
> %) bytes. {noformat}
> In the thread dump we see a lot of repair threads:
> {noformat}
> grep "Repair#" threaddump.txt | wc -l
>   50 {noformat}
>  
> The repair jobs are waiting for the validation to finish:
> {noformat}
> "Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x12fc5000 
> nid=0x542a waiting on condition [0x7f81ee414000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x0007939bcfc8> (a 
> com.google.common.util.concurrent.AbstractFuture$Sync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> at 
> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
> at 
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
> at 
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
> at 
> com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1509)
> at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:160)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at 
> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
> at 
> org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$13/480490520.run(Unknown
>  Source)
> at java.lang.Thread.run(Thread.java:748) {noformat}
>  
> That's the line where the threads get stuck:
> {noformat}
> // Wait for validation to complete
> Futures.getUnchecked(validations); {noformat}
>  
> The call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} stops 
> the thread pool executor. It looks like futures which are still in progress 
> will therefore never complete, so the repair thread waits forever and 
> never finishes.
>  
> Environment:
> Cassandra version: 3.11.4 and 3.11.6
> Cassandra Reaper: 1.4.0
> JVM memory settings:
> {noformat}
> -Xms11771M -Xmx11771M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 
> -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat}
> on another cluster with same issue:
> {noformat}
> -Xms31744M -Xmx31744M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 
> -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat}
> Java Runtime:
> {noformat}
> openjdk version "1.8.0_212"
> OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b03)
> OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b03, mixed mode) 
> {noformat}
>  
> The same issue described in this comment: 
> https://issues.apache.org/jira/browse/CASSANDRA-14355?focusedCommentId=16992973=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16992973
> As suggested in the comments I created this new specific ticket.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair

2020-09-28 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203277#comment-17203277
 ] 

Alexander Dejanovski commented on CASSANDRA-15902:
--

Hi [~moczarski],

I'm aware of similar reports regarding repair sessions not being cleaned up 
correctly.
I'll happily test this patch and perform a review.

> OOM because repair session thread not closed when terminating repair
> 
>
> Key: CASSANDRA-15902
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15902
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Swen Fuhrmann
>Assignee: Swen Fuhrmann
>Priority: Normal
> Fix For: 3.0.x, 3.11.x
>
> Attachments: heap-mem-histo.txt, repair-terminated.txt
>
>
> In our cluster, after a while some nodes slowly run out of memory. On 
> those nodes we observed that Cassandra Reaper terminates repairs with a JMX 
> call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} because 
> the timeout of 30 min is reached.
> In the memory heap dump we see a lot of instances of 
> {{io.netty.util.concurrent.FastThreadLocalThread}} occupying most of the memory:
> {noformat}
> 119 instances of "io.netty.util.concurrent.FastThreadLocalThread", loaded by 
> "sun.misc.Launcher$AppClassLoader @ 0x51a80" occupy 8.445.684.480 (93,96 
> %) bytes. {noformat}
> In the thread dump we see a lot of repair threads:
> {noformat}
> grep "Repair#" threaddump.txt | wc -l
>   50 {noformat}
>  
> The repair jobs are waiting for the validation to finish:
> {noformat}
> "Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x12fc5000 
> nid=0x542a waiting on condition [0x7f81ee414000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x0007939bcfc8> (a 
> com.google.common.util.concurrent.AbstractFuture$Sync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> at 
> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
> at 
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
> at 
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
> at 
> com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1509)
> at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:160)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at 
> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
> at 
> org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$13/480490520.run(Unknown
>  Source)
> at java.lang.Thread.run(Thread.java:748) {noformat}
>  
> That's the line where the threads get stuck:
> {noformat}
> // Wait for validation to complete
> Futures.getUnchecked(validations); {noformat}
>  
> The call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} stops 
> the thread pool executor. It looks like futures which are still in progress 
> will therefore never complete, so the repair thread waits forever and 
> never finishes.
>  
> Environment:
> Cassandra version: 3.11.4 and 3.11.6
> Cassandra Reaper: 1.4.0
> JVM memory settings:
> {noformat}
> -Xms11771M -Xmx11771M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 
> -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat}
> on another cluster with same issue:
> {noformat}
> -Xms31744M -Xmx31744M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 
> -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat}
> Java Runtime:
> {noformat}
> openjdk version "1.8.0_212"
> OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b03)
> OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b03, mixed mode) 
> {noformat}
>  
> The same issue described in this comment: 
> https://issues.apache.org/jira/browse/CASSANDRA-14355?focusedCommentId=16992973=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16992973
> As suggested in the comments I created this new specific ticket.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CASSANDRA-13701) Lower default num_tokens

2020-08-18 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-13701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179700#comment-17179700
 ] 

Alexander Dejanovski commented on CASSANDRA-13701:
--

New CI run with some additional adjustments in timings 
[here|https://ci-cassandra.apache.org/blue/organizations/jenkins/Cassandra-devbranch/detail/Cassandra-devbranch/248/tests].

The last failing test is fixed in trunk by [this 
commit|https://github.com/apache/cassandra/commit/c94ececec0fcd87459858370396d6cd586853787].
 It's unrelated to this ticket.

I've squashed the commits in [my cassandra-dtests 
branch|https://github.com/adejanovski/cassandra-dtest/tree/CASSANDRA-13701], 
but I still need to drop the commit that points to [the patched version of 
ccm|https://github.com/adejanovski/ccm/tree/CASSANDRA-13701].

Let's wait for the conversation to settle in the ASF Slack before moving on 
here.
Maybe we should re-run CI to see whether any remaining flaky tests are related 
to this ticket?

> Lower default num_tokens
> 
>
> Key: CASSANDRA-13701
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13701
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Config
>Reporter: Chris Lohfink
>Assignee: Alexander Dejanovski
>Priority: Low
> Fix For: 4.0-alpha
>
>
> For reasons highlighted in CASSANDRA-7032, the high number of vnodes is not 
> necessary. It is very expensive for operations processes and scanning. Its 
> come up a lot and its pretty standard and known now to always reduce the 
> num_tokens within the community. We should just lower the defaults.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13701) Lower default num_tokens

2020-08-14 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-13701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177965#comment-17177965
 ] 

Alexander Dejanovski commented on CASSANDRA-13701:
--

I've identified several issues today:
 * ccm uses a hardcoded 30s timeout when waiting for events (like nodes 
starting), which doesn't work with the additional wait times that come with the 
new token allocation algorithm. The fix is 
[here|https://github.com/riptano/ccm/commit/8a91a5aa49473211863a1fb7a980206e5222ce5d].
 * ccm starts all nodes at the same time when cluster.start() is invoked, which 
creates clashes when the new token allocation algorithm is used and makes some 
tests flaky. Starting them sequentially using [this 
fix|https://github.com/riptano/ccm/commit/e6e4abcff375debde8195104c5cffd1cecb8d6cf] 
allowed all the bootstrap dtests to pass (see the sketch at the end of this 
comment).
 * [~jeromatron]'s branch is missing some commits from the current trunk that 
fix other failing dtests. Rebasing it over trunk is necessary to get them all 
to pass.
 * Adding a few seconds of sleep in 
[bootstrap_test.py::TestBootstrap::test_simultaneous_bootstrap|https://github.com/adejanovski/cassandra-dtest/blob/master/bootstrap_test.py#L769-L771]
 allows the test to pass.

I'm currently rerunning all dtests with the various fixes to see if I still get 
failures. I'll follow up on Monday and hopefully push PRs to ccm and 
cassandra-dtest that will allow the patch to be applied (there are conflicts, 
though, so a rebase will be necessary).

A follow-up discussion and ticket will probably be necessary because the new 
token allocation algorithm and concurrent bootstraps don't play nicely 
together.
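
As a rough illustration of the sequential-start idea behind the second ccm fix above (the real fix lives in ccm's cluster.start(), see the linked commit), assuming the ccm CLI is on the PATH and the node names match the dtest cluster:

{code:python}
# Illustration only of the idea behind that fix: start the nodes one at a time
# so each can settle before the next one joins. Node names and the delay are
# arbitrary; ccm's actual fix waits on startup events instead of sleeping.
import subprocess
import time

NODES = ["node1", "node2", "node3"]

for node in NODES:
    subprocess.run(["ccm", node, "start"], check=True)
    time.sleep(30)  # crude stand-in for "wait until the node is up and gossiping"
{code}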

> Lower default num_tokens
> 
>
> Key: CASSANDRA-13701
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13701
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Config
>Reporter: Chris Lohfink
>Assignee: Alexander Dejanovski
>Priority: Low
> Fix For: 4.0-alpha
>
>
> For reasons highlighted in CASSANDRA-7032, the high number of vnodes is not 
> necessary. It is very expensive for operations processes and scanning. Its 
> come up a lot and its pretty standard and known now to always reduce the 
> num_tokens within the community. We should just lower the defaults.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13701) Lower default num_tokens

2020-08-13 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-13701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176806#comment-17176806
 ] 

Alexander Dejanovski commented on CASSANDRA-13701:
--

Thanks [~brandon.williams], 

that's valuable information and I can move on to fixing the other tests.

> Lower default num_tokens
> 
>
> Key: CASSANDRA-13701
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13701
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Config
>Reporter: Chris Lohfink
>Assignee: Alexander Dejanovski
>Priority: Low
> Fix For: 4.0-alpha
>
>
> For reasons highlighted in CASSANDRA-7032, the high number of vnodes is not 
> necessary. It is very expensive for operations processes and scanning. Its 
> come up a lot and its pretty standard and known now to always reduce the 
> num_tokens within the community. We should just lower the defaults.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13701) Lower default num_tokens

2020-08-12 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-13701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176483#comment-17176483
 ] 

Alexander Dejanovski commented on CASSANDRA-13701:
--

Quick update:

I was able to make the 
*bootstrap_test.py::TestBootstrap::test_simultaneous_bootstrap* pass with this 
branch.

The test assumes that both starting nodes will see each other when they check 
for endpoint collision. But if the nodes start at exactly (or roughly) the same 
time, they can both perform the check before either of them is gossiping, 
meaning only node1 is part of the ring, which allows both of them to get tokens 
and start bootstrapping.

Since there's a 30s pause waiting for gossip to settle, adding a 10s pause 
between the node2 and node3 startups lets us "luckily" avoid the race condition.

The code is not bulletproof against this scenario though. 
I still wonder why this only happens with the new token allocation 
algorithm. Furthermore, tests are executed with num_tokens = 1, which makes it 
fairly fast to pick a token.
It seems like the orchestration differs between the random token allocation and 
the RF-based allocation, which makes the race condition more obvious.

I'll check the other failing tests tomorrow to see if we're dealing with the 
same problems. 
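
A sketch of the dtest-level workaround described above, assuming the ccmlib node objects used by cassandra-dtest; the real test's assertions and its handling of the refused bootstrap are omitted:

{code:python}
# Sketch of the dtest-level workaround, not the actual bootstrap_test.py code:
# stagger the node2/node3 startups so node3 sees node2 in gossip when it checks
# for endpoint collisions. Assumes ccmlib node objects as used by cassandra-dtest.
import time

def staggered_start(cluster):
    node1, node2, node3 = cluster.nodelist()
    node1.start()      # seed node is up and gossiping

    node2.start()      # starts bootstrapping
    time.sleep(10)     # mirrors the pause added in the linked dtest change
    node3.start()      # now sees node2 and can be rejected as the test expects
{code}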

> Lower default num_tokens
> 
>
> Key: CASSANDRA-13701
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13701
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Config
>Reporter: Chris Lohfink
>Assignee: Alexander Dejanovski
>Priority: Low
> Fix For: 4.0-alpha
>
>
> For reasons highlighted in CASSANDRA-7032, the high number of vnodes is not 
> necessary. It is very expensive for operations processes and scanning. Its 
> come up a lot and its pretty standard and known now to always reduce the 
> num_tokens within the community. We should just lower the defaults.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-13701) Lower default num_tokens

2020-08-12 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-13701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176340#comment-17176340
 ] 

Alexander Dejanovski commented on CASSANDRA-13701:
--

[~jeromatron], I'm picking this up.

Initial observation is that on the test_simultaneous_bootstrap test, node2 
manages to bootstrap before node3 gets a chance to get kicked off.
I'll go through the Cassandra code paths of bootstrap in order to understand 
how the new token allocation algorithm impacts us here.

Will send an update soon.

> Lower default num_tokens
> 
>
> Key: CASSANDRA-13701
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13701
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Config
>Reporter: Chris Lohfink
>Assignee: Alexander Dejanovski
>Priority: Low
> Fix For: 4.0-alpha
>
>
> For reasons highlighted in CASSANDRA-7032, the high number of vnodes is not 
> necessary. It is very expensive for operations processes and scanning. Its 
> come up a lot and its pretty standard and known now to always reduce the 
> num_tokens within the community. We should just lower the defaults.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Assigned] (CASSANDRA-13701) Lower default num_tokens

2020-08-12 Thread Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski reassigned CASSANDRA-13701:


Assignee: Alexander Dejanovski

> Lower default num_tokens
> 
>
> Key: CASSANDRA-13701
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13701
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Config
>Reporter: Chris Lohfink
>Assignee: Alexander Dejanovski
>Priority: Low
> Fix For: 4.0-alpha
>
>
> For reasons highlighted in CASSANDRA-7032, the high number of vnodes is not 
> necessary. It is very expensive for operations processes and scanning. Its 
> come up a lot and its pretty standard and known now to always reduce the 
> num_tokens within the community. We should just lower the defaults.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-15878) Ec2Snitch fails on upgrade in legacy mode

2020-06-29 Thread Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-15878:
-
Test and Documentation Plan: unit tests added
 Status: Patch Available  (was: In Progress)

> Ec2Snitch fails on upgrade in legacy mode
> -
>
> Key: CASSANDRA-15878
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15878
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Distributed Metadata
>Reporter: Alexander Dejanovski
>Assignee: Alexander Dejanovski
>Priority: Normal
> Fix For: 4.0-beta
>
>
> CASSANDRA-7839 changed the way the EC2 DC/Rack naming was handled in the 
> Ec2Snitch to match AWS conventions.
> The "legacy" mode was introduced to allow upgrades from Cassandra 3.0/3.x and 
> keep the same naming as before (while the "standard" mode uses the new naming 
> convention).
> When performing an upgrade in the us-west-2 region, the second node failed to 
> start with the following exception:
>  
> {code:java}
> ERROR [main] 2020-06-16 09:14:42,218 Ec2Snitch.java:210 - This ec2-enabled 
> snitch appears to be using the legacy naming scheme for regions, but existing 
> nodes in cluster are using the opposite: region(s) = [us-west-2], 
> availability zone(s) = [2a]. Please check the ec2_naming_scheme property in 
> the cassandra-rackdc.properties configuration file for more details.
> ERROR [main] 2020-06-16 09:14:42,219 CassandraDaemon.java:789 - Exception 
> encountered during startup
> java.lang.IllegalStateException: null
>   at 
> org.apache.cassandra.service.StorageService.validateEndpointSnitch(StorageService.java:573)
>   at 
> org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:530)
>   at 
> org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:800)
>   at 
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:659)
>   at 
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:610)
>   at 
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:373)
>   at 
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:650)
>   at 
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:767)
> {code}
>  
> The exception leads back to [this piece of 
> code|https://github.com/apache/cassandra/blob/cassandra-4.0-alpha4/src/java/org/apache/cassandra/locator/Ec2Snitch.java#L183-L185].
> After adding some logging, it turned out the DC name of the first upgraded 
> node was considered invalid as a legacy one:
> {code:java}
> INFO  [main] 2020-06-16 09:14:42,216 Ec2Snitch.java:183 - Detected DC 
> us-west-2
> INFO  [main] 2020-06-16 09:14:42,217 Ec2Snitch.java:185 - 
> dcUsesLegacyFormat=false / usingLegacyNaming=true
> ERROR [main] 2020-06-16 09:14:42,217 Ec2Snitch.java:188 - Invalid DC name 
> us-west-2
> {code}
>  
> The problem is that the regex that's used to identify legacy dc names will 
> match both old and new names : 
> {code:java}
> boolean dcUsesLegacyFormat = !dc.matches("[a-z]+-[a-z].+-[\\d].*");
> {code}
> Knowing that some dc names didn't change between the two modes (us-west-2 for 
> example), I don't see how we can use the dc names to detect if the legacy 
> mode is being used by other nodes in the cluster.
>   
>  The rack names on the other hand are totally different in the legacy and 
> standard modes and can be used to detect mismatching settings.
>   
>  My go to fix would be to drop the check on datacenters by removing the 
> following lines: 
> [https://github.com/apache/cassandra/blob/cassandra-4.0-alpha4/src/java/org/apache/cassandra/locator/Ec2Snitch.java#L172-L186]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15878) Ec2Snitch fails on upgrade in legacy mode

2020-06-29 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147566#comment-17147566
 ] 

Alexander Dejanovski commented on CASSANDRA-15878:
--

Thanks for the feedback [~jolynch].

I've added back the DC name check and adjusted it as suggested. I provided 
accurate information on which cases we're actually covering now with this check.

[~mck], I've reintroduced the unit tests that I had deleted and changed the 
assertions where needed.

You can check the changes here: 
[https://github.com/apache/cassandra/compare/trunk...thelastpickle:CASSANDRA-15878]

Let me know what you think.

> Ec2Snitch fails on upgrade in legacy mode
> -
>
> Key: CASSANDRA-15878
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15878
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Distributed Metadata
>Reporter: Alexander Dejanovski
>Assignee: Alexander Dejanovski
>Priority: Normal
> Fix For: 4.0-beta
>
>
> CASSANDRA-7839 changed the way the EC2 DC/Rack naming was handled in the 
> Ec2Snitch to match AWS conventions.
> The "legacy" mode was introduced to allow upgrades from Cassandra 3.0/3.x and 
> keep the same naming as before (while the "standard" mode uses the new naming 
> convention).
> When performing an upgrade in the us-west-2 region, the second node failed to 
> start with the following exception:
>  
> {code:java}
> ERROR [main] 2020-06-16 09:14:42,218 Ec2Snitch.java:210 - This ec2-enabled 
> snitch appears to be using the legacy naming scheme for regions, but existing 
> nodes in cluster are using the opposite: region(s) = [us-west-2], 
> availability zone(s) = [2a]. Please check the ec2_naming_scheme property in 
> the cassandra-rackdc.properties configuration file for more details.
> ERROR [main] 2020-06-16 09:14:42,219 CassandraDaemon.java:789 - Exception 
> encountered during startup
> java.lang.IllegalStateException: null
>   at 
> org.apache.cassandra.service.StorageService.validateEndpointSnitch(StorageService.java:573)
>   at 
> org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:530)
>   at 
> org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:800)
>   at 
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:659)
>   at 
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:610)
>   at 
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:373)
>   at 
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:650)
>   at 
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:767)
> {code}
>  
> The exception leads back to [this piece of 
> code|https://github.com/apache/cassandra/blob/cassandra-4.0-alpha4/src/java/org/apache/cassandra/locator/Ec2Snitch.java#L183-L185].
> After adding some logging, it turned out the DC name of the first upgraded 
> node was considered invalid as a legacy one:
> {code:java}
> INFO  [main] 2020-06-16 09:14:42,216 Ec2Snitch.java:183 - Detected DC 
> us-west-2
> INFO  [main] 2020-06-16 09:14:42,217 Ec2Snitch.java:185 - 
> dcUsesLegacyFormat=false / usingLegacyNaming=true
> ERROR [main] 2020-06-16 09:14:42,217 Ec2Snitch.java:188 - Invalid DC name 
> us-west-2
> {code}
>  
> The problem is that the regex that's used to identify legacy dc names will 
> match both old and new names : 
> {code:java}
> boolean dcUsesLegacyFormat = !dc.matches("[a-z]+-[a-z].+-[\\d].*");
> {code}
> Knowing that some dc names didn't change between the two modes (us-west-2 for 
> example), I don't see how we can use the dc names to detect if the legacy 
> mode is being used by other nodes in the cluster.
>   
>  The rack names on the other hand are totally different in the legacy and 
> standard modes and can be used to detect mismatching settings.
>   
>  My go to fix would be to drop the check on datacenters by removing the 
> following lines: 
> [https://github.com/apache/cassandra/blob/cassandra-4.0-alpha4/src/java/org/apache/cassandra/locator/Ec2Snitch.java#L172-L186]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15878) Ec2Snitch fails on upgrade in legacy mode

2020-06-26 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17146442#comment-17146442
 ] 

Alexander Dejanovski commented on CASSANDRA-15878:
--

I've pushed a commit with a potential fix and updated unit tests: 
[https://github.com/apache/cassandra/pull/653/commits/7a53846a217102143ae56416ebcf534c59de93e6]

[~jolynch], I'd love to have your input on this since you reviewed the original 
ticket that brought this change.

Are there cases I'm not seeing where the dc name would be useful to check?

> Ec2Snitch fails on upgrade in legacy mode
> -
>
> Key: CASSANDRA-15878
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15878
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Distributed Metadata
>Reporter: Alexander Dejanovski
>Assignee: Alexander Dejanovski
>Priority: Normal
> Fix For: 4.0-beta
>
>
> CASSANDRA-7839 changed the way the EC2 DC/Rack naming was handled in the 
> Ec2Snitch to match AWS conventions.
> The "legacy" mode was introduced to allow upgrades from Cassandra 3.0/3.x and 
> keep the same naming as before (while the "standard" mode uses the new naming 
> convention).
> When performing an upgrade in the us-west-2 region, the second node failed to 
> start with the following exception:
>  
> {code:java}
> ERROR [main] 2020-06-16 09:14:42,218 Ec2Snitch.java:210 - This ec2-enabled 
> snitch appears to be using the legacy naming scheme for regions, but existing 
> nodes in cluster are using the opposite: region(s) = [us-west-2], 
> availability zone(s) = [2a]. Please check the ec2_naming_scheme property in 
> the cassandra-rackdc.properties configuration file for more details.
> ERROR [main] 2020-06-16 09:14:42,219 CassandraDaemon.java:789 - Exception 
> encountered during startup
> java.lang.IllegalStateException: null
>   at 
> org.apache.cassandra.service.StorageService.validateEndpointSnitch(StorageService.java:573)
>   at 
> org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:530)
>   at 
> org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:800)
>   at 
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:659)
>   at 
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:610)
>   at 
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:373)
>   at 
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:650)
>   at 
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:767)
> {code}
>  
> The exception leads back to [this piece of 
> code|https://github.com/apache/cassandra/blob/cassandra-4.0-alpha4/src/java/org/apache/cassandra/locator/Ec2Snitch.java#L183-L185].
> After adding some logging, it turned out the DC name of the first upgraded 
> node was considered invalid as a legacy one:
> {code:java}
> INFO  [main] 2020-06-16 09:14:42,216 Ec2Snitch.java:183 - Detected DC 
> us-west-2
> INFO  [main] 2020-06-16 09:14:42,217 Ec2Snitch.java:185 - 
> dcUsesLegacyFormat=false / usingLegacyNaming=true
> ERROR [main] 2020-06-16 09:14:42,217 Ec2Snitch.java:188 - Invalid DC name 
> us-west-2
> {code}
>  
> The problem is that the regex that's used to identify legacy dc names will 
> match both old and new names : 
> {code:java}
> boolean dcUsesLegacyFormat = !dc.matches("[a-z]+-[a-z].+-[\\d].*");
> {code}
> Knowing that some dc names didn't change between the two modes (us-west-2 for 
> example), I don't see how we can use the dc names to detect if the legacy 
> mode is being used by other nodes in the cluster.
>   
>  The rack names on the other hand are totally different in the legacy and 
> standard modes and can be used to detect mismatching settings.
>   
>  My go to fix would be to drop the check on datacenters by removing the 
> following lines: 
> [https://github.com/apache/cassandra/blob/cassandra-4.0-alpha4/src/java/org/apache/cassandra/locator/Ec2Snitch.java#L172-L186]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Assigned] (CASSANDRA-15878) Ec2Snitch fails on upgrade in legacy mode

2020-06-24 Thread Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski reassigned CASSANDRA-15878:


Assignee: Alexander Dejanovski

> Ec2Snitch fails on upgrade in legacy mode
> -
>
> Key: CASSANDRA-15878
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15878
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Distributed Metadata
>Reporter: Alexander Dejanovski
>Assignee: Alexander Dejanovski
>Priority: Normal
> Fix For: 4.0-beta
>
>
> CASSANDRA-7839 changed the way the EC2 DC/Rack naming was handled in the 
> Ec2Snitch to match AWS conventions.
> The "legacy" mode was introduced to allow upgrades from Cassandra 3.0/3.x and 
> keep the same naming as before (while the "standard" mode uses the new naming 
> convention).
> When performing an upgrade in the us-west-2 region, the second node failed to 
> start with the following exception:
>  
> {code:java}
> ERROR [main] 2020-06-16 09:14:42,218 Ec2Snitch.java:210 - This ec2-enabled 
> snitch appears to be using the legacy naming scheme for regions, but existing 
> nodes in cluster are using the opposite: region(s) = [us-west-2], 
> availability zone(s) = [2a]. Please check the ec2_naming_scheme property in 
> the cassandra-rackdc.properties configuration file for more details.
> ERROR [main] 2020-06-16 09:14:42,219 CassandraDaemon.java:789 - Exception 
> encountered during startup
> java.lang.IllegalStateException: null
>   at 
> org.apache.cassandra.service.StorageService.validateEndpointSnitch(StorageService.java:573)
>   at 
> org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:530)
>   at 
> org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:800)
>   at 
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:659)
>   at 
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:610)
>   at 
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:373)
>   at 
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:650)
>   at 
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:767)
> {code}
>  
> The exception leads back to [this piece of 
> code|https://github.com/apache/cassandra/blob/cassandra-4.0-alpha4/src/java/org/apache/cassandra/locator/Ec2Snitch.java#L183-L185].
> After adding some logging, it turned out the DC name of the first upgraded 
> node was considered invalid as a legacy one:
> {code:java}
> INFO  [main] 2020-06-16 09:14:42,216 Ec2Snitch.java:183 - Detected DC 
> us-west-2
> INFO  [main] 2020-06-16 09:14:42,217 Ec2Snitch.java:185 - 
> dcUsesLegacyFormat=false / usingLegacyNaming=true
> ERROR [main] 2020-06-16 09:14:42,217 Ec2Snitch.java:188 - Invalid DC name 
> us-west-2
> {code}
>  
> The problem is that the regex that's used to identify legacy dc names will 
> match both old and new names : 
> {code:java}
> boolean dcUsesLegacyFormat = !dc.matches("[a-z]+-[a-z].+-[\\d].*");
> {code}
> Knowing that some dc names didn't change between the two modes (us-west-2 for 
> example), I don't see how we can use the dc names to detect if the legacy 
> mode is being used by other nodes in the cluster.
>   
>  The rack names on the other hand are totally different in the legacy and 
> standard modes and can be used to detect mismatching settings.
>   
>  My go to fix would be to drop the check on datacenters by removing the 
> following lines: 
> [https://github.com/apache/cassandra/blob/cassandra-4.0-alpha4/src/java/org/apache/cassandra/locator/Ec2Snitch.java#L172-L186]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (CASSANDRA-15878) Ec2Snitch fails on upgrade in legacy mode

2020-06-16 Thread Alexander Dejanovski (Jira)
Alexander Dejanovski created CASSANDRA-15878:


 Summary: Ec2Snitch fails on upgrade in legacy mode
 Key: CASSANDRA-15878
 URL: https://issues.apache.org/jira/browse/CASSANDRA-15878
 Project: Cassandra
  Issue Type: Bug
Reporter: Alexander Dejanovski


CASSANDRA-7839 changed the way the EC2 DC/Rack naming was handled in the 
Ec2Snitch to match AWS conventions.

The "legacy" mode was introduced to allow upgrades from Cassandra 3.0/3.x and 
keep the same naming as before (while the "standard" mode uses the new naming 
convention).

When performing an upgrade in the us-west-2 region, the second node failed to 
start with the following exception:

 
{code:java}
ERROR [main] 2020-06-16 09:14:42,218 Ec2Snitch.java:210 - This ec2-enabled 
snitch appears to be using the legacy naming scheme for regions, but existing 
nodes in cluster are using the opposite: region(s) = [us-west-2], availability 
zone(s) = [2a]. Please check the ec2_naming_scheme property in the 
cassandra-rackdc.properties configuration file for more details.
ERROR [main] 2020-06-16 09:14:42,219 CassandraDaemon.java:789 - Exception 
encountered during startup
java.lang.IllegalStateException: null
at 
org.apache.cassandra.service.StorageService.validateEndpointSnitch(StorageService.java:573)
at 
org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:530)
at 
org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:800)
at 
org.apache.cassandra.service.StorageService.initServer(StorageService.java:659)
at 
org.apache.cassandra.service.StorageService.initServer(StorageService.java:610)
at 
org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:373)
at 
org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:650)
at 
org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:767)
{code}
 

The exception leads back to [this piece of 
code|https://github.com/apache/cassandra/blob/cassandra-4.0-alpha4/src/java/org/apache/cassandra/locator/Ec2Snitch.java#L183-L185].

After adding some logging, it turned out the DC name of the first upgraded node 
was considered invalid under the legacy naming scheme:
{code:java}
INFO  [main] 2020-06-16 09:14:42,216 Ec2Snitch.java:183 - Detected DC us-west-2
INFO  [main] 2020-06-16 09:14:42,217 Ec2Snitch.java:185 - 
dcUsesLegacyFormat=false / usingLegacyNaming=true
ERROR [main] 2020-06-16 09:14:42,217 Ec2Snitch.java:188 - Invalid DC name 
us-west-2
{code}
 

The problem is that the regex that's used to identify legacy DC names will 
match both old and new names:
{code:java}
boolean dcUsesLegacyFormat = !dc.matches("[a-z]+-[a-z].+-[\\d].*");
{code}
Since some DC names didn't change between the two modes (us-west-2 for 
example), I don't see how we can use DC names to detect whether the legacy mode 
is being used by other nodes in the cluster.
  
 The rack names, on the other hand, are totally different in the legacy and 
standard modes and can be used to detect mismatching settings.
  
 My go-to fix would be to drop the check on datacenters by removing the 
following lines: 
[https://github.com/apache/cassandra/blob/cassandra-4.0-alpha4/src/java/org/apache/cassandra/locator/Ec2Snitch.java#L172-L186]
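
A quick check of that claim, using Python's re.fullmatch as the equivalent of Java's String.matches (both require the whole string to match):

{code:python}
# Quick check of the claim above: the "standard format" regex also matches DC
# names that are identical under both naming schemes.
import re

STANDARD_FORMAT = r"[a-z]+-[a-z].+-[\d].*"  # the regex used by Ec2Snitch

for dc in ["us-west-2", "us-east", "us-east-1"]:
    dc_uses_legacy_format = re.fullmatch(STANDARD_FORMAT, dc) is None
    print(f"{dc}: dcUsesLegacyFormat={dc_uses_legacy_format}")

# Prints:
#   us-west-2: dcUsesLegacyFormat=False
#   us-east: dcUsesLegacyFormat=True
#   us-east-1: dcUsesLegacyFormat=False
# "us-west-2" is both the legacy and the standard DC name in that region, so the
# check concludes the standard scheme is in use even on a cluster upgraded with
# ec2_naming_scheme=legacy, which is what trips the IllegalStateException above.
{code}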



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15661) Improve logging by using more appropriate levels

2020-04-01 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072443#comment-17072443
 ] 

Alexander Dejanovski commented on CASSANDRA-15661:
--

Thanks for adding more info on the native connection limit logging.

I checked the test results, and indeed I don't see how changing the logging 
levels could be responsible for the dtest failures here.

The patch looks good to me now (y)

>  Improve logging by using more appropriate levels
> -
>
> Key: CASSANDRA-15661
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15661
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Observability/Logging
>Reporter: Jon Haddad
>Assignee: Jon Haddad
>Priority: Normal
>  Labels: pull-request-available
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> There are a number of log statements using logging levels that are a bit too 
> conservative.  For example:
> * Flushing memtables is currently at debug.  This is a relatively rare event 
> that is important enough to be INFO
> * When compaction finishes we log the progress at debug
> * Different steps in incremental repair are logged as debug, should be INFO
> * when reaching connection limits in ConnectionLimitHandler.java we log at 
> warn rather than error.  Since this is a client disconnect it’s more than a 
> warning, we’re taking action and disconnecting.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15661) Improve logging by using more appropriate levels

2020-03-27 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068580#comment-17068580
 ] 

Alexander Dejanovski commented on CASSANDRA-15661:
--

[~rustyrazorblade], I'm overall super happy to get more logging back in 
system.log.
Having everything that happens in flushes and compactions at debug level has 
made my life as an operator much harder over the past years.
I added a few comments on the PR regarding entries that may not be a fit for 
the INFO level. They look more like ways to debug some implementation details 
around repair.

Let me know what you think.
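
As a side note for operators (not part of the patch under review), individual loggers can already be tuned at runtime with nodetool setlogginglevel; a minimal sketch, where the logger names are examples and should be checked against the classes that actually emit the flush/compaction messages:

{code:python}
# Side note, not part of the patch: per-logger levels can be changed at runtime
# with `nodetool setlogginglevel`. The logger names below are examples; whether
# the messages land in system.log or debug.log still depends on the appender
# thresholds configured in logback.xml.
import subprocess

LOGGERS = [
    "org.apache.cassandra.db.ColumnFamilyStore",          # memtable flush messages
    "org.apache.cassandra.db.compaction.CompactionTask",  # compaction completion messages
]

for logger in LOGGERS:
    subprocess.run(["nodetool", "setlogginglevel", logger, "DEBUG"], check=True)
{code}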

>  Improve logging by using more appropriate levels
> -
>
> Key: CASSANDRA-15661
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15661
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Observability/Logging
>Reporter: Jon Haddad
>Assignee: Jon Haddad
>Priority: Normal
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> There are a number of log statements using logging levels that are a bit too 
> conservative.  For example:
> * Flushing memtables is currently at debug.  This is a relatively rare event 
> that is important enough to be INFO
> * When compaction finishes we log the progress at debug
> * Different steps in incremental repair are logged as debug, should be INFO
> * when reaching connection limits in ConnectionLimitHandler.java we log at 
> warn rather than error.  Since this is a client disconnect it’s more than a 
> warning, we’re taking action and disconnecting.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15661) Improve logging by using more appropriate levels

2020-03-27 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068385#comment-17068385
 ] 

Alexander Dejanovski commented on CASSANDRA-15661:
--

Starting the review on this ticket.

>  Improve logging by using more appropriate levels
> -
>
> Key: CASSANDRA-15661
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15661
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Observability/Logging
>Reporter: Jon Haddad
>Assignee: Jon Haddad
>Priority: Normal
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There are a number of log statements using logging levels that are a bit too 
> conservative.  For example:
> * Flushing memtables is currently at debug.  This is a relatively rare event 
> that is important enough to be INFO
> * When compaction finishes we log the progress at debug
> * Different steps in incremental repair are logged as debug, should be INFO
> * when reaching connection limits in ConnectionLimitHandler.java we log at 
> warn rather than error.  Since this is a client disconnect it’s more than a 
> warning, we’re taking action and disconnecting.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-11105) cassandra-stress tool - InvalidQueryException: Batch too large

2020-01-30 Thread Alexander Dejanovski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-11105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026822#comment-17026822
 ] 

Alexander Dejanovski commented on CASSANDRA-11105:
--

I agree with [~mck].

The code has evolved too much since my patch was written anyway, and internally 
we've shifted our efforts to a cassandra-stress replacement tool.

Happy to have the ticket closed as "won't do".

> cassandra-stress tool - InvalidQueryException: Batch too large
> --
>
> Key: CASSANDRA-11105
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11105
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Tools
> Environment: Cassandra 2.2.4, Java 8, CentOS 6.5
>Reporter: Ralf Steppacher
>Priority: Normal
> Fix For: 4.0
>
> Attachments: 11105-trunk.txt, batch_too_large.yaml
>
>
> I am using Cassandra 2.2.4 and I am struggling to get the cassandra-stress 
> tool to work for my test scenario. I have followed the example on 
> http://www.datastax.com/dev/blog/improved-cassandra-2-1-stress-tool-benchmark-any-schema
>  to create a yaml file describing my test (attached).
> I am collecting events per user id (text, partition key). Events have a 
> session type (text), event type (text), and creation time (timestamp) 
> (clustering keys, in that order). Plus some more attributes required for 
> rendering the events in a UI. For testing purposes I ended up with the 
> following column spec and insert distribution:
> {noformat}
> columnspec:
>   - name: created_at
> cluster: uniform(10..1)
>   - name: event_type
> size: uniform(5..10)
> population: uniform(1..30)
> cluster: uniform(1..30)
>   - name: session_type
> size: fixed(5)
> population: uniform(1..4)
> cluster: uniform(1..4)
>   - name: user_id
> size: fixed(15)
> population: uniform(1..100)
>   - name: message
> size: uniform(10..100)
> population: uniform(1..100B)
> insert:
>   partitions: fixed(1)
>   batchtype: UNLOGGED
>   select: fixed(1)/120
> {noformat}
> Running stress tool for just the insert prints 
> {noformat}
> Generating batches with [1..1] partitions and [0..1] rows (of [10..120] 
> total rows in the partitions)
> {noformat}
> and then immediately starts flooding me with 
> {{com.datastax.driver.core.exceptions.InvalidQueryException: Batch too 
> large}}. 
> Why I should be exceeding the {{batch_size_fail_threshold_in_kb: 50}} in the 
> {{cassandra.yaml}} I do not understand. My understanding is that the stress 
> tool should generate one row per batch. The size of a single row should not 
> exceed {{8+10*3+5*3+15*3+100*3 = 398 bytes}}. Assuming a worst case of all 
> text characters being 3 byte unicode characters. 
> This is how I start the attached user scenario:
> {noformat}
> [rsteppac@centos bin]$ ./cassandra-stress user 
> profile=../batch_too_large.yaml ops\(insert=1\) -log level=verbose 
> file=~/centos_event_by_patient_session_event_timestamp_insert_only.log -node 
> 10.211.55.8
> INFO  08:00:07 Did not find Netty's native epoll transport in the classpath, 
> defaulting to NIO.
> INFO  08:00:08 Using data-center name 'datacenter1' for 
> DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct 
> datacenter name with DCAwareRoundRobinPolicy constructor)
> INFO  08:00:08 New Cassandra host /10.211.55.8:9042 added
> Connected to cluster: Titan_DEV
> Datatacenter: datacenter1; Host: /10.211.55.8; Rack: rack1
> Created schema. Sleeping 1s for propagation.
> Generating batches with [1..1] partitions and [0..1] rows (of [10..120] 
> total rows in the partitions)
> com.datastax.driver.core.exceptions.InvalidQueryException: Batch too large
>   at 
> com.datastax.driver.core.exceptions.InvalidQueryException.copy(InvalidQueryException.java:35)
>   at 
> com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:271)
>   at 
> com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:185)
>   at 
> com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:55)
>   at 
> org.apache.cassandra.stress.operations.userdefined.SchemaInsert$JavaDriverRun.run(SchemaInsert.java:87)
>   at 
> org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:159)
>   at 
> org.apache.cassandra.stress.operations.userdefined.SchemaInsert.run(SchemaInsert.java:119)
>   at 
> org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:309)
> Caused by: com.datastax.driver.core.exceptions.InvalidQueryException: Batch 
> too large
>   at 
> com.datastax.driver.core.Responses$Error.asException(Responses.java:125)
>   at 
> 

[jira] [Commented] (CASSANDRA-14685) Incremental repair 4.0 : SSTables remain locked forever if the coordinator dies during streaming

2018-09-18 Thread Alexander Dejanovski (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16619273#comment-16619273
 ] 

Alexander Dejanovski commented on CASSANDRA-14685:
--

Hi [~bdeggleston],

Fair enough, I reckon I didn't wait that long for the SSTables to be released.

If the SSTables get released eventually and you can't detect all types of 
failures to release them, I guess it would be worth failing a repair if some 
SSTables with overlapping token ranges are still part of another repair session.
Otherwise, you're left with the impression that running a repair worked 
correctly although some SSTables were skipped (and will be rolled back later). 
Wdyt?
Advising to use "nodetool repair_admin" in the error message would help users 
discover this new command. Stopping the session with it did the trick and the 
SSTables were released as expected.
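
For reference, here is a rough sketch of how the command can be used on a 4.0 
node (the list/cancel subcommands and their flags are assumed from the 4.0 
syntax and may differ slightly between trunk builds; the session id below is 
only illustrative) :
{noformat}
# list the incremental repair sessions known to this node
nodetool repair_admin list

# cancel a stuck session by its id so the SSTables get released
nodetool repair_admin cancel --session 3844a400-ad33-11e8-b5a7-6b8dd8f31b62
{noformat}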

One weird streaming behavior is that when the coordinator goes down, 
"nodetool netstats" still shows progress on the replicas until it reaches 100%, 
and then stays there. It even starts streaming new files although the target 
node is still down.

> Incremental repair 4.0 : SSTables remain locked forever if the coordinator 
> dies during streaming 
> -
>
> Key: CASSANDRA-14685
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14685
> Project: Cassandra
>  Issue Type: Bug
>  Components: Repair
>Reporter: Alexander Dejanovski
>Assignee: Jason Brown
>Priority: Critical
>
> The changes in CASSANDRA-9143 modified the way incremental repair performs by 
> applying the following sequence of events : 
>  * Anticompaction is executed on all replicas for all SSTables overlapping 
> the repaired ranges
>  * Anticompacted SSTables are then marked as "Pending repair" and cannot be 
> compacted anymore, nor part of another repair session
>  * Merkle trees are generated and compared
>  * Streaming takes place if needed
>  * Anticompaction is committed and "pending repair" tables are marked as 
> repaired if it succeeded, or they are released if the repair session failed.
> If the repair coordinator dies during the streaming phase, *the SSTables on 
> the replicas will remain in "pending repair" state and will never be eligible 
> for repair or compaction*, even after all the nodes in the cluster are 
> restarted. 
> Steps to reproduce (I've used Jason's 13938 branch that fixes streaming 
> errors) : 
> {noformat}
> ccm create inc-repair-issue -v github:jasobrown/13938 -n 3
> # Allow jmx access and remove all rpc_ settings in yaml
> for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra-env.sh;
> do
>   sed -i'' -e 
> 's/com.sun.management.jmxremote.authenticate=true/com.sun.management.jmxremote.authenticate=false/g'
>  $f
> done
> for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra.yaml;
> do
>   grep -v "rpc_" $f > ${f}.tmp
>   cat ${f}.tmp > $f
> done
> ccm start
> {noformat}
> I used [tlp-stress|https://github.com/thelastpickle/tlp-stress] to generate a 
> few 10s of MBs of data (killed it after some time). Obviously 
> cassandra-stress works as well :
> {noformat}
> bin/tlp-stress run BasicTimeSeries -i 1M -p 1M -t 2 --rate 5000  
> --replication "{'class':'SimpleStrategy', 'replication_factor':2}"   
> --compaction "{'class': 'SizeTieredCompactionStrategy'}"   --host 
> 127.0.0.1
> {noformat}
> Flush and delete all SSTables in node1 :
> {noformat}
> ccm node1 nodetool flush
> ccm node1 stop
> rm -f ~/.ccm/inc-repair-issue/node1/data0/tlp_stress/sensor*/*.*
> ccm node1 start{noformat}
> Then throttle streaming throughput to 1MB/s so we have time to take node1 
> down during the streaming phase and run repair:
> {noformat}
> ccm node1 nodetool setstreamthroughput 1
> ccm node2 nodetool setstreamthroughput 1
> ccm node3 nodetool setstreamthroughput 1
> ccm node1 nodetool repair tlp_stress
> {noformat}
> Once streaming starts, shut down node1 and start it again :
> {noformat}
> ccm node1 stop
> ccm node1 start
> {noformat}
> Run repair again :
> {noformat}
> ccm node1 nodetool repair tlp_stress
> {noformat}
> The command will return very quickly, showing that it skipped all sstables :
> {noformat}
> [2018-08-31 19:05:16,292] Repair completed successfully
> [2018-08-31 19:05:16,292] Repair command #1 finished in 2 seconds
> $ ccm node1 nodetool status
> Datacenter: datacenter1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  AddressLoad   Tokens   OwnsHost ID
>Rack
> UN  127.0.0.1  228,64 KiB  256  ?   
> 437dc9cd-b1a1-41a5-961e-cfc99763e29f  rack1
> UN  127.0.0.2  60,09 MiB  256  ?   
> fbcbbdbb-e32a-4716-8230-8ca59aa93e62  rack1
> UN  127.0.0.3  57,59 MiB  256

[jira] [Commented] (CASSANDRA-14685) Incremental repair 4.0 : SSTables remain locked forever if the coordinator dies during streaming

2018-09-11 Thread Alexander Dejanovski (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16610574#comment-16610574
 ] 

Alexander Dejanovski commented on CASSANDRA-14685:
--

[~jasobrown], sure thing, no worries.

For the "pending repair" part, I must add that it doesn't happen when a replica 
node goes down during repair, even if it comes back way after repair is over on 
the coordinator. Shortly after restart, SSTables are correctly released from 
the pending repair.

It's only when the coordinator goes down that replicas remain in pending repair 
state, even after a restart of the Cassandra process on these nodes.

> Incremental repair 4.0 : SSTables remain locked forever if the coordinator 
> dies during streaming 
> -
>
> Key: CASSANDRA-14685
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14685
> Project: Cassandra
>  Issue Type: Bug
>  Components: Repair
>Reporter: Alexander Dejanovski
>Assignee: Jason Brown
>Priority: Critical
>
> The changes in CASSANDRA-9143 modified the way incremental repair performs by 
> applying the following sequence of events : 
>  * Anticompaction is executed on all replicas for all SSTables overlapping 
> the repaired ranges
>  * Anticompacted SSTables are then marked as "Pending repair" and cannot be 
> compacted anymore, nor part of another repair session
>  * Merkle trees are generated and compared
>  * Streaming takes place if needed
>  * Anticompaction is committed and "pending repair" tables are marked as 
> repaired if it succeeded, or they are released if the repair session failed.
> If the repair coordinator dies during the streaming phase, *the SSTables on 
> the replicas will remain in "pending repair" state and will never be eligible 
> for repair or compaction*, even after all the nodes in the cluster are 
> restarted. 
> Steps to reproduce (I've used Jason's 13938 branch that fixes streaming 
> errors) : 
> {noformat}
> ccm create inc-repair-issue -v github:jasobrown/13938 -n 3
> # Allow jmx access and remove all rpc_ settings in yaml
> for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra-env.sh;
> do
>   sed -i'' -e 
> 's/com.sun.management.jmxremote.authenticate=true/com.sun.management.jmxremote.authenticate=false/g'
>  $f
> done
> for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra.yaml;
> do
>   grep -v "rpc_" $f > ${f}.tmp
>   cat ${f}.tmp > $f
> done
> ccm start
> {noformat}
> I used [tlp-stress|https://github.com/thelastpickle/tlp-stress] to generate a 
> few 10s of MBs of data (killed it after some time). Obviously 
> cassandra-stress works as well :
> {noformat}
> bin/tlp-stress run BasicTimeSeries -i 1M -p 1M -t 2 --rate 5000  
> --replication "{'class':'SimpleStrategy', 'replication_factor':2}"   
> --compaction "{'class': 'SizeTieredCompactionStrategy'}"   --host 
> 127.0.0.1
> {noformat}
> Flush and delete all SSTables in node1 :
> {noformat}
> ccm node1 nodetool flush
> ccm node1 stop
> rm -f ~/.ccm/inc-repair-issue/node1/data0/tlp_stress/sensor*/*.*
> ccm node1 start{noformat}
> Then throttle streaming throughput to 1MB/s so we have time to take node1 
> down during the streaming phase and run repair:
> {noformat}
> ccm node1 nodetool setstreamthroughput 1
> ccm node2 nodetool setstreamthroughput 1
> ccm node3 nodetool setstreamthroughput 1
> ccm node1 nodetool repair tlp_stress
> {noformat}
> Once streaming starts, shut down node1 and start it again :
> {noformat}
> ccm node1 stop
> ccm node1 start
> {noformat}
> Run repair again :
> {noformat}
> ccm node1 nodetool repair tlp_stress
> {noformat}
> The command will return very quickly, showing that it skipped all sstables :
> {noformat}
> [2018-08-31 19:05:16,292] Repair completed successfully
> [2018-08-31 19:05:16,292] Repair command #1 finished in 2 seconds
> $ ccm node1 nodetool status
> Datacenter: datacenter1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  AddressLoad   Tokens   OwnsHost ID
>Rack
> UN  127.0.0.1  228,64 KiB  256  ?   
> 437dc9cd-b1a1-41a5-961e-cfc99763e29f  rack1
> UN  127.0.0.2  60,09 MiB  256  ?   
> fbcbbdbb-e32a-4716-8230-8ca59aa93e62  rack1
> UN  127.0.0.3  57,59 MiB  256  ?   
> a0b1bcc6-0fad-405a-b0bf-180a0ca31dd0  rack1
> {noformat}
> sstablemetadata will then show that nodes 2 and 3 have SSTables still in 
> "pending repair" state :
> {noformat}
> ~/.ccm/repository/gitCOLONtrunk/tools/bin/sstablemetadata na-4-big-Data.db | 
> grep repair
> SSTable: 
> /Users/adejanovski/.ccm/inc-repair-4.0/node2/data0/tlp_stress/sensor_data-b7375660ad3111e8a0e59357ff9c9bda/na-4-big
> Pending repair: 3844a400-ad33-11e8-b5a7-6b8dd8f31b62
> {noformat}
> 

[jira] [Commented] (CASSANDRA-14685) Incremental repair 4.0 : SSTables remain locked forever if the coordinator dies during streaming

2018-08-31 Thread Alexander Dejanovski (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599128#comment-16599128
 ] 

Alexander Dejanovski commented on CASSANDRA-14685:
--

[~jasobrown],

indeed, nodes 2 and 3 are still showing ongoing streams although node1 is down 
: 

 
{noformat}
$ ccm node2 nodetool netstats
Mode: NORMAL
Repair e28883b0-ad4b-11e8-82ca-5fbf27df5fb6
 /127.0.0.1
 Sending 2 files, 49304220 bytes total. Already sent 0 files, 5373952 bytes 
total
 
/Users/adejanovski/.ccm/inc-repair-issue/node2/data0/tlp_stress/sensor_data-67193da0ad4b11e88663cb45de9ab9e9/na-9-big-Data.db
 5373952/34243878 bytes(15%) sent to idx:0/127.0.0.1
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name Active Pending Completed Dropped
Large messages n/a 0 2 0
Small messages n/a 0 244612 0
Gossip messages n/a 23 531 0
$ ccm node3 nodetool netstats
Mode: NORMAL
Repair e269d820-ad4b-11e8-82ca-5fbf27df5fb6
 /127.0.0.1
 Sending 2 files, 49166315 bytes total. Already sent 1 files, 11748602 bytes 
total
 
/Users/adejanovski/.ccm/inc-repair-issue/node3/data0/tlp_stress/sensor_data-67193da0ad4b11e88663cb45de9ab9e9/na-11-big-Data.db
 8865018/8865018 bytes(100%) sent to idx:0/127.0.0.1
 
/Users/adejanovski/.ccm/inc-repair-issue/node3/data0/tlp_stress/sensor_data-67193da0ad4b11e88663cb45de9ab9e9/na-9-big-Data.db
 2883584/34198115 bytes(8%) sent to idx:0/127.0.0.1
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name Active Pending Completed Dropped
Large messages n/a 0 2 0
Small messages n/a 0 244611 0
Gossip messages n/a 0 820 0
{noformat}
 

> Incremental repair 4.0 : SSTables remain locked forever if the coordinator 
> dies during streaming 
> -
>
> Key: CASSANDRA-14685
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14685
> Project: Cassandra
>  Issue Type: Bug
>  Components: Repair
>Reporter: Alexander Dejanovski
>Assignee: Jason Brown
>Priority: Critical
>
> The changes in CASSANDRA-9143 modified the way incremental repair performs by 
> applying the following sequence of events : 
>  * Anticompaction is executed on all replicas for all SSTables overlapping 
> the repaired ranges
>  * Anticompacted SSTables are then marked as "Pending repair" and cannot be 
> compacted anymore, nor part of another repair session
>  * Merkle trees are generated and compared
>  * Streaming takes place if needed
>  * Anticompaction is committed and "pending repair" tables are marked as 
> repaired if it succeeded, or they are released if the repair session failed.
> If the repair coordinator dies during the streaming phase, *the SSTables on 
> the replicas will remain in "pending repair" state and will never be eligible 
> for repair or compaction*, even after all the nodes in the cluster are 
> restarted. 
> Steps to reproduce (I've used Jason's 13938 branch that fixes streaming 
> errors) : 
> {noformat}
> ccm create inc-repair-issue -v github:jasobrown/13938 -n 3
> # Allow jmx access and remove all rpc_ settings in yaml
> for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra-env.sh;
> do
>   sed -i'' -e 
> 's/com.sun.management.jmxremote.authenticate=true/com.sun.management.jmxremote.authenticate=false/g'
>  $f
> done
> for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra.yaml;
> do
>   grep -v "rpc_" $f > ${f}.tmp
>   cat ${f}.tmp > $f
> done
> ccm start
> {noformat}
> I used [tlp-stress|https://github.com/thelastpickle/tlp-stress] to generate a 
> few 10s of MBs of data (killed it after some time). Obviously 
> cassandra-stress works as well :
> {noformat}
> bin/tlp-stress run BasicTimeSeries -i 1M -p 1M -t 2 --rate 5000  
> --replication "{'class':'SimpleStrategy', 'replication_factor':2}"   
> --compaction "{'class': 'SizeTieredCompactionStrategy'}"   --host 
> 127.0.0.1
> {noformat}
> Flush and delete all SSTables in node1 :
> {noformat}
> ccm node1 nodetool flush
> ccm node1 stop
> rm -f ~/.ccm/inc-repair-issue/node1/data0/tlp_stress/sensor*/*.*
> ccm node1 start{noformat}
> Then throttle streaming throughput to 1MB/s so we have time to take node1 
> down during the streaming phase and run repair:
> {noformat}
> ccm node1 nodetool setstreamthroughput 1
> ccm node2 nodetool setstreamthroughput 1
> ccm node3 nodetool setstreamthroughput 1
> ccm node1 nodetool repair tlp_stress
> {noformat}
> Once streaming starts, shut down node1 and start it again :
> {noformat}
> ccm node1 stop
> ccm node1 start
> {noformat}
> Run repair again :
> {noformat}
> ccm node1 nodetool repair tlp_stress
> {noformat}
> The command will return very quickly, showing that it skipped all sstables :
> {noformat}
> [2018-08-31 19:05:16,292] Repair completed successfully
> 

[jira] [Updated] (CASSANDRA-14685) Incremental repair 4.0 : SSTables remain locked forever if the coordinator dies during streaming

2018-08-31 Thread Alexander Dejanovski (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-14685:
-
Description: 
The changes in CASSANDRA-9143 modified the way incremental repair performs by 
applying the following sequence of events : 
 * Anticompaction is executed on all replicas for all SSTables overlapping the 
repaired ranges
 * Anticompacted SSTables are then marked as "Pending repair" and cannot be 
compacted anymore, nor part of another repair session
 * Merkle trees are generated and compared
 * Streaming takes place if needed
 * Anticompaction is committed and "pending repair" tables are marked as 
repaired if it succeeded, or they are released if the repair session failed.

If the repair coordinator dies during the streaming phase, *the SSTables on the 
replicas will remain in "pending repair" state and will never be eligible for 
repair or compaction*, even after all the nodes in the cluster are restarted. 

Steps to reproduce (I've used Jason's 13938 branch that fixes streaming errors) 
: 
{noformat}
ccm create inc-repair-issue -v github:jasobrown/13938 -n 3

# Allow jmx access and remove all rpc_ settings in yaml
for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra-env.sh;
do
  sed -i'' -e 
's/com.sun.management.jmxremote.authenticate=true/com.sun.management.jmxremote.authenticate=false/g'
 $f
done

for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra.yaml;
do
  grep -v "rpc_" $f > ${f}.tmp
  cat ${f}.tmp > $f
done

ccm start
{noformat}
I used [tlp-stress|https://github.com/thelastpickle/tlp-stress] to generate a 
few 10s of MBs of data (killed it after some time). Obviously cassandra-stress 
works as well :
{noformat}
bin/tlp-stress run BasicTimeSeries -i 1M -p 1M -t 2 --rate 5000  
--replication "{'class':'SimpleStrategy', 'replication_factor':2}"   
--compaction "{'class': 'SizeTieredCompactionStrategy'}"   --host 127.0.0.1
{noformat}
Flush and delete all SSTables in node1 :
{noformat}
ccm node1 nodetool flush
ccm node1 stop
rm -f ~/.ccm/inc-repair-issue/node1/data0/tlp_stress/sensor*/*.*
ccm node1 start{noformat}
Then throttle streaming throughput to 1MB/s so we have time to take node1 down 
during the streaming phase and run repair:
{noformat}
ccm node1 nodetool setstreamthroughput 1
ccm node2 nodetool setstreamthroughput 1
ccm node3 nodetool setstreamthroughput 1
ccm node1 nodetool repair tlp_stress
{noformat}
Once streaming starts, shut down node1 and start it again :
{noformat}
ccm node1 stop
ccm node1 start
{noformat}
Run repair again :
{noformat}
ccm node1 nodetool repair tlp_stress
{noformat}
The command will return very quickly, showing that it skipped all sstables :
{noformat}
[2018-08-31 19:05:16,292] Repair completed successfully
[2018-08-31 19:05:16,292] Repair command #1 finished in 2 seconds

$ ccm node1 nodetool status

Datacenter: datacenter1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  AddressLoad   Tokens   OwnsHost ID  
 Rack
UN  127.0.0.1  228,64 KiB  256  ?   
437dc9cd-b1a1-41a5-961e-cfc99763e29f  rack1
UN  127.0.0.2  60,09 MiB  256  ?   
fbcbbdbb-e32a-4716-8230-8ca59aa93e62  rack1
UN  127.0.0.3  57,59 MiB  256  ?   
a0b1bcc6-0fad-405a-b0bf-180a0ca31dd0  rack1
{noformat}
sstablemetadata will then show that nodes 2 and 3 have SSTables still in 
"pending repair" state :
{noformat}
~/.ccm/repository/gitCOLONtrunk/tools/bin/sstablemetadata na-4-big-Data.db | 
grep repair
SSTable: 
/Users/adejanovski/.ccm/inc-repair-4.0/node2/data0/tlp_stress/sensor_data-b7375660ad3111e8a0e59357ff9c9bda/na-4-big
Pending repair: 3844a400-ad33-11e8-b5a7-6b8dd8f31b62
{noformat}
Restarting these nodes wouldn't help either.
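
A quick way to check the whole cluster at once could be a loop like this (just 
a sketch, reusing the ccm paths from the steps above) :
{noformat}
# print the repair metadata of every SSTable on every node of the ccm cluster
for f in ~/.ccm/inc-repair-issue/node*/data0/tlp_stress/sensor*/*-Data.db
do
  echo "$f"
  ~/.ccm/repository/gitCOLONtrunk/tools/bin/sstablemetadata "$f" | grep -i repair
done
{noformat}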

  was:
The changes in CASSANDRA-9143 modified the way incremental repair performs by 
applying the following sequence of events : 
 * Anticompaction is executed on all replicas for all SSTables overlapping the 
repaired ranges
 * Anticompacted SSTables are then marked as "Pending repair" and cannot be 
compacted anymore, nor part of another repair session
 * Merkle trees are generated and compared
 * Streaming takes place if needed
 * Anticompaction is committed and "pending repair" tables are marked as 
repaired if it succeeded, or they are released if the repair session failed.

If the repair coordinator dies during the streaming phase, *the SSTables on the 
replicas will remain in "pending repair" state and will never be eligible for 
repair or compaction*, even after all the nodes in the cluster are restarted. 

Steps to reproduce (I've used Jason's 13938 branch that fixes streaming errors) 
: 
{noformat}
ccm create inc-repair-issue -v github:jasobrown/13938 -n 3

# Allow jmx access and remove all rpc_ settings in yaml
for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra-env.sh;
do
  sed -i'' -e 

[jira] [Created] (CASSANDRA-14685) Incremental repair 4.0 : SSTables remain locked forever if the coordinator dies during streaming

2018-08-31 Thread Alexander Dejanovski (JIRA)
Alexander Dejanovski created CASSANDRA-14685:


 Summary: Incremental repair 4.0 : SSTables remain locked forever 
if the coordinator dies during streaming 
 Key: CASSANDRA-14685
 URL: https://issues.apache.org/jira/browse/CASSANDRA-14685
 Project: Cassandra
  Issue Type: Bug
  Components: Repair
Reporter: Alexander Dejanovski


The changes in CASSANDRA-9143 modified the way incremental repair performs by 
applying the following sequence of events : 
 * Anticompaction is executed on all replicas for all SSTables overlapping the 
repaired ranges
 * Anticompacted SSTables are then marked as "Pending repair" and cannot be 
compacted anymore, nor part of another repair session
 * Merkle trees are generated and compared
 * Streaming takes place if needed
 * Anticompaction is committed and "pending repair" tables are marked as 
repaired if it succeeded, or they are released if the repair session failed.

If the repair coordinator dies during the streaming phase, *the SSTables on the 
replicas will remain in "pending repair" state and will never be eligible for 
repair or compaction*, even after all the nodes in the cluster are restarted. 

Steps to reproduce (I've used Jason's 13938 branch that fixes streaming errors) 
: 
{noformat}
ccm create inc-repair-issue -v github:jasobrown/13938 -n 3

# Allow jmx access and remove all rpc_ settings in yaml
for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra-env.sh;
do
  sed -i'' -e 
's/com.sun.management.jmxremote.authenticate=true/com.sun.management.jmxremote.authenticate=false/g'
 $f
done

for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra.yaml;
do
  grep -v "rpc_" $f > ${f}.tmp
  cat ${f}.tmp > $f
done

ccm start
{noformat}
I used [tlp-stress|https://github.com/thelastpickle/tlp-stress] to generate a 
few 10s of MBs of data (killed it after some time). Obviously cassandra-stress 
works as well :
{noformat}
bin/tlp-stress run BasicTimeSeries -i 1M -p 1M -t 2 --rate 5000  
--replication "{'class':'SimpleStrategy', 'replication_factor':2}"   
--compaction "{'class': 'SizeTieredCompactionStrategy'}"   --host 127.0.0.1
{noformat}
Flush and delete all SSTables in node1 :
{noformat}
ccm node1 nodetool flush
rm -f ~/.ccm/inc-repair-issue/node1/data0/tlp_stress/sensor*/*.*
{noformat}
Then throttle streaming throughput to 1MB/s so we have time to take node1 down 
during the streaming phase and run repair:
{noformat}
ccm node1 nodetool setstreamthroughput 1
ccm node2 nodetool setstreamthroughput 1
ccm node3 nodetool setstreamthroughput 1
ccm node1 nodetool repair tlp_stress
{noformat}
Once streaming starts, shut down node1 and start it again :
{noformat}
ccm node1 stop
ccm node1 start
{noformat}
Run repair again :
{noformat}
ccm node1 nodetool repair tlp_stress
{noformat}
The command will return very quickly, showing that it skipped all sstables :
{noformat}
[2018-08-31 19:05:16,292] Repair completed successfully
[2018-08-31 19:05:16,292] Repair command #1 finished in 2 seconds

$ ccm node1 nodetool status

Datacenter: datacenter1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  AddressLoad   Tokens   OwnsHost ID  
 Rack
UN  127.0.0.1  228,64 KiB  256  ?   
437dc9cd-b1a1-41a5-961e-cfc99763e29f  rack1
UN  127.0.0.2  60,09 MiB  256  ?   
fbcbbdbb-e32a-4716-8230-8ca59aa93e62  rack1
UN  127.0.0.3  57,59 MiB  256  ?   
a0b1bcc6-0fad-405a-b0bf-180a0ca31dd0  rack1
{noformat}
sstablemetadata will then show that nodes 2 and 3 have SSTables still in 
"pending repair" state :
{noformat}
~/.ccm/repository/gitCOLONtrunk/tools/bin/sstablemetadata na-4-big-Data.db | 
grep repair
SSTable: 
/Users/adejanovski/.ccm/inc-repair-4.0/node2/data0/tlp_stress/sensor_data-b7375660ad3111e8a0e59357ff9c9bda/na-4-big
Pending repair: 3844a400-ad33-11e8-b5a7-6b8dd8f31b62
{noformat}
Restarting these nodes wouldn't help either.






[jira] [Commented] (CASSANDRA-10399) Create default Stress tables without compact storage

2018-04-03 Thread Alexander Dejanovski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423969#comment-16423969
 ] 

Alexander Dejanovski commented on CASSANDRA-10399:
--

This ticket should be closed as CASSANDRA-10857 already removed the use of 
COMPACT STORAGE throughout the whole codebase for 4.0.

> Create default Stress tables without compact storage 
> -
>
> Key: CASSANDRA-10399
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10399
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Sebastian Estevez
>Assignee: mck
>Priority: Minor
>  Labels: stress
> Fix For: 4.x
>
>
> ~$ cassandra-stress write
> {code}
> cqlsh> desc TABLE keyspace1.standard1
> CREATE TABLE keyspace1.standard1 (
> key blob PRIMARY KEY,
> "C0" blob,
> "C1" blob,
> "C2" blob,
> "C3" blob,
> "C4" blob
> ) WITH COMPACT STORAGE
> AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
> AND comment = ''
> AND compaction = {'class': 
> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
> AND compression = {}
> AND dclocal_read_repair_chance = 0.1
> AND default_time_to_live = 0
> AND gc_grace_seconds = 864000
> AND max_index_interval = 2048
> AND memtable_flush_period_in_ms = 0
> AND min_index_interval = 128
> AND read_repair_chance = 0.0
> AND speculative_retry = 'NONE';
> {code}






[jira] [Commented] (CASSANDRA-14318) Fix query pager DEBUG log leak causing hit in paged reads throughput

2018-04-03 Thread Alexander Dejanovski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423964#comment-16423964
 ] 

Alexander Dejanovski commented on CASSANDRA-14318:
--

Thanks for reviewing and merging [~pauloricardomg] !

CASSANDRA-10857 removed compact storage in trunk and the 
standard1 tables are no longer using it : 
[https://github.com/apache/cassandra/commit/07fbd8ee6042797aaade90357d625ba9d79c31e0#diff-e5d5cb263c5c84c322cd09391af46d7dL141]
 

> Fix query pager DEBUG log leak causing hit in paged reads throughput
> 
>
> Key: CASSANDRA-14318
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14318
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Alexander Dejanovski
>Assignee: Alexander Dejanovski
>Priority: Major
>  Labels: lhf, performance
> Fix For: 2.2.13
>
> Attachments: cassandra-2.2-debug.yaml, debuglogging.png, flame22 
> nodebug sjk svg.png, flame22-nodebug-sjk.svg, flame22-sjk.svg, 
> flame_graph_snapshot.png
>
>
> Debug logging can, in many cases (especially very low latency ones), add a 
> significant overhead on the read path in 2.2, as we've seen when upgrading 
> clusters from 2.0 to 2.2.
> The performance impact was especially noticeable on the client side metrics, 
> where p99 could go up to 10 times higher, while ClientRequest metrics 
> recorded by Cassandra didn't show any overhead.
> Below shows latencies recorded on the client side with debug logging on 
> first, and then without it :
> !debuglogging.png!  
> We generated a flame graph before turning off debug logging that shows the 
> read call stack is dominated by debug logging : 
> !flame_graph_snapshot.png!
> I've attached the original flame graph for exploration.
> Once disabled, the new flame graph shows that the read call stack gets 
> extremely thin, which is further confirmed by client recorded metrics : 
> !flame22 nodebug sjk svg.png!
> The query pager code has been reworked since 3.0 and it looks like 
> log.debug() calls are gone there, but for 2.2 users and to prevent such 
> issues to appear with default settings, I really think debug logging should 
> be disabled by default.






[jira] [Commented] (CASSANDRA-14346) Scheduled Repair in Cassandra

2018-03-30 Thread Alexander Dejanovski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420677#comment-16420677
 ] 

Alexander Dejanovski commented on CASSANDRA-14346:
--

I was told that my comments sounded like I'm strongly opposed to this ticket, 
which is absolutely not the case, so I'll sum up my thoughts here : 
 * Coordinated repair is a must have and should be the first thing that's 
implemented
 * Scheduling and (especially) auto scheduling will require more thought and 
discussion IMHO, at least as long as incremental repair has not proved to be 
bulletproof in 4.0 (we still have to see it running in production for a while). 
Once we can repair any table/keyspace in just a few minutes, things will be 
very different.
 * Based on what the Apache Cassandra project went through with new features 
lately, I wouldn't rush into implementing all of this by default and would 
rather take a more cautious approach for 4.0.

On a side note, because one might think I'm biased in that conversation (hum 
monologue so far), removing boilerplate from Reaper to have some features like 
computing the splits or coordinating the repair jobs handled by Cassandra 
internally would actually make me VERY happy.

> Scheduled Repair in Cassandra
> -
>
> Key: CASSANDRA-14346
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14346
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Repair
>Reporter: Joseph Lynch
>Priority: Major
>  Labels: CommunityFeedbackRequested
> Fix For: 4.0
>
> Attachments: ScheduledRepairV1_20180327.pdf
>
>
> There have been many attempts to automate repair in Cassandra, which makes 
> sense given that it is necessary to give our users eventual consistency. Most 
> recently CASSANDRA-10070, CASSANDRA-8911 and CASSANDRA-13924 have all looked 
> for ways to solve this problem.
> At Netflix we've built a scheduled repair service within Priam (our sidecar), 
> which we spoke about last year at NGCC. Given the positive feedback at NGCC 
> we focussed on getting it production ready and have now been using it in 
> production to repair hundreds of clusters, tens of thousands of nodes, and 
> petabytes of data for the past six months. Also based on feedback at NGCC we 
> have invested effort in figuring out how to integrate this natively into 
> Cassandra rather than open sourcing it as an external service (e.g. in Priam).
> As such, [~vinaykumarcse] and I would like to re-work and merge our 
> implementation into Cassandra, and have created a [design 
> document|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit?usp=sharing]
>  showing how we plan to make it happen, including the user interface.
> As we work on the code migration from Priam to Cassandra, any feedback would 
> be greatly appreciated about the interface or v1 implementation features. I 
> have tried to call out in the document features which we explicitly consider 
> future work (as well as a path forward to implement them in the future) 
> because I would very much like to get this done before the 4.0 merge window 
> closes, and to do that I think aggressively pruning scope is going to be a 
> necessity.






[jira] [Commented] (CASSANDRA-14346) Scheduled Repair in Cassandra

2018-03-30 Thread Alexander Dejanovski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420577#comment-16420577
 ] 

Alexander Dejanovski commented on CASSANDRA-14346:
--

Two other issues with automated scheduling of repairs would be : 
 * Rolling upgrades : All repairs would have to be terminated and schedules 
stopped as soon as the cluster is running mixed versions
 * Expansion to new DCs : if repair triggers during the expansion to a new DC 
before the rebuild has fully ended on all nodes, the cluster will be crushed by 
the entropy that repair will find. Since many users will not be aware that the 
cluster is constantly repairing itself, this is likely to happen a lot.

The latter could be mitigated if a rebuild were detected and appropriate 
measures taken. I'm not sure how we could detect this flawlessly though, and 
there would still be many cases where the cluster has been expanded but the 
rebuild isn't started right after.

It could be argued that any scheduled repair system is subject to the same 
caveats, but the difference is that those systems are set up by a user, not by 
the database itself, which should then be responsible for protecting itself 
against such scenarios.

> Scheduled Repair in Cassandra
> -
>
> Key: CASSANDRA-14346
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14346
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Repair
>Reporter: Joseph Lynch
>Priority: Major
>  Labels: CommunityFeedbackRequested
> Fix For: 4.0
>
> Attachments: ScheduledRepairV1_20180327.pdf
>
>
> There have been many attempts to automate repair in Cassandra, which makes 
> sense given that it is necessary to give our users eventual consistency. Most 
> recently CASSANDRA-10070, CASSANDRA-8911 and CASSANDRA-13924 have all looked 
> for ways to solve this problem.
> At Netflix we've built a scheduled repair service within Priam (our sidecar), 
> which we spoke about last year at NGCC. Given the positive feedback at NGCC 
> we focussed on getting it production ready and have now been using it in 
> production to repair hundreds of clusters, tens of thousands of nodes, and 
> petabytes of data for the past six months. Also based on feedback at NGCC we 
> have invested effort in figuring out how to integrate this natively into 
> Cassandra rather than open sourcing it as an external service (e.g. in Priam).
> As such, [~vinaykumarcse] and I would like to re-work and merge our 
> implementation into Cassandra, and have created a [design 
> document|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit?usp=sharing]
>  showing how we plan to make it happen, including the user interface.
> As we work on the code migration from Priam to Cassandra, any feedback would 
> be greatly appreciated about the interface or v1 implementation features. I 
> have tried to call out in the document features which we explicitly consider 
> future work (as well as a path forward to implement them in the future) 
> because I would very much like to get this done before the 4.0 merge window 
> closes, and to do that I think aggressively pruning scope is going to be a 
> necessity.






[jira] [Commented] (CASSANDRA-14346) Scheduled Repair in Cassandra

2018-03-30 Thread Alexander Dejanovski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420504#comment-16420504
 ] 

Alexander Dejanovski commented on CASSANDRA-14346:
--

I really like the idea of making repair something that is coordinated by the 
cluster instead of being node centric as it is today.
This is how it should be implemented, and external tools should only add 
features on top of this. nodetool really should be doing this by default.
Overall I agree with the state machine that is detailed (I haven't spent that 
much time on it though...)

I disagree with point 6 of the doc's Resiliency section, which says that adding 
nodes won't impact the repair : it will change the token ranges, and some of 
the splits will then spread across different replicas, which makes them 
unsuitable for repair (think of clusters with 256 vnodes per node).
You either have to cancel the repair or recompute the remaining splits to move 
on with the job.

I would add a feature to your nodetool repairstatus command that allows listing 
only the currently running repairs.

Then I think the approach of implementing a fully automated, seamless, 
continuous repair "that just works" without user intervention is unsafe in the 
wild: there are too many caveats.
There are many different types of cluster out there and some of them just 
cannot run repair without careful tuning or monitoring (if at all).
The current design shows no backpressure mechanism to ensure that further 
running sequences won't harm a cluster that is already running late on 
compactions (be it due to overstreaming or entropy, or just the activity of 
the cluster).
Repairing table by table will add a lot of overhead compared to repairing a 
list of tables (or all of them) in a single session, unless multiple repairs 
at once on a node are allowed, which won't allow safely terminating a single 
repair.
It is also unclear in the current design whether repair can be disabled for 
select tables, for example (like "type: none").
The proposal doesn't seem to involve any change to how "nodetool repair" 
behaves. Will it be changed to use the state machine and coordinate throughout 
the cluster ?

Trying to replace external tools with built-in features has its limits I think, 
and currently the design gives only limited control to such external tools (be 
it Reaper, the Datastax repair service, Priam, ...).
To make an analogy that was seen recently on the ML, it's as if you implemented 
automatic spreading of configuration changes from within Cassandra instead of 
relying on tools like Chef or Puppet.
You'll still need global tools to manage repairs over several clusters anyway, 
which a Cassandra built-in feature cannot (and should not) provide.

My point is that making repair smarter and coordinated within Cassandra is a 
great idea and I support it 100%, but the current design makes it too automated 
and the defaults could easily lead to severe performance problems without the 
user triggering anything.
I also don't know how it could be made to work alongside user defined repairs, 
as you'll need to force terminate some sessions.

To summarize, I would put aside the scheduling features and implement the 
coordinated repairs by splits within Cassandra. The StorageServiceMBean should 
evolve to allow manually setting the number of splits per node, or rely on a 
number of splits generated by Cassandra itself.
Then it should also be possible to track progress externally by listing splits 
(sequences) through JMX, and to pause/resume select repair runs.

Also, the current design should evolve to allow a single sequence to include 
multiple token ranges. We have that feature waiting to be merged in Reaper to 
group token ranges that have the same replicas, in order to reduce the overhead 
of vnodes (see the sketch below).
Starting with 3.0, repair jobs can be triggered with multiple token ranges that 
will be executed as a single session if the replicas are the same for all. So, 
to prevent having to change the data model in the future, I'd suggest storing a 
list of token ranges instead of just one.
Repair events should also be tracked in a separate table to avoid overwriting 
the last event each time (one thing Reaper currently sucks at as well).
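
To illustrate the grouping : the replicas of each range can be listed per 
keyspace, and ranges sharing the exact same set of endpoints can then be 
submitted together as a single session (sketch only, the keyspace name is a 
placeholder) :
{noformat}
# print every token range of the keyspace along with its replica endpoints;
# ranges with identical endpoint sets are candidates for a single repair session
nodetool describering my_keyspace
{noformat}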

I'll go back to the document soon and add my comments there.

 

Cheers

> Scheduled Repair in Cassandra
> -
>
> Key: CASSANDRA-14346
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14346
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Repair
>Reporter: Joseph Lynch
>Priority: Major
>  Labels: CommunityFeedbackRequested
> Fix For: 4.0
>
> Attachments: ScheduledRepairV1_20180327.pdf
>
>
> There have been many attempts to automate repair in Cassandra, which makes 
> sense given that it is necessary to give our users eventual 

[jira] [Commented] (CASSANDRA-14318) Debug logging can create massive performance issues

2018-03-27 Thread Alexander Dejanovski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415923#comment-16415923
 ] 

Alexander Dejanovski commented on CASSANDRA-14318:
--

For the record, the same tests on 3.11.2 didn't show any notable performance 
difference between debug on and off : 

Cassandra 3.11.2 debug on : 
{noformat}
Results:
Op rate : 18 777 op/s [read_event_1: 3 165 op/s, read_event_2: 3 109 op/s, 
read_event_3: 12 562 op/s]
Partition rate : 6 215 pk/s [read_event_1: 3 165 pk/s, read_event_2: 3 109 
pk/s, read_event_3: 0 pk/s]
Row rate : 6 215 row/s [read_event_1: 3 165 row/s, read_event_2: 3 109 row/s, 
read_event_3: 0 row/s]
Latency mean : 6,7 ms [read_event_1: 6,7 ms, read_event_2: 6,7 ms, 
read_event_3: 6,6 ms]
Latency median : 5,0 ms [read_event_1: 5,0 ms, read_event_2: 5,0 ms, 
read_event_3: 4,9 ms]
Latency 95th percentile : 15,6 ms [read_event_1: 15,5 ms, read_event_2: 15,9 
ms, read_event_3: 15,5 ms]
Latency 99th percentile : 43,3 ms [read_event_1: 42,7 ms, read_event_2: 44,2 
ms, read_event_3: 43,2 ms]
Latency 99.9th percentile : 82,0 ms [read_event_1: 80,3 ms, read_event_2: 82,4 
ms, read_event_3: 82,1 ms]
Latency max : 272,4 ms [read_event_1: 272,4 ms, read_event_2: 268,7 ms, 
read_event_3: 245,1 ms]
Total partitions : 330 970 [read_event_1: 165 386, read_event_2: 165 584, 
read_event_3: 0]
Total errors : 0 [read_event_1: 0, read_event_2: 0, read_event_3: 0]
Total GC count : 42
Total GC memory : 13,102 GiB
Total GC time : 1,8 seconds
Avg GC time : 42,4 ms
StdDev GC time : 1,3 ms
Total operation time : 00:00:53{noformat}
 


Cassandra 3.11.2 debug off : 
{noformat}
Results:
Op rate : 18 853 op/s [read_event_1: 3 138 op/s, read_event_2: 3 137 op/s, 
read_event_3: 12 578 op/s]
Partition rate : 6 275 pk/s [read_event_1: 3 138 pk/s, read_event_2: 3 137 
pk/s, read_event_3: 0 pk/s]
Row rate : 6 275 row/s [read_event_1: 3 138 row/s, read_event_2: 3 137 row/s, 
read_event_3: 0 row/s]
Latency mean : 6,7 ms [read_event_1: 6,7 ms, read_event_2: 6,7 ms, 
read_event_3: 6,7 ms]
Latency median : 5,0 ms [read_event_1: 5,1 ms, read_event_2: 5,1 ms, 
read_event_3: 5,0 ms]
Latency 95th percentile : 15,5 ms [read_event_1: 15,5 ms, read_event_2: 15,6 
ms, read_event_3: 15,4 ms]
Latency 99th percentile : 39,9 ms [read_event_1: 41,0 ms, read_event_2: 39,6 
ms, read_event_3: 39,6 ms]
Latency 99.9th percentile : 73,3 ms [read_event_1: 73,4 ms, read_event_2: 71,6 
ms, read_event_3: 73,6 ms]
Latency max : 367,0 ms [read_event_1: 240,5 ms, read_event_2: 250,3 ms, 
read_event_3: 367,0 ms]
Total partitions : 332 852 [read_event_1: 166 447, read_event_2: 166 405, 
read_event_3: 0]
Total errors : 0 [read_event_1: 0, read_event_2: 0, read_event_3: 0]
Total GC count : 46
Total GC memory : 14,024 GiB
Total GC time : 2,0 seconds
Avg GC time : 42,7 ms
StdDev GC time : 3,9 ms
Total operation time : 00:00:53{noformat}
The improvement over 2.2 is nice though :)
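
For anyone wanting to reproduce this kind of comparison, a user mode run along 
these lines can be used (a sketch only : the profile is the attached 
cassandra-2.2-debug.yaml, while the ops mix, duration and thread count are 
illustrative) :
{noformat}
cassandra-stress user profile=cassandra-2.2-debug.yaml \
  ops\(read_event_1=1,read_event_2=1,read_event_3=4\) duration=5m \
  -rate threads=50 -node 127.0.0.1
{noformat}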

 

> Debug logging can create massive performance issues
> ---
>
> Key: CASSANDRA-14318
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14318
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Alexander Dejanovski
>Assignee: Alexander Dejanovski
>Priority: Major
>  Labels: lhf, performance
> Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x
>
> Attachments: cassandra-2.2-debug.yaml, debuglogging.png, flame22 
> nodebug sjk svg.png, flame22-nodebug-sjk.svg, flame22-sjk.svg, 
> flame_graph_snapshot.png
>
>
> Debug logging can, in many cases (especially very low latency ones), add a 
> significant overhead on the read path in 2.2, as we've seen when upgrading 
> clusters from 2.0 to 2.2.
> The performance impact was especially noticeable on the client side metrics, 
> where p99 could go up to 10 times higher, while ClientRequest metrics 
> recorded by Cassandra didn't show any overhead.
> Below shows latencies recorded on the client side with debug logging on 
> first, and then without it :
> !debuglogging.png!  
> We generated a flame graph before turning off debug logging that shows the 
> read call stack is dominated by debug logging : 
> !flame_graph_snapshot.png!
> I've attached the original flame graph for exploration.
> Once disabled, the new flame graph shows that the read call stack gets 
> extremely thin, which is further confirmed by client recorded metrics : 
> !flame22 nodebug sjk svg.png!
> The query pager code has been reworked since 3.0 and it looks like 
> log.debug() calls are gone there, but for 2.2 users and to prevent such 
> issues to appear with default settings, I really think debug logging should 
> be disabled by default.




[jira] [Comment Edited] (CASSANDRA-14318) Debug logging can create massive performance issues

2018-03-27 Thread Alexander Dejanovski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415846#comment-16415846
 ] 

Alexander Dejanovski edited comment on CASSANDRA-14318 at 3/27/18 4:16 PM:
---

[~jjirsa]: apparently the ReadCallback class already logs at TRACE and not 
DEBUG on the latest 2.2.

I've created the fix that downgrades debug logging to trace logging in the 
query pager classes, and here are the results : 

debug on - no fix :
{noformat}
Results:
op rate : 6681 [read_event_1:1109, read_event_2:1119, read_event_3:4452]
partition rate : 6681 [read_event_1:1109, read_event_2:1119, read_event_3:4452]
row rate : 6681 [read_event_1:1109, read_event_2:1119, read_event_3:4452]
latency mean : 19,1 [read_event_1:15,4, read_event_2:15,4, read_event_3:21,0]
latency median : 15,6 [read_event_1:14,2, read_event_2:14,0, read_event_3:16,3]
latency 95th percentile : 39,1 [read_event_1:28,4, read_event_2:28,6, 
read_event_3:44,2]
latency 99th percentile : 75,6 [read_event_1:52,9, read_event_2:53,6, 
read_event_3:87,7]
latency 99.9th percentile : 315,7 [read_event_1:101,0, read_event_2:110,1, 
read_event_3:361,1]
latency max : 609,1 [read_event_1:319,6, read_event_2:315,9, read_event_3:609,1]
Total partitions : 993050 [read_event_1:164882, read_event_2:166381, 
read_event_3:661787]
Total errors : 0 [read_event_1:0, read_event_2:0, read_event_3:0]
total gc count : 189
total gc mb : 56464
total gc time (s) : 7
avg gc time(ms) : 37
stdev gc time(ms) : 8
Total operation time : 00:02:28{noformat}
 

 

debug off - no fix :
{noformat}
Results:
op rate : 12655 [read_event_1:2141, read_event_2:2093, read_event_3:8422]
partition rate : 12655 [read_event_1:2141, read_event_2:2093, read_event_3:8422]
row rate : 12655 [read_event_1:2141, read_event_2:2093, read_event_3:8422]
latency mean : 10,1 [read_event_1:10,1, read_event_2:10,1, read_event_3:10,1]
latency median : 9,2 [read_event_1:9,2, read_event_2:9,2, read_event_3:9,3]
latency 95th percentile : 15,2 [read_event_1:15,8, read_event_2:15,9, 
read_event_3:15,7]
latency 99th percentile : 29,3 [read_event_1:44,5, read_event_2:45,1, 
read_event_3:41,3]
latency 99.9th percentile : 52,7 [read_event_1:67,9, read_event_2:66,9, 
read_event_3:67,1]
latency max : 268,0 [read_event_1:257,1, read_event_2:263,3, read_event_3:268,0]
Total partitions : 983056 [read_event_1:166311, read_event_2:162570, 
read_event_3:654175]
Total errors : 0 [read_event_1:0, read_event_2:0, read_event_3:0]
total gc count : 100
total gc mb : 31529
total gc time (s) : 4
avg gc time(ms) : 37
stdev gc time(ms) : 5
Total operation time : 00:01:17{noformat}
 

 

debug on - with fix :
{noformat}
Results:
op rate : 12289 [read_event_1:2058, read_event_2:2051, read_event_3:8181]
partition rate : 12289 [read_event_1:2058, read_event_2:2051, read_event_3:8181]
row rate : 12289 [read_event_1:2058, read_event_2:2051, read_event_3:8181]
latency mean : 10,4 [read_event_1:10,4, read_event_2:10,4, read_event_3:10,4]
latency median : 9,4 [read_event_1:9,4, read_event_2:9,4, read_event_3:9,4]
latency 95th percentile : 16,3 [read_event_1:16,8, read_event_2:17,3, 
read_event_3:16,2]
latency 99th percentile : 36,6 [read_event_1:44,3, read_event_2:46,6, 
read_event_3:37,2]
latency 99.9th percentile : 62,2 [read_event_1:78,0, read_event_2:77,1, 
read_event_3:80,8]
latency max : 251,2 [read_event_1:246,9, read_event_2:249,9, read_event_3:251,2]
Total partitions : 1000000 [read_event_1:167422, read_event_2:166861, 
read_event_3:665717]
Total errors : 0 [read_event_1:0, read_event_2:0, read_event_3:0]
total gc count : 102
total gc mb : 31843
total gc time (s) : 4
avg gc time(ms) : 38
stdev gc time(ms) : 6
Total operation time : 00:01:21{noformat}
 

 

So with the fix, performance with debug logging on is similar to performance 
with debug logging off.
 The difference in throughput is pretty massive, as we roughly get *twice the 
read throughput* with the fix.

Latencies without the fix and with the fix : 

p95 : 35ms -> 16ms
 p99 : 75ms -> 36ms

I've run all tests several times, alternating with and without the fix to make 
sure caches were not making a difference, and results were consistent with 
what's pasted above.
 Everything ran on a single node, using an i3.xlarge instance for Cassandra 
and a separate i3.large instance for running cassandra-stress.

 

*One pretty interesting thing to note* is that when I tested with the 
predefined mode of cassandra-stress, no paging occurred and the performance 
difference was not noticeable. This is due to the fact that the predefined mode 
generates COMPACT STORAGE tables, which involve a different read path 
(apparently). I think anyone performing benchmarks for Cassandra changes should 
be aware that the predefined mode isn't relevant and that a user defined test 
should be used (maybe we should create one that would be used as standard 
benchmark). 
 Here's the one I used : 

[jira] [Commented] (CASSANDRA-14318) Debug logging can create massive performance issues

2018-03-27 Thread Alexander Dejanovski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415846#comment-16415846
 ] 

Alexander Dejanovski commented on CASSANDRA-14318:
--

[~jjirsa]: apparently the ReadCallback class already logs at TRACE and not 
DEBUG on the latest 2.2.

I've created the fix that downgrades debug logging to trace logging in the 
query pager classes, and here are the results : 

debug on - no fix : 
{noformat}
Results:
op rate : 6681 [read_event_1:1109, read_event_2:1119, read_event_3:4452]
partition rate : 6681 [read_event_1:1109, read_event_2:1119, read_event_3:4452]
row rate : 6681 [read_event_1:1109, read_event_2:1119, read_event_3:4452]
latency mean : 19,1 [read_event_1:15,4, read_event_2:15,4, read_event_3:21,0]
latency median : 15,6 [read_event_1:14,2, read_event_2:14,0, read_event_3:16,3]
latency 95th percentile : 39,1 [read_event_1:28,4, read_event_2:28,6, 
read_event_3:44,2]
latency 99th percentile : 75,6 [read_event_1:52,9, read_event_2:53,6, 
read_event_3:87,7]
latency 99.9th percentile : 315,7 [read_event_1:101,0, read_event_2:110,1, 
read_event_3:361,1]
latency max : 609,1 [read_event_1:319,6, read_event_2:315,9, read_event_3:609,1]
Total partitions : 993050 [read_event_1:164882, read_event_2:166381, 
read_event_3:661787]
Total errors : 0 [read_event_1:0, read_event_2:0, read_event_3:0]
total gc count : 189
total gc mb : 56464
total gc time (s) : 7
avg gc time(ms) : 37
stdev gc time(ms) : 8
Total operation time : 00:02:28{noformat}
 

 

debug off - no fix : 
{noformat}
Results:
op rate : 12655 [read_event_1:2141, read_event_2:2093, read_event_3:8422]
partition rate : 12655 [read_event_1:2141, read_event_2:2093, read_event_3:8422]
row rate : 12655 [read_event_1:2141, read_event_2:2093, read_event_3:8422]
latency mean : 10,1 [read_event_1:10,1, read_event_2:10,1, read_event_3:10,1]
latency median : 9,2 [read_event_1:9,2, read_event_2:9,2, read_event_3:9,3]
latency 95th percentile : 15,2 [read_event_1:15,8, read_event_2:15,9, 
read_event_3:15,7]
latency 99th percentile : 29,3 [read_event_1:44,5, read_event_2:45,1, 
read_event_3:41,3]
latency 99.9th percentile : 52,7 [read_event_1:67,9, read_event_2:66,9, 
read_event_3:67,1]
latency max : 268,0 [read_event_1:257,1, read_event_2:263,3, read_event_3:268,0]
Total partitions : 983056 [read_event_1:166311, read_event_2:162570, 
read_event_3:654175]
Total errors : 0 [read_event_1:0, read_event_2:0, read_event_3:0]
total gc count : 100
total gc mb : 31529
total gc time (s) : 4
avg gc time(ms) : 37
stdev gc time(ms) : 5
Total operation time : 00:01:17{noformat}
 

 

debug on - with fix : 
{noformat}
Results:
op rate : 12289 [read_event_1:2058, read_event_2:2051, read_event_3:8181]
partition rate : 12289 [read_event_1:2058, read_event_2:2051, read_event_3:8181]
row rate : 12289 [read_event_1:2058, read_event_2:2051, read_event_3:8181]
latency mean : 10,4 [read_event_1:10,4, read_event_2:10,4, read_event_3:10,4]
latency median : 9,4 [read_event_1:9,4, read_event_2:9,4, read_event_3:9,4]
latency 95th percentile : 16,3 [read_event_1:16,8, read_event_2:17,3, 
read_event_3:16,2]
latency 99th percentile : 36,6 [read_event_1:44,3, read_event_2:46,6, 
read_event_3:37,2]
latency 99.9th percentile : 62,2 [read_event_1:78,0, read_event_2:77,1, 
read_event_3:80,8]
latency max : 251,2 [read_event_1:246,9, read_event_2:249,9, read_event_3:251,2]
Total partitions : 1000000 [read_event_1:167422, read_event_2:166861, 
read_event_3:665717]
Total errors : 0 [read_event_1:0, read_event_2:0, read_event_3:0]
total gc count : 102
total gc mb : 31843
total gc time (s) : 4
avg gc time(ms) : 38
stdev gc time(ms) : 6
Total operation time : 00:01:21{noformat}
 

 

So with the fix, performance with debug logging on is similar to performance 
with debug logging off.
The difference in throughput is pretty massive, as we roughly get *twice the 
read throughput* with the fix.

Latencies without the fix and with the fix : 

p95 : 35ms -> 16ms
p99 : 75ms -> 36ms

I've run all tests several times, alternating with and without the fix to make 
sure caches were not making a difference, and results were consistent with 
what's pasted above.
Everything ran on a single node, using an i3.xlarge instance for Cassandra 
and a separate i3.large instance for running cassandra-stress.

 

*One pretty interesting thing to note* is that when I tested with the 
predefined mode of cassandra-stress, no paging occurred and the performance 
difference was not noticeable. This is due to the fact that the predefined mode 
generates COMPACT STORAGE tables, which involve a different read path 
(apparently). I think anyone performing benchmarks for Cassandra changes should 
be aware that the predefined mode isn't relevant and that a user defined test 
should be used (maybe we should create one that would be used as standard 
benchmark). 
Here's the one I used : [^cassandra-2.2-debug.yaml]

With the following commands for 

[jira] [Updated] (CASSANDRA-14318) Debug logging can create massive performance issues

2018-03-27 Thread Alexander Dejanovski (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-14318:
-
Attachment: cassandra-2.2-debug.yaml

> Debug logging can create massive performance issues
> ---
>
> Key: CASSANDRA-14318
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14318
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Alexander Dejanovski
>Assignee: Alexander Dejanovski
>Priority: Major
>  Labels: lhf, performance
> Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x
>
> Attachments: cassandra-2.2-debug.yaml, debuglogging.png, flame22 
> nodebug sjk svg.png, flame22-nodebug-sjk.svg, flame22-sjk.svg, 
> flame_graph_snapshot.png
>
>
> Debug logging can, in many cases (especially very low latency ones), add a 
> significant overhead on the read path in 2.2, as we've seen when upgrading 
> clusters from 2.0 to 2.2.
> The performance impact was especially noticeable on the client side metrics, 
> where p99 could go up to 10 times higher, while ClientRequest metrics 
> recorded by Cassandra didn't show any overhead.
> Below shows latencies recorded on the client side with debug logging on 
> first, and then without it :
> !debuglogging.png!  
> We generated a flame graph before turning off debug logging that shows the 
> read call stack is dominated by debug logging : 
> !flame_graph_snapshot.png!
> I've attached the original flame graph for exploration.
> Once disabled, the new flame graph shows that the read call stack gets 
> extremely thin, which is further confirmed by client recorded metrics : 
> !flame22 nodebug sjk svg.png!
> The query pager code has been reworked since 3.0 and it looks like 
> log.debug() calls are gone there, but for 2.2 users and to prevent such 
> issues to appear with default settings, I really think debug logging should 
> be disabled by default.






[jira] [Commented] (CASSANDRA-14326) Handle verbose logging at a different level than DEBUG

2018-03-20 Thread Alexander Dejanovski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16406721#comment-16406721
 ] 

Alexander Dejanovski commented on CASSANDRA-14326:
--

 

I agree it would indeed be nice to keep incremental logging levels, so that 
verbose contains info + verbose, and debug contains info + verbose + debug, but 
then we would have to make 2 changes to enable debug logging at will : 
 * change the org.apache.cassandra logger level from INFO to DEBUG
 * uncomment the ASYNCDEBUGLOG appender

Otherwise :
 * if the appender is there, we always have something written to 
debug.log (all the INFO level stuff)
 * and if o.a.c is at DEBUG all the time, any call to logger.debug() will have 
to be in a conditional block to avoid the performance penalty of evaluating the 
calls only to have the appender filter the debug output back out.

Unless there's a better way of achieving this ?

> Handle verbose logging at a different level than DEBUG
> --
>
> Key: CASSANDRA-14326
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14326
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Alexander Dejanovski
>Priority: Major
> Fix For: 4.x
>
>
> CASSANDRA-10241 introduced debug logging turned on by default to act as a 
> verbose system.log and help troubleshoot production issues. 
> One of the consequence was to severely affect read performance in 2.2 as 
> contributors weren't all up to speed on how to use logging levels 
> (CASSANDRA-14318).
> As DEBUG level has a very specific meaning in dev, it is confusing to use it 
> for always on verbose logging and should probably not be used this way in 
> Cassandra.
> Options so far are :
>  # Bring back common log messages to INFO level (compactions, flushes, etc...) 
> and disable debug logging by default
>  # Use files named verbose-system.log instead of debug.log and use a 
> custom logging level instead of DEBUG for verbose tracing, that would be 
> enabled by default. Debug logging would still exist and be disabled by 
> default in the root logger (not just filtered at the appender level).





