[jira] [Commented] (CASSANDRA-16245) Implement repair quality test scenarios
[ https://issues.apache.org/jira/browse/CASSANDRA-16245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309294#comment-17309294 ] Alexander Dejanovski commented on CASSANDRA-16245: -- [~e.dimitrova], this work hasn't been formally reviewed. There's some flakiness [in the CI runs|https://app.circleci.com/pipelines/github/riptano/cassandra-rtest?branch=trunk], which is due to transient Medusa and S3 download failures. This is being addressed in Medusa itself and should be fixed shortly [with this PR|https://github.com/thelastpickle/cassandra-medusa/pull/295]. It would be good to have someone validate that the test scenarios were implemented according to the description so we can close this ticket. Then we'll work on having this work integrated into Cassandra (or as a subproject) post-4.0. > Implement repair quality test scenarios > --- > > Key: CASSANDRA-16245 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16245 > Project: Cassandra > Issue Type: Task > Components: Test/dtest/java > Reporter: Alexander Dejanovski > Assignee: Radovan Zvoncek > Priority: Normal > Fix For: 4.0-rc > > > Implement the following test scenarios in a new test suite for repair integration testing under significant load: > Generate/restore a workload of ~100GB per node. Medusa should be considered to create the initial backup, which could then be restored from an S3 bucket to speed up node population. > Data should deliberately require repair and be generated accordingly. > Perform repairs on a 3-node cluster with 4 cores and 16-32GB RAM each (m5d.xlarge instances would be the most cost-efficient type). > Repaired keyspaces will use RF=3 or, in some cases, RF=2 (the latter is for subranges with different sets of replicas). 
> ||Mode||Version||Settings||Checks||
> |Full repair|trunk|Sequential + all token ranges|No anticompaction (repairedAt == 0); out of sync ranges > 0; subsequent run must show no out-of-sync range|
> |Full repair|trunk|Parallel + primary range|No anticompaction (repairedAt == 0); out of sync ranges > 0; subsequent run must show no out-of-sync range|
> |Full repair|trunk|Force-terminate repair shortly after it was triggered|Repair threads must be cleaned up|
> |Subrange repair|trunk|Sequential + single token range|No anticompaction (repairedAt == 0); out of sync ranges > 0; subsequent run must show no out-of-sync range|
> |Subrange repair|trunk|Parallel + 10 token ranges which have the same replicas|No anticompaction (repairedAt == 0); out of sync ranges > 0; subsequent run must show no out-of-sync range; a single repair session will handle all subranges at once|
> |Subrange repair|trunk|Parallel + 10 token ranges which have different replicas|No anticompaction (repairedAt == 0); out of sync ranges > 0; subsequent run must show no out-of-sync range; more than one repair session is triggered to process all subranges|
> |Incremental repair|trunk|Parallel (mandatory); no compaction during repair|Anticompaction status (repairedAt != 0) on all SSTables; no pending repair on SSTables after completion (may require waiting a bit, as this happens asynchronously); out of sync ranges > 0; subsequent run must show no out-of-sync range|
> |Incremental repair|trunk|Parallel (mandatory); major compaction triggered during repair|Anticompaction status (repairedAt != 0) on all SSTables; no pending repair on SSTables after completion (may require waiting a bit, as this happens asynchronously); out of sync ranges > 0; subsequent run must show no out-of-sync range|
> |Incremental repair|trunk|Force-terminate repair shortly after it was triggered|Repair threads must be cleaned up|
-- This message was sent by Atlassian Jira 
(v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
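The subrange scenarios in the table above hinge on splitting the token ring into contiguous ranges. As an illustration of the underlying arithmetic only (not the test suite's actual code), splitting the Murmur3 token space into N equal subranges, e.g. to drive `nodetool repair -st <start> -et <end>`, could look like this:

```python
# Illustrative sketch: split the Murmur3 token space into N contiguous
# subranges. This is not the test suite's code, just the arithmetic.

MIN_TOKEN = -(2 ** 63)       # Murmur3Partitioner minimum token
MAX_TOKEN = 2 ** 63 - 1      # Murmur3Partitioner maximum token

def split_token_range(n):
    """Return n contiguous (start, end) subranges covering the full ring."""
    width = (MAX_TOKEN - MIN_TOKEN) // n
    ranges = []
    start = MIN_TOKEN
    for i in range(n):
        # The last subrange absorbs the rounding remainder so the
        # whole ring is covered with no gap.
        end = MAX_TOKEN if i == n - 1 else start + width
        ranges.append((start, end))
        start = end
    return ranges

for start, end in split_token_range(10):
    print(f"nodetool repair --full -st {start} -et {end} <keyspace>")
```

The `<keyspace>` placeholder and the equal-width split are assumptions for illustration; the actual tests may pick subranges so that they share (or deliberately do not share) replica sets.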
[jira] [Commented] (CASSANDRA-16245) Implement repair quality test scenarios
[ https://issues.apache.org/jira/browse/CASSANDRA-16245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17305923#comment-17305923 ] Alexander Dejanovski commented on CASSANDRA-16245: -- Thanks [~cscotta]! We're done with this ticket. There was some instability in the CI runs lately due to CASSANDRA-16478, which was fixed with CASSANDRA-16480, and it has been passing nicely since then. > Implement repair quality test scenarios > --- > > Key: CASSANDRA-16245 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16245 > Project: Cassandra > Issue Type: Task > Components: Test/dtest/java > Reporter: Alexander Dejanovski > Assignee: Radovan Zvoncek > Priority: Normal > Fix For: 4.0-rc
[jira] [Commented] (CASSANDRA-16480) cassandra-builds produce deb packages that require python 3.7
[ https://issues.apache.org/jira/browse/CASSANDRA-16480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17296005#comment-17296005 ] Alexander Dejanovski commented on CASSANDRA-16480: -- Thanks [~brandon.williams]! I've tested your branch by pushing [this commit|https://github.com/riptano/cassandra-rtest/commit/533b346584512133c782b39731d47c54fa1bb496] on previously failing 4.0 repair tests and [they passed successfully|https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/118/workflows/0ccec847-e942-4486-ad38-750a825b2e7a]. LGTM (y)
> cassandra-builds produce deb packages that require python 3.7 > - > > Key: CASSANDRA-16480 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16480 > Project: Cassandra > Issue Type: Bug > Components: Packaging > Reporter: Alexander Dejanovski > Assignee: Brandon Williams > Priority: Normal > Fix For: 4.0-beta > >
> Since the builds moved from depending on Python 2 to Python 3, the packages produced by the [cassandra-builds project|https://github.com/apache/cassandra-builds] expect Python 3.7 to be installed on the target systems:
> {noformat}
> $ sudo dpkg -i cassandra_4.0~beta5-20210303gitd29dd643df_all.deb
> (Reading database ... 117878 files and directories currently installed.)
> Preparing to unpack cassandra_4.0~beta5-20210303gitd29dd643df_all.deb ...
> Unpacking cassandra (4.0~beta5-20210303gitd29dd643df) over (4.0~beta5-20210303git25f3cf84f7) ...
> dpkg: dependency problems prevent configuration of cassandra:
> cassandra depends on python3 (>= 3.7~); however:
> Version of python3 on system is 3.6.7-1~18.04.
> dpkg: error processing package cassandra (--install):
> dependency problems - leaving unconfigured
> Processing triggers for systemd (237-3ubuntu10.38) ...
> Processing triggers for ureadahead (0.100.0-21) ...
> Errors were encountered while processing:
> cassandra{noformat}
> The [test docker images|https://github.com/apache/cassandra-builds/blob/trunk/docker/testing/ubuntu1910_j11.docker#L35-L36] ship with both py36 and py38, which allows the install to pass nicely, but on a vanilla Ubuntu Bionic system, only Python 3.6 is installed.
> We need to use Debian Buster images for builds that ship with python 3.6 so that the dependencies align with it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-16480) cassandra-builds produce deb packages that require python 3.7
[ https://issues.apache.org/jira/browse/CASSANDRA-16480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17294718#comment-17294718 ] Alexander Dejanovski commented on CASSANDRA-16480: -- Dropping support for Python 3.6, while it is still maintained and is the default in Bionic (which will be supported until 2023), doesn't seem like the right move. It would block folks from upgrading to 4.0 unless they upgrade their systems to Focal or install Python 3.7, which is not that trivial for everyone, especially with a large fleet and other software that relies on 3.6. I'd vote for option 1 or 2. > cassandra-builds produce deb packages that require python 3.7 > - > > Key: CASSANDRA-16480 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16480 > Project: Cassandra > Issue Type: Bug > Components: Packaging > Reporter: Alexander Dejanovski > Assignee: Brandon Williams > Priority: Normal > Fix For: 4.0-beta
[jira] [Created] (CASSANDRA-16480) cassandra-builds produce deb packages that require python 3.7
Alexander Dejanovski created CASSANDRA-16480: Summary: cassandra-builds produce deb packages that require python 3.7 Key: CASSANDRA-16480 URL: https://issues.apache.org/jira/browse/CASSANDRA-16480 Project: Cassandra Issue Type: Bug Reporter: Alexander Dejanovski
Since the builds moved from depending on python 2 to python 3, the packages that are produced by the [cassandra-builds project|https://github.com/apache/cassandra-builds] expect Python 3.7 to be installed on the target systems:
{noformat}
$ sudo dpkg -i cassandra_4.0~beta5-20210303gitd29dd643df_all.deb
(Reading database ... 117878 files and directories currently installed.)
Preparing to unpack cassandra_4.0~beta5-20210303gitd29dd643df_all.deb ...
Unpacking cassandra (4.0~beta5-20210303gitd29dd643df) over (4.0~beta5-20210303git25f3cf84f7) ...
dpkg: dependency problems prevent configuration of cassandra:
 cassandra depends on python3 (>= 3.7~); however:
  Version of python3 on system is 3.6.7-1~18.04.
dpkg: error processing package cassandra (--install):
 dependency problems - leaving unconfigured
Processing triggers for systemd (237-3ubuntu10.38) ...
Processing triggers for ureadahead (0.100.0-21) ...
Errors were encountered while processing:
 cassandra{noformat}
The [test docker images|https://github.com/apache/cassandra-builds/blob/trunk/docker/testing/ubuntu1910_j11.docker#L35-L36] ship with both py36 and py38, which allows the install to pass nicely, but on a vanilla Ubuntu Bionic system, only Python 3.6 is installed. We need to use debian buster images for builds that ship with python 3.6 so that the dependencies align with it.
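The version constraint is the crux of the failure: under Debian's version ordering, the installed `3.6.7-1~18.04` sorts below the required `3.7~`, so Bionic's interpreter can never satisfy the generated dependency. A sketch of the relevant ordering rules (a simplified comparator, not dpkg's full algorithm, which also handles epochs and revision splitting):

```python
# Simplified sketch of Debian version ordering: digit runs compare
# numerically, and '~' sorts before everything, even the end of the
# string (which is why "3.7~" admits "3.7" but rejects "3.6.7...").

def _char_order(c):
    # dpkg character ordering: '~' < end-of-string < letters < other chars
    if c == "~":
        return -1
    if c == "":
        return 0
    return ord(c) if c.isalpha() else ord(c) + 256

def dpkg_cmp(a, b):
    """Return <0, 0 or >0 as version a sorts before, equal to or after b."""
    i = j = 0
    while i < len(a) or j < len(b):
        # 1. Compare the non-digit prefixes character by character.
        while (i < len(a) and not a[i].isdigit()) or (j < len(b) and not b[j].isdigit()):
            ca = a[i] if i < len(a) and not a[i].isdigit() else ""
            cb = b[j] if j < len(b) and not b[j].isdigit() else ""
            if _char_order(ca) != _char_order(cb):
                return _char_order(ca) - _char_order(cb)
            i += 1
            j += 1
        # 2. Compare the digit runs numerically.
        da = db = ""
        while i < len(a) and a[i].isdigit():
            da += a[i]
            i += 1
        while j < len(b) and b[j].isdigit():
            db += b[j]
            j += 1
        if int(da or "0") != int(db or "0"):
            return int(da or "0") - int(db or "0")
    return 0

# Bionic's python3 fails the generated dependency `python3 (>= 3.7~)`:
print(dpkg_cmp("3.6.7-1~18.04", "3.7~") < 0)  # True: 3.6.7... sorts below 3.7~
print(dpkg_cmp("3.7", "3.7~") > 0)            # True: '~' marks a pre-release
```

This also shows why the `~` suffix in the generated constraint is harmless by itself; the problem is purely that `3.6.x < 3.7`.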
[jira] [Updated] (CASSANDRA-16478) Debian packages are broken since py3 migration
[ https://issues.apache.org/jira/browse/CASSANDRA-16478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski updated CASSANDRA-16478: - Description: [Repair tests|https://app.circleci.com/pipelines/github/riptano/cassandra-rtest?branch=alex%2Fupgrade-tlp-cluster-python3] started to fail after the builds moved to Python3 in CASSANDRA-16396 due to deb packages failing to install on Ubuntu Bionic: {noformat} $ sudo dpkg -i cassandra_4.0~beta5-20210303git64f54f9fb0_all.deb Selecting previously unselected package cassandra. (Reading database ... 117650 files and directories currently installed.) Preparing to unpack cassandra_4.0~beta5-20210303git64f54f9fb0_all.deb ... Unpacking cassandra (4.0~beta5-20210303git64f54f9fb0) ... dpkg: dependency problems prevent configuration of cassandra: cassandra depends on python (>= 3.6); however: Package python is not installed. cassandra depends on python3 (>= 3.7~); however: Version of python3 on system is 3.6.7-1~18.04.{noformat} It seems like the following requirements are not correct: {noformat} Depends: openjdk-8-jre-headless | java8-runtime, adduser, python (>= 3.6), ${misc:Depends}, ${python3:Depends}{noformat} I've changed this line to the following and got the deb packages to install correctly: {noformat} Depends: openjdk-8-jre-headless | java8-runtime, adduser, python3 (>= 3.6), ${misc:Depends}{noformat} was: Repair tests started to fail after the builds moved to Python3 in CASSANDRA-16396 due to deb packages failing to install on Ubuntu Bionic: {noformat} $ sudo dpkg -i cassandra_4.0~beta5-20210303git64f54f9fb0_all.deb Selecting previously unselected package cassandra. (Reading database ... 117650 files and directories currently installed.) Preparing to unpack cassandra_4.0~beta5-20210303git64f54f9fb0_all.deb ... Unpacking cassandra (4.0~beta5-20210303git64f54f9fb0) ... 
dpkg: dependency problems prevent configuration of cassandra: cassandra depends on python (>= 3.6); however: Package python is not installed. cassandra depends on python3 (>= 3.7~); however: Version of python3 on system is 3.6.7-1~18.04.{noformat} It seems like the following requirements are not correct: {noformat} Depends: openjdk-8-jre-headless | java8-runtime, adduser, python (>= 3.6), ${misc:Depends}, ${python3:Depends}{noformat} I've changed this line to the following and got the deb packages to install correctly: {noformat} Depends: openjdk-8-jre-headless | java8-runtime, adduser, python3 (>= 3.6), ${misc:Depends}{noformat} > Debian packages are broken since py3 migration > -- > > Key: CASSANDRA-16478 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16478 > Project: Cassandra > Issue Type: Bug > Components: Packaging > Reporter: Alexander Dejanovski > Assignee: Alexander Dejanovski > Priority: Normal > Fix For: 4.0-beta
[jira] [Created] (CASSANDRA-16478) Debian packages are broken since py3 migration
Alexander Dejanovski created CASSANDRA-16478: Summary: Debian packages are broken since py3 migration Key: CASSANDRA-16478 URL: https://issues.apache.org/jira/browse/CASSANDRA-16478 Project: Cassandra Issue Type: Bug Components: Packaging Reporter: Alexander Dejanovski Assignee: Alexander Dejanovski Repair tests started to fail after the builds moved to Python3 in CASSANDRA-16396 due to deb packages failing to install on Ubuntu Bionic: {noformat} $ sudo dpkg -i cassandra_4.0~beta5-20210303git64f54f9fb0_all.deb Selecting previously unselected package cassandra. (Reading database ... 117650 files and directories currently installed.) Preparing to unpack cassandra_4.0~beta5-20210303git64f54f9fb0_all.deb ... Unpacking cassandra (4.0~beta5-20210303git64f54f9fb0) ... dpkg: dependency problems prevent configuration of cassandra: cassandra depends on python (>= 3.6); however: Package python is not installed. cassandra depends on python3 (>= 3.7~); however: Version of python3 on system is 3.6.7-1~18.04.{noformat} It seems like the following requirements are not correct: {noformat} Depends: openjdk-8-jre-headless | java8-runtime, adduser, python (>= 3.6), ${misc:Depends}, ${python3:Depends}{noformat} I've changed this line to the following and got the deb packages to install correctly: {noformat} Depends: openjdk-8-jre-headless | java8-runtime, adduser, python3 (>= 3.6), ${misc:Depends}{noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-16478) Debian packages are broken since py3 migration
[ https://issues.apache.org/jira/browse/CASSANDRA-16478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski updated CASSANDRA-16478: - Fix Version/s: 4.0-beta > Debian packages are broken since py3 migration > -- > > Key: CASSANDRA-16478 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16478 > Project: Cassandra > Issue Type: Bug > Components: Packaging > Reporter: Alexander Dejanovski > Assignee: Alexander Dejanovski > Priority: Normal > Fix For: 4.0-beta
[jira] [Commented] (CASSANDRA-16244) Create a jvm upgrade dtest for mixed versions repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-16244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277996#comment-17277996 ] Alexander Dejanovski commented on CASSANDRA-16244: -- LGTM [~adelapena] (y) Thanks! > Create a jvm upgrade dtest for mixed versions repairs > - > > Key: CASSANDRA-16244 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16244 > Project: Cassandra > Issue Type: Task > Components: Test/dtest/java > Reporter: Alexander Dejanovski > Assignee: Andres de la Peña > Priority: Normal > Fix For: 4.0-rc > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Repair during upgrades should fail on mixed-version clusters. > We'd need an in-jvm upgrade dtest to check that repair indeed fails as expected on clusters mixing the current version with the previous major version. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
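The behaviour this dtest verifies can be reduced to a version gate: repair must be refused when participating nodes report different release lines. A minimal illustrative sketch (the helper name and the major.minor granularity are assumptions, not Cassandra's actual API):

```python
def repair_allowed(release_versions):
    """Refuse repair when nodes report different major.minor release lines.

    `release_versions` is an iterable of version strings as reported by
    each node; this helper is hypothetical, not Cassandra's real check.
    """
    lines = {tuple(v.split(".")[:2]) for v in release_versions}
    return len(lines) == 1

# A 4.0 node mixed with 3.11 nodes must refuse to repair:
print(repair_allowed(["4.0.0", "3.11.10", "3.11.10"]))  # False
print(repair_allowed(["4.0.0", "4.0.0", "4.0.0"]))      # True
```

The in-jvm upgrade dtest then asserts the negative case: on a cluster mixing the current version with the previous major, triggering repair must fail rather than proceed.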
[jira] [Commented] (CASSANDRA-16362) SSLFactory should initialize SSLContext before setting protocols
[ https://issues.apache.org/jira/browse/CASSANDRA-16362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17274439#comment-17274439 ] Alexander Dejanovski commented on CASSANDRA-16362: -- Here's a [full green CI run|https://github.com/thelastpickle/cassandra-medusa/actions/runs/520478863] using the branch from this patch for 4.0. I've tried with different TLS settings (PROTOCOL_TLSv1 and PROTOCOL_TLSv1_2) and it worked in both cases.
> SSLFactory should initialize SSLContext before setting protocols > > > Key: CASSANDRA-16362 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16362 > Project: Cassandra > Issue Type: Bug > Components: Tool/bulk load > Reporter: Erik Merkle > Assignee: Jon Meredith > Priority: Normal > Fix For: 4.0-beta5 > > Time Spent: 1h 10m > Remaining Estimate: 0h > >
> Trying to use sstableloader from the latest trunk produced the following Exception:
> {quote}
> Exception in thread "main" java.lang.RuntimeException: Could not create SSL Context.
> at org.apache.cassandra.tools.BulkLoader.buildSSLOptions(BulkLoader.java:261)
> at org.apache.cassandra.tools.BulkLoader.load(BulkLoader.java:64)
> at org.apache.cassandra.tools.BulkLoader.main(BulkLoader.java:49)
> Caused by: java.io.IOException: Error creating/initializing the SSL Context
> at org.apache.cassandra.security.SSLFactory.createSSLContext(SSLFactory.java:184)
> at org.apache.cassandra.tools.BulkLoader.buildSSLOptions(BulkLoader.java:257)
> ... 2 more
> Caused by: java.lang.IllegalStateException: SSLContext is not initialized
> at sun.security.ssl.SSLContextImpl.engineGetSocketFactory(SSLContextImpl.java:208)
> at javax.net.ssl.SSLContextSpi.getDefaultSocket(SSLContextSpi.java:158)
> at javax.net.ssl.SSLContextSpi.engineGetDefaultSSLParameters(SSLContextSpi.java:184)
> at javax.net.ssl.SSLContext.getDefaultSSLParameters(SSLContext.java:435)
> at org.apache.cassandra.security.SSLFactory.createSSLContext(SSLFactory.java:178)
> ... 3 more
> {quote}
> I believe this is because of a change to SSLFactory for CASSANDRA-13325 here: [https://github.com/apache/cassandra/commit/919a8964a83511d96766c3e53ba603e77bca626c#diff-0d569398cfd58566fc56bfb80c971a72afe3f392addc2df731a0b44baf29019eR177-R178] > > I think the solution is to call {{ctx.init()}} before trying to call {{ctx.getDefaultSSLParameters()}}, essentially swapping the two lines in the link above. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-16362) SSLFactory should initialize SSLContext before setting protocols
[ https://issues.apache.org/jira/browse/CASSANDRA-16362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17274193#comment-17274193 ] Alexander Dejanovski commented on CASSANDRA-16362: -- Hey folks, sorry it took me a while to get to the bottom of it. The issue we were having was due to the [storage port being changed in our integration tests|https://github.com/thelastpickle/cassandra-medusa/blob/master/tests/integration/features/steps/integration_steps.py#L151], which apparently was not making ccm happy with 4.0, as the nodes wouldn't recognize themselves as seeds. I'm positive that [this worked in the past|https://github.com/thelastpickle/cassandra-medusa/runs/1449108534?check_suite_focus=true], so I have no clue why it suddenly started failing, nor why it would still pass locally on my laptop. There's definitely something fishy with the way some versions of ccm (I lose track of which branches support 4.0) deal with changing the storage port and how that impacts the seed list. But the good news is that as soon as I removed the storage port change, the [tests went green|https://github.com/thelastpickle/cassandra-medusa/runs/1789721180?check_suite_focus=true] using the C16362 branch (/) +1 for merge, and I'll set up CI properly again in Medusa to get tests running on trunk. I'll try to investigate further the issue with ccm and the custom storage port. > SSLFactory should initialize SSLContext before setting protocols > > > Key: CASSANDRA-16362 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16362 > Project: Cassandra > Issue Type: Bug > Components: Tool/bulk load > Reporter: Erik Merkle > Assignee: Jon Meredith > Priority: Normal > Fix For: 4.0-beta5
[jira] [Commented] (CASSANDRA-16406) Debug logging affects repair performance
[ https://issues.apache.org/jira/browse/CASSANDRA-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17273398#comment-17273398 ] Alexander Dejanovski commented on CASSANDRA-16406: -- Done, and I attached the updated patch to the ticket. > Debug logging affects repair performance > > > Key: CASSANDRA-16406 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16406 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Repair > Reporter: Alexander Dejanovski > Assignee: Alexander Dejanovski > Priority: Normal > Fix For: 4.0-rc > > Attachments: 16406-2-trunk.txt, CASSANDRA-16406.png, with_debug_logging.png, without_debug_logging.png > > > While working on the repair quality testing in CASSANDRA-16245, it appeared that the node coordinating repairs on a 20GB-per-node dataset was generating more than 2GB of logs, with a total duration for the incremental repair scenarios of ~2h40m: > https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/37/workflows/6a7a41c8-0fca-4080-b37e-3b38998b3fab/jobs/49/steps > !with_debug_logging.png! 
> The logs showed a lot of messages from the MerkleTree class at high pace: > {noformat} > DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) > Fully inconsistent range [# (-6738651767434294419,-6738622715859497972] depth=11>, # (-6738622715859497972,-6738593664284701525] depth=11>] > DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (9) > Inconsistent digest on right sub-range # (-6738593664284701525,-6738535561135108630] depth=10>: [# -6738564612709905078 > hash=[b8efd3d684474276f316b1bc9f23b463cda4f8d620a4b5cf4d2c7dbb101bbe85] > children=[# [fff431a30da07558c5106897da9f3fc3bcc598fb59e67c6959f4e70b6d3a7722]> > # [471be27589e7372e3606d92b45bc8ba07161602d7942c9a614d89ab07d21c9a7]>]>, > # hash=[95334327a0b50b7d3c51d6588a5f3d57701e0b978f6ab28e85cda3cb5a094eb5] > children=[# [0d2c5c21f6b04b098f71f43d0f3f3a5a823b372eb850261774847c456629bedd]> > # [981f1f0656054074b32022658560070df2253cb9373a9499f149df8e3c20f068]>]>] > DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:262 - (10) > Hashing sub-ranges [# depth=11>, #] > for # divided > by midpoint -6738564612709905078 > DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:272 - (10) > Inconsistent digest on left sub-range # (-6738593664284701525,-6738564612709905078] depth=11>: [# [fff431a30da07558c5106897da9f3fc3bcc598fb59e67c6959f4e70b6d3a7722]>, > # [0d2c5c21f6b04b098f71f43d0f3f3a5a823b372eb850261774847c456629bedd]>] > DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (10) > Inconsistent digest on right sub-range # (-6738564612709905078,-6738535561135108630] depth=11>: [# [471be27589e7372e3606d92b45bc8ba07161602d7942c9a614d89ab07d21c9a7]>, > # [981f1f0656054074b32022658560070df2253cb9373a9499f149df8e3c20f068]>] > DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) > Fully inconsistent range [# (-6738593664284701525,-6738564612709905078] depth=11>, # (-6738564612709905078,-6738535561135108630] depth=11>] > 
DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (9) > Fully inconsistent range [# (-6738651767434294419,-6738593664284701525] depth=10>, # (-6738593664284701525,-6738535561135108630] depth=10>] > DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (8) > Inconsistent digest on right sub-range # (-6738535561135108630,-6738419354835922841] depth=9>: [# -6738477457985515736 > hash=[806ede1a35783bf5fafe8b8ccefe4d3ff48e8f0e1314f8a9ce4b23f13fed4bf9] > children=[# hash=[e6d133afbd8041266f8a1cfe456ff07c9e7debe8ff54279b579b252f2af78b6b] > children=[#]> # hash=[66bfedb588f87ad3957497728b91bd436af364e6ec40df3299d006de151ac092] > children=[#]>]>, # hash=[bf128431ddf7ad72417ce853f90abd0bc7a2a38088d5fbec864c02c516ed6458] > children=[# hash=[12a71b62ceb096369c4797f56edd91a16dd9803348b4f4cfe1454027d7dae3d2] > children=[#]> # hash=[adb59f5313473b44dd3b7fa697d72c7b23b3c0610f23670942e2c137878a] > children=[#]>]>]{noformat} > When disabling debug logging, the duration dropped to ~2h05m with decent log > sizes: > [https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/38/workflows/c91e6ceb-438b-4f00-b38a-d670f9afb4c3/jobs/51] > !without_debug_logging.png! > There's apparently too much logging for each inconsistency found in the > Merkle tree comparisons and we should move this to TRACE level if we still > want to allow debug logging to be turned on by default. > I'll prepare a patch for the MerkleTree class and run the repair testing > scenarios again to verify their duration. -- This message was sent by Atlassian Jira (v8.3.4#803005)
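The change proposed above, demoting MerkleTree's per-inconsistency messages from DEBUG to TRACE so that a debug-enabled default no longer emits one line per inconsistent range, can be sketched with a self-contained toy example. This is not Cassandra's actual MerkleTree code: it uses java.util.logging (FINE as the DEBUG analogue, FINEST as TRACE) in place of Cassandra's slf4j/logback setup, and the method and message are hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class MerkleLogDemo {
    private static final Logger logger = Logger.getLogger("MerkleTree");

    static int emitted = 0;

    // Hypothetical stand-in for MerkleTree's per-inconsistency logging,
    // moved to the TRACE-equivalent level (FINEST) and guarded by a level
    // check so the message is neither built nor written at DEBUG.
    static void reportInconsistency(long left, long right) {
        if (logger.isLoggable(Level.FINEST)) {
            emitted++;
            logger.finest("Fully inconsistent range (" + left + "," + right + "]");
        }
    }

    public static void main(String[] args) {
        logger.setLevel(Level.FINE); // DEBUG-equivalent default
        for (int i = 0; i < 1_000_000; i++) {
            reportInconsistency(i, i + 1);
        }
        // With per-range messages at TRACE, a DEBUG default emits nothing,
        // which is the log-volume reduction the ticket is after.
        System.out.println(emitted);
    }
}
```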
[jira] [Updated] (CASSANDRA-16406) Debug logging affects repair performance
[ https://issues.apache.org/jira/browse/CASSANDRA-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski updated CASSANDRA-16406: - Attachment: 16406-2-trunk.txt
[jira] [Updated] (CASSANDRA-16406) Debug logging affects repair performance
[ https://issues.apache.org/jira/browse/CASSANDRA-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski updated CASSANDRA-16406: - Attachment: (was: 16406-trunk.txt)
[jira] [Updated] (CASSANDRA-16406) Debug logging affects repair performance
[ https://issues.apache.org/jira/browse/CASSANDRA-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski updated CASSANDRA-16406: - Test and Documentation Plan: Here's the CircleCI link for the patched run of the repair quality tests: [https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/73/workflows/0828fcad-26d2-43d6-9c8c-7a4102b0e31c] The full and incremental test runs finished 30 minutes faster than on current trunk. Status: Patch Available (was: In Progress)
[jira] [Commented] (CASSANDRA-16406) Debug logging affects repair performance
[ https://issues.apache.org/jira/browse/CASSANDRA-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17273386#comment-17273386 ] Alexander Dejanovski commented on CASSANDRA-16406: -- I ran the repair quality tests using the [patched branch|https://github.com/apache/cassandra/compare/trunk...adejanovski:CASSANDRA-16406?expand=1] and got the expected 30-minute reduction on the full and incremental test suites: !CASSANDRA-16406.png! [PR|https://github.com/apache/cassandra/pull/881] [Branch|https://github.com/adejanovski/cassandra/tree/CASSANDRA-16406] I've attached the patch.
[jira] [Updated] (CASSANDRA-16406) Debug logging affects repair performance
[ https://issues.apache.org/jira/browse/CASSANDRA-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski updated CASSANDRA-16406: - Attachment: 16406-trunk.txt
[jira] [Updated] (CASSANDRA-16406) Debug logging affects repair performance
[ https://issues.apache.org/jira/browse/CASSANDRA-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski updated CASSANDRA-16406: - Attachment: CASSANDRA-16406.png
[jira] [Commented] (CASSANDRA-16362) SSLFactory should initialize SSLContext before setting protocols
[ https://issues.apache.org/jira/browse/CASSANDRA-16362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17273053#comment-17273053 ] Alexander Dejanovski commented on CASSANDRA-16362: -- It's still failing in CI while passing locally. I don't have a clue yet why it doesn't work there, and GHA doesn't allow SSHing into the CI instances :( It seems more like a problem with our CI than with your patch. I'll spend some time on this issue tomorrow and update this ticket. > SSLFactory should initialize SSLContext before setting protocols > > > Key: CASSANDRA-16362 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16362 > Project: Cassandra > Issue Type: Bug > Components: Tool/bulk load >Reporter: Erik Merkle >Assignee: Jon Meredith >Priority: Normal > Fix For: 4.0-beta5 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Trying to use sstableloader from the latest trunk produced the following > Exception: > {quote} > Exception in thread "main" java.lang.RuntimeException: Could not create SSL > Context. > at > org.apache.cassandra.tools.BulkLoader.buildSSLOptions(BulkLoader.java:261) > at org.apache.cassandra.tools.BulkLoader.load(BulkLoader.java:64) > at org.apache.cassandra.tools.BulkLoader.main(BulkLoader.java:49) > Caused by: java.io.IOException: Error creating/initializing the SSL Context > at > org.apache.cassandra.security.SSLFactory.createSSLContext(SSLFactory.java:184) > at > org.apache.cassandra.tools.BulkLoader.buildSSLOptions(BulkLoader.java:257) > ... 
2 more > Caused by: java.lang.IllegalStateException: SSLContext is not initialized > at > sun.security.ssl.SSLContextImpl.engineGetSocketFactory(SSLContextImpl.java:208) > at javax.net.ssl.SSLContextSpi.getDefaultSocket(SSLContextSpi.java:158) > at > javax.net.ssl.SSLContextSpi.engineGetDefaultSSLParameters(SSLContextSpi.java:184) > at javax.net.ssl.SSLContext.getDefaultSSLParameters(SSLContext.java:435) > at > org.apache.cassandra.security.SSLFactory.createSSLContext(SSLFactory.java:178) > ... 3 more > {quote} > I believe this is because of a change to SSLFactory for CASSANDRA-13325 here: > [https://github.com/apache/cassandra/commit/919a8964a83511d96766c3e53ba603e77bca626c#diff-0d569398cfd58566fc56bfb80c971a72afe3f392addc2df731a0b44baf29019eR177-R178] > > I think the solution is to call {{ctx.init()}} before trying to call > {{ctx.getDefaultSSLParameters()}}, essentially swapping the two lines in the > link above. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
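The suggested ordering can be illustrated with a minimal, self-contained JSSE snippet (this is not Cassandra's actual SSLFactory code, just the standard javax.net.ssl API): initializing the context before asking it for default parameters avoids the IllegalStateException in the stack trace above.

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLParameters;

public class SslInitOrder {
    public static void main(String[] args) throws Exception {
        SSLContext ctx = SSLContext.getInstance("TLS");

        // Wrong order: querying an uninitialized context, e.g.
        //   ctx.getDefaultSSLParameters();
        // here, throws "IllegalStateException: SSLContext is not initialized".

        // Right order: init first (nulls fall back to the JVM's default
        // key managers, trust managers, and SecureRandom).
        ctx.init(null, null, null);
        SSLParameters params = ctx.getDefaultSSLParameters();

        // An initialized context reports a non-empty set of enabled protocols.
        System.out.println(params.getProtocols().length > 0);
    }
}
```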
[jira] [Updated] (CASSANDRA-16245) Implement repair quality test scenarios
[ https://issues.apache.org/jira/browse/CASSANDRA-16245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski updated CASSANDRA-16245: - Authors: Alexander Dejanovski, Radovan Zvoncek (was: Radovan Zvoncek) Test and Documentation Plan: Perform repairs for a 3-node cluster using m5ad.xlarge instances. Repaired keyspaces will use RF=3 or RF=2 (the latter is for subranges with different sets of replicas).
||Mode||Version||Settings||Checks||
|Full repair|trunk|Sequential + All token ranges|"No anticompaction (repairedAt==0) Out of sync ranges > 0 Subsequent run must show no out of sync range"|
|Full repair|trunk|Parallel + Primary range|"No anticompaction (repairedAt==0) Out of sync ranges > 0 Subsequent run must show no out of sync range"|
|Full repair|trunk|Force terminate repair shortly after it was triggered|Repair threads must be cleaned up|
|Subrange repair|trunk|Sequential + single token range|"No anticompaction (repairedAt==0) Out of sync ranges > 0 Subsequent run must show no out of sync range"|
|Subrange repair|trunk|Parallel + 10 token ranges which have the same replicas|"No anticompaction (repairedAt == 0) Out of sync ranges > 0 Subsequent run must show no out of sync range A single repair session will handle all subranges at once"|
|Subrange repair|trunk|Parallel + 10 token ranges which have different replicas|"No anticompaction (repairedAt==0) Out of sync ranges > 0 Subsequent run must show no out of sync range More than one repair session is triggered to process all subranges"|
|Incremental repair|trunk|"Parallel (mandatory) No compaction during repair"|"Anticompaction status (repairedAt != 0) on all SSTables No pending repair on SSTables after completion (could require to wait a bit as this will happen asynchronously) Out of sync ranges > 0 + Subsequent run must show no out of sync range"|
|Incremental repair|trunk|"Parallel (mandatory) Major compaction triggered during repair"|"Anticompaction status (repairedAt != 0) on all SSTables No pending repair on SSTables after completion (could require to wait a bit as this will happen asynchronously) Out of sync ranges > 0 + Subsequent run must show no out of sync range"|
|Incremental repair|trunk|Force terminate repair shortly after it was triggered.|Repair threads must be cleaned up|
Status: Patch Available (was: In Progress)
> Implement repair quality test scenarios > --- > > Key: CASSANDRA-16245 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16245 > Project: Cassandra > Issue Type: Task > Components: Test/dtest/java >Reporter: Alexander Dejanovski >Assignee: Radovan Zvoncek >Priority: Normal > Fix For: 4.0-rc > > > Implement the following test scenarios in a new test suite for repair > integration testing with significant load: > Generate/restore a workload of ~100GB per node. Medusa should be considered > to create the initial backup which could then be restored from an S3 bucket > to speed up node population. > Data should on purpose require repair and be generated accordingly. > Perform repairs for a 3-node cluster with 4 cores each and 16GB-32GB RAM > (m5d.xlarge instances would be the most cost efficient type). > Repaired keyspaces will use RF=3 or RF=2 in some cases (the latter is for > subranges with different sets of replicas). 
> ||Mode||Version||Settings||Checks|| > |Full repair|trunk|Sequential + All token ranges|"No anticompaction > (repairedAt==0) > Out of sync ranges > 0 > Subsequent run must show no out of sync range"| > |Full repair|trunk|Parallel + Primary range|"No anticompaction (repairedAt==0) > Out of sync ranges > 0 > Subsequent run must show no out of sync range"| > |Full repair|trunk|Force terminate repair shortly after it was > triggered|Repair threads must be cleaned up| > |Subrange repair|trunk|Sequential + single token range|"No anticompaction > (repairedAt==0) > Out of sync ranges > 0 > Subsequent run must show no out of sync range"| > |Subrange repair|trunk|Parallel + 10 token ranges which have the same > replicas|"No anticompaction (repairedAt == 0) > Out of sync ranges > 0 > Subsequent run must show no out of sync range > A single repair session will handle all subranges at once"| > |Subrange repair|trunk|Parallel + 10 token ranges which have different > replicas|"No anticompaction (repairedAt==0) > Out of sync ranges > 0 > Subsequent run must show no out of sync range > More than one repair session is triggered to process all subranges"| > |Incremental repair|trunk|"Parallel (mandatory) > No compaction during repair"|"Anticompaction status (repairedAt != 0) on all > SSTables > No pending repair on SSTables after completion (could require to wait a bit > as this
[jira] [Commented] (CASSANDRA-15580) 4.0 quality testing: Repair
[ https://issues.apache.org/jira/browse/CASSANDRA-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17271532#comment-17271532 ] Alexander Dejanovski commented on CASSANDRA-15580: -- *Status update* CASSANDRA-16245 is close to done with [nightly CI runs|https://app.circleci.com/pipelines/github/riptano/cassandra-rtest?branch=trunk] scheduled already. CASSANDRA-16244 has a patch available which is under review. > 4.0 quality testing: Repair > --- > > Key: CASSANDRA-15580 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15580 > Project: Cassandra > Issue Type: Task > Components: Test/dtest/python >Reporter: Josh McKenzie >Assignee: Alexander Dejanovski >Priority: Normal > Fix For: 4.0-rc > > > Reference [doc from > NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#] > for context. > *Shepherd: Alexander Dejanovski* > We aim for 4.0 to have the first fully functioning incremental repair > solution (CASSANDRA-9143)! Furthermore we aim to verify that all types of > repair: (full range, sub range, incremental) function as expected as well as > ensuring community tools such as Reaper work. CASSANDRA-3200 adds an > experimental option to reduce the amount of data streamed during repair, we > should write more tests and see how it works with big nodes.
[jira] [Commented] (CASSANDRA-16245) Implement repair quality test scenarios
[ https://issues.apache.org/jira/browse/CASSANDRA-16245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17271530#comment-17271530 ] Alexander Dejanovski commented on CASSANDRA-16245: -- Status update: The test scenarios described in this ticket were implemented and are now scheduled for [nightly runs in CircleCI|https://app.circleci.com/pipelines/github/riptano/cassandra-rtest?branch=trunk] against trunk. We had to reduce the density per node to 20GB for now as the tests take a while to run already. We may generate additional data without adding more entropy to see how that impacts the execution times. [One last PR|https://github.com/riptano/cassandra-rtest/pull/4] is waiting to be merged to fix the code style and use the Cassandra code conventions, and also complement the push triggered CI runs with the CCM based test scenarios which are used for development purposes. [~vinaykumarcse], are you still willing to do a review on the code? I guess it can wait until we get a consensus on whether we integrate this repair test to the Cassandra repo or not, but I'd be happy to get your feedback already. > Implement repair quality test scenarios > --- > > Key: CASSANDRA-16245 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16245 > Project: Cassandra > Issue Type: Task > Components: Test/dtest/java >Reporter: Alexander Dejanovski >Assignee: Radovan Zvoncek >Priority: Normal > Fix For: 4.0-rc > > > Implement the following test scenarios in a new test suite for repair > integration testing with significant load: > Generate/restore a workload of ~100GB per node. Medusa should be considered > to create the initial backup which could then be restored from an S3 bucket > to speed up node population. > Data should on purpose require repair and be generated accordingly. > Perform repairs for a 3 nodes cluster with 4 cores each and 16GB-32GB RAM > (m5d.xlarge instances would be the most cost efficient type). 
> Repaired keyspaces will use RF=3 or RF=2 in some cases (the latter is for > subranges with different sets of replicas). > ||Mode||Version||Settings||Checks|| > |Full repair|trunk|Sequential + All token ranges|"No anticompaction > (repairedAt==0) > Out of sync ranges > 0 > Subsequent run must show no out of sync range"| > |Full repair|trunk|Parallel + Primary range|"No anticompaction (repairedAt==0) > Out of sync ranges > 0 > Subsequent run must show no out of sync range"| > |Full repair|trunk|Force terminate repair shortly after it was > triggered|Repair threads must be cleaned up| > |Subrange repair|trunk|Sequential + single token range|"No anticompaction > (repairedAt==0) > Out of sync ranges > 0 > Subsequent run must show no out of sync range"| > |Subrange repair|trunk|Parallel + 10 token ranges which have the same > replicas|"No anticompaction (repairedAt == 0) > Out of sync ranges > 0 > Subsequent run must show no out of sync range > A single repair session will handle all subranges at once"| > |Subrange repair|trunk|Parallel + 10 token ranges which have different > replicas|"No anticompaction (repairedAt==0) > Out of sync ranges > 0 > Subsequent run must show no out of sync range > More than one repair session is triggered to process all subranges"| > |Incremental repair|trunk|"Parallel (mandatory) > No compaction during repair"|"Anticompaction status (repairedAt != 0) on all > SSTables > No pending repair on SSTables after completion (could require to wait a bit > as this will happen asynchronously) > Out of sync ranges > 0 + Subsequent run must show no out of sync range"| > |Incremental repair|trunk|"Parallel (mandatory) > Major compaction triggered during repair"|"Anticompaction status (repairedAt > != 0) on all SSTables > No pending repair on SSTables after completion (could require to wait a bit > as this will happen asynchronously) > Out of sync ranges > 0 + Subsequent run must show no out of sync range"| > |Incremental repair|trunk|Force terminate 
repair shortly after it was > triggered.|Repair threads must be cleaned up|
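The repairedAt checks in the scenario table above reduce to one invariant per repair mode. A hypothetical sketch of that predicate (the names are illustrative and not part of the actual rtest suite):

```java
public class RepairedAtCheck
{
    enum Mode { FULL, SUBRANGE, INCREMENTAL }

    // Full and subrange repairs must not anticompact (repairedAt stays 0);
    // incremental repair must mark SSTables as repaired (repairedAt != 0).
    static boolean repairedAtIsConsistent(Mode mode, long repairedAt)
    {
        return mode == Mode.INCREMENTAL ? repairedAt != 0 : repairedAt == 0;
    }

    public static void main(String[] args)
    {
        System.out.println(repairedAtIsConsistent(Mode.FULL, 0L));        // true
        System.out.println(repairedAtIsConsistent(Mode.INCREMENTAL, 0L)); // false
    }
}
```

Each scenario's first check is then a single assertion over every SSTable's metadata after the repair completes.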
[jira] [Commented] (CASSANDRA-16244) Create a jvm upgrade dtest for mixed versions repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-16244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17271351#comment-17271351 ] Alexander Dejanovski commented on CASSANDRA-16244: -- Hi [~adelapena], looking at the patch it seems that we could hide the upgraded node behavior by timing out while waiting for the message to show up each time. Correct me if I misunderstood, but the current behavior is:
* Loop through all nodes one after the other
* Start a repair using nodetool, which will time out after 10s but is expected to fail with a specific error message on the upgraded node
* If no exception was triggered, check that the logs contain the expected message
* Catch the TimeoutException and assume we're dealing with a non-upgraded node
Isn't it possible that assuming we're dealing with a non-upgraded node whenever we get a timeout could hide edge cases where the upgraded node doesn't behave as expected and times out? We could then get the test to succeed although we're not getting the expected behavior. Let me know if I'm missing something. > Create a jvm upgrade dtest for mixed versions repairs > - > > Key: CASSANDRA-16244 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16244 > Project: Cassandra > Issue Type: Task > Components: Test/dtest/java >Reporter: Alexander Dejanovski >Assignee: Andres de la Peña >Priority: Normal > Fix For: 4.0-rc > > Time Spent: 20m > Remaining Estimate: 0h > > Repair during upgrades should fail on mixed version clusters. > We'd need an in-jvm upgrade dtest to check that repair indeed fails as > expected with mixed current version+previous major version clusters.
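The concern in the comment above, that a timeout alone cannot distinguish a non-upgraded node from a misbehaving upgraded one, can be made concrete with a toy sketch. This is not the in-jvm dtest API; all names are hypothetical:

```java
public class RepairExpectation
{
    enum NodeState { UPGRADED, NOT_UPGRADED }
    enum Outcome { FAILED_WITH_MESSAGE, TIMED_OUT }

    // Derive the expected outcome from what we already know about the node,
    // instead of inferring "non-upgraded" from the timeout itself.
    static boolean passes(NodeState state, Outcome outcome)
    {
        if (state == NodeState.UPGRADED)
            // An upgraded node must fail fast with the mixed-version error;
            // a bare timeout is a test failure here, not a pass.
            return outcome == Outcome.FAILED_WITH_MESSAGE;
        return outcome == Outcome.TIMED_OUT;
    }

    public static void main(String[] args)
    {
        // The edge case: an upgraded node that silently times out must not pass.
        System.out.println(passes(NodeState.UPGRADED, Outcome.TIMED_OUT)); // false
    }
}
```

With the expectation fixed up front, an upgraded node that times out fails the test instead of being misread as a non-upgraded one.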
[jira] [Commented] (CASSANDRA-16362) SSLFactory should initialize SSLContext before setting protocols
[ https://issues.apache.org/jira/browse/CASSANDRA-16362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17271338#comment-17271338 ] Alexander Dejanovski commented on CASSANDRA-16362: -- Awesome findings [~jmeredithco]! I managed to have the tests pass using your new branch on my laptop but they're still failing in CI for some odd reason: [https://github.com/thelastpickle/cassandra-medusa/runs/1762379270?check_suite_focus=true] I'll investigate further to see where the problem lies specifically and update here. > SSLFactory should initialize SSLContext before setting protocols > > > Key: CASSANDRA-16362 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16362 > Project: Cassandra > Issue Type: Bug > Components: Tool/bulk load >Reporter: Erik Merkle >Assignee: Jon Meredith >Priority: Normal > Fix For: 4.0-beta5 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Trying to use sstableloader from the latest trunk produced the following > Exception: > {quote} > Exception in thread "main" java.lang.RuntimeException: Could not create SSL > Context. > at > org.apache.cassandra.tools.BulkLoader.buildSSLOptions(BulkLoader.java:261) > at org.apache.cassandra.tools.BulkLoader.load(BulkLoader.java:64) > at org.apache.cassandra.tools.BulkLoader.main(BulkLoader.java:49) > Caused by: java.io.IOException: Error creating/initializing the SSL Context > at > org.apache.cassandra.security.SSLFactory.createSSLContext(SSLFactory.java:184) > at > org.apache.cassandra.tools.BulkLoader.buildSSLOptions(BulkLoader.java:257) > ... 
2 more > Caused by: java.lang.IllegalStateException: SSLContext is not initialized > at > sun.security.ssl.SSLContextImpl.engineGetSocketFactory(SSLContextImpl.java:208) > at javax.net.ssl.SSLContextSpi.getDefaultSocket(SSLContextSpi.java:158) > at > javax.net.ssl.SSLContextSpi.engineGetDefaultSSLParameters(SSLContextSpi.java:184) > at javax.net.ssl.SSLContext.getDefaultSSLParameters(SSLContext.java:435) > at > org.apache.cassandra.security.SSLFactory.createSSLContext(SSLFactory.java:178) > ... 3 more > {quote} > I believe this is because of a change to SSLFactory for CASSANDRA-13325 here: > [https://github.com/apache/cassandra/commit/919a8964a83511d96766c3e53ba603e77bca626c#diff-0d569398cfd58566fc56bfb80c971a72afe3f392addc2df731a0b44baf29019eR177-R178] > > I think the solution is to call {{ctx.init()}} before trying to call > {{ctx.getDefaultSSLParameters()}}, essentially swapping the two lines in the > link above.
[jira] [Updated] (CASSANDRA-16406) Debug logging affects repair performance
[ https://issues.apache.org/jira/browse/CASSANDRA-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski updated CASSANDRA-16406: - Fix Version/s: 4.0-rc > Debug logging affects repair performance > > > Key: CASSANDRA-16406 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16406 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Repair >Reporter: Alexander Dejanovski >Assignee: Alexander Dejanovski >Priority: Normal > Fix For: 4.0-rc > > Attachments: with_debug_logging.png, without_debug_logging.png > > > While working on the repair quality testing in CASSANDRA-16245, it appeared > that the node coordinating repairs on a 20GB per node dataset was generating > more than 2G of log with a total duration for the incremental repair > scenarios of ~2h40m: > https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/37/workflows/6a7a41c8-0fca-4080-b37e-3b38998b3fab/jobs/49/steps > ] > !with_debug_logging.png! > The logs showed a lot of messages from the MerkleTree class at high pace: > {noformat} > DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) > Fully inconsistent range [# (-6738651767434294419,-6738622715859497972] depth=11>, # (-6738622715859497972,-6738593664284701525] depth=11>] > DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (9) > Inconsistent digest on right sub-range # (-6738593664284701525,-6738535561135108630] depth=10>: [# -6738564612709905078 > hash=[b8efd3d684474276f316b1bc9f23b463cda4f8d620a4b5cf4d2c7dbb101bbe85] > children=[# [fff431a30da07558c5106897da9f3fc3bcc598fb59e67c6959f4e70b6d3a7722]> > # [471be27589e7372e3606d92b45bc8ba07161602d7942c9a614d89ab07d21c9a7]>]>, > # hash=[95334327a0b50b7d3c51d6588a5f3d57701e0b978f6ab28e85cda3cb5a094eb5] > children=[# [0d2c5c21f6b04b098f71f43d0f3f3a5a823b372eb850261774847c456629bedd]> > # [981f1f0656054074b32022658560070df2253cb9373a9499f149df8e3c20f068]>]>] > DEBUG [RepairJobTask:4] 2021-01-21 
16:15:29,631 MerkleTree.java:262 - (10) > Hashing sub-ranges [# depth=11>, #] > for # divided > by midpoint -6738564612709905078 > DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:272 - (10) > Inconsistent digest on left sub-range # (-6738593664284701525,-6738564612709905078] depth=11>: [# [fff431a30da07558c5106897da9f3fc3bcc598fb59e67c6959f4e70b6d3a7722]>, > # [0d2c5c21f6b04b098f71f43d0f3f3a5a823b372eb850261774847c456629bedd]>] > DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (10) > Inconsistent digest on right sub-range # (-6738564612709905078,-6738535561135108630] depth=11>: [# [471be27589e7372e3606d92b45bc8ba07161602d7942c9a614d89ab07d21c9a7]>, > # [981f1f0656054074b32022658560070df2253cb9373a9499f149df8e3c20f068]>] > DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) > Fully inconsistent range [# (-6738593664284701525,-6738564612709905078] depth=11>, # (-6738564612709905078,-6738535561135108630] depth=11>] > DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (9) > Fully inconsistent range [# (-6738651767434294419,-6738593664284701525] depth=10>, # (-6738593664284701525,-6738535561135108630] depth=10>] > DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (8) > Inconsistent digest on right sub-range # (-6738535561135108630,-6738419354835922841] depth=9>: [# -6738477457985515736 > hash=[806ede1a35783bf5fafe8b8ccefe4d3ff48e8f0e1314f8a9ce4b23f13fed4bf9] > children=[# hash=[e6d133afbd8041266f8a1cfe456ff07c9e7debe8ff54279b579b252f2af78b6b] > children=[#]> # hash=[66bfedb588f87ad3957497728b91bd436af364e6ec40df3299d006de151ac092] > children=[#]>]>, # hash=[bf128431ddf7ad72417ce853f90abd0bc7a2a38088d5fbec864c02c516ed6458] > children=[# hash=[12a71b62ceb096369c4797f56edd91a16dd9803348b4f4cfe1454027d7dae3d2] > children=[#]> # hash=[adb59f5313473b44dd3b7fa697d72c7b23b3c0610f23670942e2c137878a] > children=[#]>]>]{noformat} > When disabling debug logging, the duration dropped 
to ~2h05m with decent log > sizes: > [https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/38/workflows/c91e6ceb-438b-4f00-b38a-d670f9afb4c3/jobs/51] > !without_debug_logging.png! > There's apparently too much logging for each inconsistency found in the > Merkle tree comparisons and we should move this to TRACE level if we still > want to allow debug logging to be turned on by default. > I'll prepare a patch for the MerkleTree class and run the repair testing > scenarios again to verify their duration. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands,
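Demoting the per-range messages from DEBUG to TRACE, as proposed above, helps because a level guard skips both the output and the expensive message construction. Cassandra itself uses slf4j/logback, so the following java.util.logging sketch is only an analogy (FINE roughly corresponds to DEBUG, FINEST to TRACE); the names are illustrative:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class LogLevelGate
{
    private static final Logger logger = Logger.getLogger("MerkleTree");

    public static void main(String[] args)
    {
        logger.setLevel(Level.FINE); // DEBUG-equivalent stays enabled by default

        // After the demotion, the hot-path message only fires at the finer level,
        // and the guard also avoids building the large range/hash strings at all.
        if (logger.isLoggable(Level.FINEST)) // TRACE-equivalent
            logger.finest("(10) Fully inconsistent range [...]");

        System.out.println("trace enabled: " + logger.isLoggable(Level.FINEST));
    }
}
```

With parameterized slf4j calls the formatting cost is already deferred, but the argument objects (tree nodes with long hash strings) are still constructed per call, which is why demoting the level matters on this hot path.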
[jira] [Commented] (CASSANDRA-16406) Debug logging affects repair performance
[ https://issues.apache.org/jira/browse/CASSANDRA-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17271317#comment-17271317 ] Alexander Dejanovski commented on CASSANDRA-16406: -- [~spod], it seems like you added most of these debug log statements. Are you ok with me moving them to TRACE level? > Debug logging affects repair performance > > > Key: CASSANDRA-16406 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16406 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Repair >Reporter: Alexander Dejanovski >Assignee: Alexander Dejanovski >Priority: Normal > Attachments: with_debug_logging.png, without_debug_logging.png > > > While working on the repair quality testing in CASSANDRA-16245, it appeared > that the node coordinating repairs on a 20GB per node dataset was generating > more than 2G of log with a total duration for the incremental repair > scenarios of ~2h40m: > https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/37/workflows/6a7a41c8-0fca-4080-b37e-3b38998b3fab/jobs/49/steps > ] > !with_debug_logging.png! 
> The logs showed a lot of messages from the MerkleTree class at high pace: > {noformat} > DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) > Fully inconsistent range [# (-6738651767434294419,-6738622715859497972] depth=11>, # (-6738622715859497972,-6738593664284701525] depth=11>] > DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (9) > Inconsistent digest on right sub-range # (-6738593664284701525,-6738535561135108630] depth=10>: [# -6738564612709905078 > hash=[b8efd3d684474276f316b1bc9f23b463cda4f8d620a4b5cf4d2c7dbb101bbe85] > children=[# [fff431a30da07558c5106897da9f3fc3bcc598fb59e67c6959f4e70b6d3a7722]> > # [471be27589e7372e3606d92b45bc8ba07161602d7942c9a614d89ab07d21c9a7]>]>, > # hash=[95334327a0b50b7d3c51d6588a5f3d57701e0b978f6ab28e85cda3cb5a094eb5] > children=[# [0d2c5c21f6b04b098f71f43d0f3f3a5a823b372eb850261774847c456629bedd]> > # [981f1f0656054074b32022658560070df2253cb9373a9499f149df8e3c20f068]>]>] > DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:262 - (10) > Hashing sub-ranges [# depth=11>, #] > for # divided > by midpoint -6738564612709905078 > DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:272 - (10) > Inconsistent digest on left sub-range # (-6738593664284701525,-6738564612709905078] depth=11>: [# [fff431a30da07558c5106897da9f3fc3bcc598fb59e67c6959f4e70b6d3a7722]>, > # [0d2c5c21f6b04b098f71f43d0f3f3a5a823b372eb850261774847c456629bedd]>] > DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (10) > Inconsistent digest on right sub-range # (-6738564612709905078,-6738535561135108630] depth=11>: [# [471be27589e7372e3606d92b45bc8ba07161602d7942c9a614d89ab07d21c9a7]>, > # [981f1f0656054074b32022658560070df2253cb9373a9499f149df8e3c20f068]>] > DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) > Fully inconsistent range [# (-6738593664284701525,-6738564612709905078] depth=11>, # (-6738564612709905078,-6738535561135108630] depth=11>] > 
DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (9) > Fully inconsistent range [# (-6738651767434294419,-6738593664284701525] depth=10>, # (-6738593664284701525,-6738535561135108630] depth=10>] > DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (8) > Inconsistent digest on right sub-range # (-6738535561135108630,-6738419354835922841] depth=9>: [# -6738477457985515736 > hash=[806ede1a35783bf5fafe8b8ccefe4d3ff48e8f0e1314f8a9ce4b23f13fed4bf9] > children=[# hash=[e6d133afbd8041266f8a1cfe456ff07c9e7debe8ff54279b579b252f2af78b6b] > children=[#]> # hash=[66bfedb588f87ad3957497728b91bd436af364e6ec40df3299d006de151ac092] > children=[#]>]>, # hash=[bf128431ddf7ad72417ce853f90abd0bc7a2a38088d5fbec864c02c516ed6458] > children=[# hash=[12a71b62ceb096369c4797f56edd91a16dd9803348b4f4cfe1454027d7dae3d2] > children=[#]> # hash=[adb59f5313473b44dd3b7fa697d72c7b23b3c0610f23670942e2c137878a] > children=[#]>]>]{noformat} > When disabling debug logging, the duration dropped to ~2h05m with decent log > sizes: > [https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/38/workflows/c91e6ceb-438b-4f00-b38a-d670f9afb4c3/jobs/51] > !without_debug_logging.png! > There's apparently too much logging for each inconsistency found in the > Merkle tree comparisons and we should move this to TRACE level if we still > want to allow debug logging to be turned on by default. > I'll prepare a patch for the MerkleTree class and run the repair testing > scenarios again to verify their duration. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To
[jira] [Updated] (CASSANDRA-16406) Debug logging affects repair performance
[ https://issues.apache.org/jira/browse/CASSANDRA-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski updated CASSANDRA-16406: - Description: While working on the repair quality testing in CASSANDRA-16245, it appeared that the node coordinating repairs on a 20GB per node dataset was generating more than 2G of log with a total duration for the incremental repair scenarios of ~2h40m: https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/37/workflows/6a7a41c8-0fca-4080-b37e-3b38998b3fab/jobs/49/steps ] !with_debug_logging.png! The logs showed a lot of messages from the MerkleTree class at high pace: {noformat} DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) Fully inconsistent range [#, #] DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (9) Inconsistent digest on right sub-range #: [# #]>, # #]>] DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:262 - (10) Hashing sub-ranges [#, #] for # divided by midpoint -6738564612709905078 DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:272 - (10) Inconsistent digest on left sub-range #: [#, #] DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (10) Inconsistent digest on right sub-range #: [#, #] DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) Fully inconsistent range [#, #] DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (9) Fully inconsistent range [#, #] DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (8) Inconsistent digest on right sub-range #: [# #]>, # #]>]{noformat} When disabling debug logging, the duration dropped to ~2h05m with decent log sizes: [https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/38/workflows/c91e6ceb-438b-4f00-b38a-d670f9afb4c3/jobs/51] !without_debug_logging.png! 
There's apparently too much logging for each inconsistency found in the Merkle tree comparisons and we should move this to TRACE level if we still want to allow debug logging to be turned on by default. I'll prepare a patch for the MerkleTree class and run the repair testing scenarios again to verify their duration. was: While working on the repair quality testing in CASSANDRA-16245, it appeared that the node coordinating repairs on a 20GB per node dataset was generating more than 2G of log with a total duration for the incremental repair scenarios of ~2h40m: [https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/37/workflows/6a7a41c8-0fca-4080-b37e-3b38998b3fab/jobs/49/steps ] !with_debug_logging.png! The logs showed a lot of messages from the MerkleTree class at high pace: {noformat} DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) Fully inconsistent range [#, #] DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (9) Inconsistent digest on right sub-range #: [# #]>, # #]>] DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:262 - (10) Hashing sub-ranges [#, #] for # divided by midpoint -6738564612709905078 DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:272 - (10) Inconsistent digest on left sub-range #: [#, #] DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (10) Inconsistent digest on right sub-range #: [#, #] DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) Fully inconsistent range [#, #] DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (9) Fully inconsistent range [#, #] DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (8) Inconsistent digest on right sub-range #: [# #]>, # #]>]{noformat} When disabling debug logging, the duration dropped to ~2h05m with decent log sizes: 
[https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/38/workflows/c91e6ceb-438b-4f00-b38a-d670f9afb4c3/jobs/51] !without_debug_logging.png! There's apparently too much logging for each inconsistency found in the Merkle tree comparisons and we should move this to TRACE level if we still want to allow debug logging to be turned on by default. I'll prepare a patch for the MerkleTree class and run the repair testing scenarios again to verify their duration. > Debug logging affects repair performance > > > Key: CASSANDRA-16406 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16406 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Repair >Reporter: Alexander Dejanovski >Assignee: Alexander Dejanovski >Priority: Normal > Attachments: with_debug_logging.png, without_debug_logging.png > > > While working on the repair quality testing in CASSANDRA-16245, it appeared > that the node coordinating repairs on a 20GB
[jira] [Created] (CASSANDRA-16406) Debug logging affects repair performance
Alexander Dejanovski created CASSANDRA-16406: Summary: Debug logging affects repair performance Key: CASSANDRA-16406 URL: https://issues.apache.org/jira/browse/CASSANDRA-16406 Project: Cassandra Issue Type: Bug Components: Consistency/Repair Reporter: Alexander Dejanovski Assignee: Alexander Dejanovski Attachments: with_debug_logging.png, without_debug_logging.png While working on the repair quality testing in CASSANDRA-16245, it appeared that the node coordinating repairs on a 20GB per node dataset was generating more than 2G of log with a total duration for the incremental repair scenarios of ~2h40m: [https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/37/workflows/6a7a41c8-0fca-4080-b37e-3b38998b3fab/jobs/49/steps ] !with_debug_logging.png! The logs showed a lot of messages from the MerkleTree class at high pace: {noformat} DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) Fully inconsistent range [#, #] DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (9) Inconsistent digest on right sub-range #: [# #]>, # #]>] DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:262 - (10) Hashing sub-ranges [#, #] for # divided by midpoint -6738564612709905078 DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:272 - (10) Inconsistent digest on left sub-range #: [#, #] DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (10) Inconsistent digest on right sub-range #: [#, #] DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) Fully inconsistent range [#, #] DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (9) Fully inconsistent range [#, #] DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (8) Inconsistent digest on right sub-range #: [# #]>, # #]>]{noformat} When disabling debug logging, the duration dropped to ~2h05m with decent log sizes: 
[https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/38/workflows/c91e6ceb-438b-4f00-b38a-d670f9afb4c3/jobs/51] !without_debug_logging.png! There's apparently too much logging for each inconsistency found in the Merkle tree comparisons and we should move this to TRACE level if we still want to allow debug logging to be turned on by default. I'll prepare a patch for the MerkleTree class and run the repair testing scenarios again to verify their duration.
[jira] [Updated] (CASSANDRA-16245) Implement repair quality test scenarios
[ https://issues.apache.org/jira/browse/CASSANDRA-16245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski updated CASSANDRA-16245:

Description:
Implement the following test scenarios in a new test suite for repair integration testing with significant load:

Generate/restore a workload of ~100GB per node. Medusa should be considered to create the initial backup, which could then be restored from an S3 bucket to speed up node population. Data should on purpose require repair and be generated accordingly.

Perform repairs on a 3-node cluster with 4 cores each and 16GB-32GB RAM (m5d.xlarge instances would be the most cost-efficient type). Repaired keyspaces will use RF=3, or RF=2 in some cases (the latter is for subranges with different sets of replicas).

||Mode||Version||Settings||Checks||
|Full repair|trunk|Sequential + All token ranges|"No anticompaction (repairedAt==0) Out of sync ranges > 0 Subsequent run must show no out of sync range"|
|Full repair|trunk|Parallel + Primary range|"No anticompaction (repairedAt==0) Out of sync ranges > 0 Subsequent run must show no out of sync range"|
|Full repair|trunk|Force terminate repair shortly after it was triggered|Repair threads must be cleaned up|
|Subrange repair|trunk|Sequential + single token range|"No anticompaction (repairedAt==0) Out of sync ranges > 0 Subsequent run must show no out of sync range"|
|Subrange repair|trunk|Parallel + 10 token ranges which have the same replicas|"No anticompaction (repairedAt == 0) Out of sync ranges > 0 Subsequent run must show no out of sync range A single repair session will handle all subranges at once"|
|Subrange repair|trunk|Parallel + 10 token ranges which have different replicas|"No anticompaction (repairedAt==0) Out of sync ranges > 0 Subsequent run must show no out of sync range More than one repair session is triggered to process all subranges"|
|Incremental repair|trunk|"Parallel (mandatory) No compaction during repair"|"Anticompaction status (repairedAt != 0) on all SSTables No pending repair on SSTables after completion (could require to wait a bit as this will happen asynchronously) Out of sync ranges > 0 + Subsequent run must show no out of sync range"|
|Incremental repair|trunk|"Parallel (mandatory) Major compaction triggered during repair"|"Anticompaction status (repairedAt != 0) on all SSTables No pending repair on SSTables after completion (could require to wait a bit as this will happen asynchronously) Out of sync ranges > 0 + Subsequent run must show no out of sync range"|
|Incremental repair|trunk|Force terminate repair shortly after it was triggered.|Repair threads must be cleaned up|

was:
Implement the following test scenarios in a new test suite for repair integration testing with significant load:

Generate/restore a workload of ~100GB per node. Medusa should be considered to create the initial backup, which could then be restored from an S3 bucket to speed up node population. Data should on purpose require repair and be generated accordingly.

Perform repairs on a 3-node cluster with 4 cores each and 16GB-32GB RAM (m5d.xlarge instances would be the most cost-efficient type). Repaired keyspaces will use RF=3, or RF=2 in some cases (the latter is for subranges with different sets of replicas).

||Mode||Version||Settings||Checks||
|Full repair|trunk|Sequential + All token ranges|"No anticompaction (repairedAt==0) Out of sync ranges > 0 Subsequent run must show no out of sync range"|
|Full repair|trunk|Parallel + Primary range|"No anticompaction (repairedAt==0) Out of sync ranges > 0 Subsequent run must show no out of sync range"|
|Full repair|trunk|Force terminate repair shortly after it was triggered|Repair threads must be cleaned up|
|Subrange repair|trunk|Sequential + single token range|"No anticompaction (repairedAt==0) Out of sync ranges > 0 Subsequent run must show no out of sync range"|
|Subrange repair|trunk|Parallel + 10 token ranges which have the same replicas|"No anticompaction (repairedAt == 0) Out of sync ranges > 0 Subsequent run must show no out of sync range A single repair session will handle all subranges at once"|
|Subrange repair|trunk|Parallel + 10 token ranges which have different replicas|"No anticompaction (repairedAt==0) Out of sync ranges > 0 Subsequent run must show no out of sync range More than one repair session is triggered to process all subranges"|
|Subrange repair|trunk|"Single token range. Force terminate repair shortly after it was triggered."|Repair threads must be cleaned up|
|Incremental repair|trunk|"Parallel (mandatory) No compaction during repair"|"Anticompaction status (repairedAt != 0) on all SSTables No pending repair on SSTables after completion (could require to wait a bit as this will happen asynchronously) Out of sync ranges > 0 + Subsequent run must show no out of sync range"|
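The "No anticompaction (repairedAt==0)" checks in the scenarios above can be automated by running {{sstablemetadata}} against each SSTable and parsing its output. A minimal sketch of the parsing side, assuming the tool prints a line of the form "Repaired at: <value>" (the exact label may vary between Cassandra versions, and the sample output string below is illustrative, not taken from the test suite):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepairedAtCheck {
    // Matches lines like "Repaired at: 0" in sstablemetadata output.
    private static final Pattern REPAIRED_AT = Pattern.compile("Repaired at:\\s*(\\d+)");

    static long parseRepairedAt(String sstablemetadataOutput) {
        Matcher m = REPAIRED_AT.matcher(sstablemetadataOutput);
        if (!m.find())
            throw new IllegalArgumentException("no 'Repaired at' line found");
        return Long.parseLong(m.group(1));
    }

    public static void main(String[] args) {
        // Hypothetical output fragment for an SSTable a full repair must not anticompact.
        String sample = "SSTable: md-1-big\nRepaired at: 0\nPending repair: --";
        // Full repair check: repairedAt must still be 0; incremental repair
        // would instead assert a non-zero value on all SSTables.
        System.out.println(parseRepairedAt(sample) == 0);
    }
}
```

The same helper can be pointed at real output by shelling out to {{sstablemetadata <path>-Data.db}} on each node and failing the scenario on the first non-matching SSTable.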
[jira] [Commented] (CASSANDRA-16244) Create a jvm upgrade dtest for mixed versions repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-16244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269392#comment-17269392 ] Alexander Dejanovski commented on CASSANDRA-16244: -- Thanks for picking this up [~adelapena]! :) > Create a jvm upgrade dtest for mixed versions repairs > - > > Key: CASSANDRA-16244 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16244 > Project: Cassandra > Issue Type: Task > Components: Test/dtest/java >Reporter: Alexander Dejanovski >Assignee: Andres de la Peña >Priority: Normal > Fix For: 4.0-rc > > > Repair during upgrades should fail on mixed version clusters. > We'd need an in-jvm upgrade dtest to check that repair indeed fails as > expected with mixed current version+previous major version clusters. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-16362) SSLFactory should initialize SSLContext before setting protocols
[ https://issues.apache.org/jira/browse/CASSANDRA-16362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268819#comment-17268819 ] Alexander Dejanovski commented on CASSANDRA-16362: -- It's our fault actually. Please use the following branch of Medusa to be able to use forks as Cassandra base for the ccm clusters: https://github.com/thelastpickle/cassandra-medusa/tree/alex/CASSANDRA-16362 The master branch will only accept {{github:apache/...}} versions. > SSLFactory should initialize SSLContext before setting protocols > > > Key: CASSANDRA-16362 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16362 > Project: Cassandra > Issue Type: Bug > Components: Tool/bulk load >Reporter: Erik Merkle >Assignee: Jon Meredith >Priority: Normal > Fix For: 4.0-beta5 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Trying to use sstableloader from the latest trunk produced the following > Exception: > {quote} > Exception in thread "main" java.lang.RuntimeException: Could not create SSL > Context. > at > org.apache.cassandra.tools.BulkLoader.buildSSLOptions(BulkLoader.java:261) > at org.apache.cassandra.tools.BulkLoader.load(BulkLoader.java:64) > at org.apache.cassandra.tools.BulkLoader.main(BulkLoader.java:49) > Caused by: java.io.IOException: Error creating/initializing the SSL Context > at > org.apache.cassandra.security.SSLFactory.createSSLContext(SSLFactory.java:184) > at > org.apache.cassandra.tools.BulkLoader.buildSSLOptions(BulkLoader.java:257) > ... 2 more > Caused by: java.lang.IllegalStateException: SSLContext is not initialized > at > sun.security.ssl.SSLContextImpl.engineGetSocketFactory(SSLContextImpl.java:208) > at javax.net.ssl.SSLContextSpi.getDefaultSocket(SSLContextSpi.java:158) > at > javax.net.ssl.SSLContextSpi.engineGetDefaultSSLParameters(SSLContextSpi.java:184) > at javax.net.ssl.SSLContext.getDefaultSSLParameters(SSLContext.java:435) > at > org.apache.cassandra.security.SSLFactory.createSSLContext(SSLFactory.java:178) > ... 
3 more > {quote} > I believe this is because of a change to SSLFactory for CASSANDRA-13325 here: > [https://github.com/apache/cassandra/commit/919a8964a83511d96766c3e53ba603e77bca626c#diff-0d569398cfd58566fc56bfb80c971a72afe3f392addc2df731a0b44baf29019eR177-R178] > > I think the solution is to call {{ctx.init()}} before trying to call > {{ctx.getDefaultSSLParameters()}}, essentially swapping the two lines in the > link above. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
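The ordering bug described in the quoted ticket can be reproduced in isolation against the plain JDK API. A standalone sketch (not the Cassandra patch itself) showing that {{getDefaultSSLParameters()}} fails before {{init()}} and succeeds after it:

```java
import javax.net.ssl.SSLContext;

public class SslContextInitOrder {
    public static void main(String[] args) throws Exception {
        SSLContext ctx = SSLContext.getInstance("TLS");
        try {
            // Querying parameters before init() reproduces the reported failure.
            ctx.getDefaultSSLParameters();
        } catch (IllegalStateException e) {
            System.out.println("before init(): " + e.getMessage());
        }
        // null managers fall back to the JDK defaults.
        ctx.init(null, null, null);
        // After init(), the same call succeeds.
        System.out.println("after init(): "
                + ctx.getDefaultSSLParameters().getProtocols().length + " protocols enabled");
    }
}
```

This matches the proposed fix: perform the {{ctx.init()}} call first, then read the default parameters.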
[jira] [Commented] (CASSANDRA-16362) SSLFactory should initialize SSLContext before setting protocols
[ https://issues.apache.org/jira/browse/CASSANDRA-16362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17268732#comment-17268732 ] Alexander Dejanovski commented on CASSANDRA-16362: -- Hi [~jmeredithco], very sorry for not responding earlier, I'm heads down on 4.0 repair quality testing at the moment. A colleague of mine is working on giving you steps to reproduce the issue with ccm and will comment here soon with the instructions.

For Medusa integration tests, there were issues with the sstableloader test (scenario 11), which were fixed by CASSANDRA-16280. I managed to get scenario 11 passing with beta3:

{code:java}
(py36) adejanovski@mac-alex-2 cassandra-medusa % ./run_integration_tests.sh -t 11 --cassandra-version=4.0-beta3
...
...
@11 @local
Scenario Outline: Perform a backup, and restore it using the sstableloader -- @1.1 Local storage # integration/features/integration_tests.feature:450
  Given I have a fresh ccm cluster "with_client_encryption" running named "scenario11" # features/steps/integration_steps.py:125
  Given I have a fresh ccm cluster "with_client_encryption" running named "scenario11" # features/steps/integration_steps.py:125 22.497s
  Given I am using "local" as storage provider in ccm cluster "with_client_encryption" # features/steps/integration_steps.py:235 0.052s
  When I create the "test" table in keyspace "medusa" # features/steps/integration_steps.py:511 0.122s
  When I load 100 rows in the "medusa.test" table # features/steps/integration_steps.py:534 0.192s
  When I run a "ccm node1 nodetool flush" command # features/steps/integration_steps.py:542 1.508s
  When I load 100 rows in the "medusa.test" table # features/steps/integration_steps.py:534 0.160s
  When I run a "ccm node1 nodetool flush" command # features/steps/integration_steps.py:542 1.445s
  When I perform a backup in "full" mode of the node named "first_backup" # features/steps/integration_steps.py:547 3.208s
  Then I can see the backup named "first_backup" when I list the backups # features/steps/integration_steps.py:591 0.014s
  Then I can verify the backup named "first_backup" successfully # features/steps/integration_steps.py:655 0.029s
  When I load 100 rows in the "medusa.test" table # features/steps/integration_steps.py:534 0.135s
  When I run a "ccm node1 nodetool flush" command # features/steps/integration_steps.py:542 1.373s
  Then I have 300 rows in the "medusa.test" table in ccm cluster "with_client_encryption" # features/steps/integration_steps.py:766 0.119s
  When I truncate the "medusa.test" table in ccm cluster "with_client_encryption" # features/steps/integration_steps.py:1040 0.167s
  When I restore the backup named "first_backup" with the sstableloader # features/steps/integration_steps.py:734 20.214s
  Then I have 200 rows in the "medusa.test" table in ccm cluster "with_client_encryption" # features/steps/integration_steps.py:766 0.079s
...
...
1 feature passed, 0 failed, 0 skipped
1 scenario passed, 0 failed, 59 skipped
16 steps passed, 0 failed, 1039 skipped, 0 undefined
Took 0m51.312s
{code}

But it fails with beta4 due to the issue reported in this very ticket:

{noformat}
./run_integration_tests.sh -t 11 --cassandra-version=4.0-beta4
...
...
@11 @local
Scenario Outline: Perform a backup, and restore it using the sstableloader -- @1.1 Local storage # integration/features/integration_tests.feature:450
  Given I have a fresh ccm cluster "with_client_encryption" running named "scenario11" # features/steps/integration_steps.py:125
  Given I have a fresh ccm cluster "with_client_encryption" running named "scenario11" # features/steps/integration_steps.py:125 22.836s
  Given I am using "local" as storage provider in ccm cluster "with_client_encryption" # features/steps/integration_steps.py:235 0.053s
  When I create the "test" table in keyspace "medusa" # features/steps/integration_steps.py:511 0.113s
  When I load 100 rows in the "medusa.test" table # features/steps/integration_steps.py:534 0.229s
  When I run a "ccm node1 nodetool flush" command # features/steps/integration_steps.py:542 1.424s
  When I load 100 rows in the "medusa.test" table
[jira] [Commented] (CASSANDRA-16362) SSLFactory should initialize SSLContext before setting protocols
[ https://issues.apache.org/jira/browse/CASSANDRA-16362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262634#comment-17262634 ] Alexander Dejanovski commented on CASSANDRA-16362: -- [~jmeredithco], it's now failing earlier in the process as I can't get the Python CQL driver to connect to Cassandra when encryption is turned on. I've tried using TLS 1.1 and TLS 1.2 with the same result... Previously we were failing later in the process as we could connect using the Python driver but failed at using the sstableloader. Any idea of what could be preventing us from connecting with TLS 1.1 and 1.2 when using the python driver? Our implementation follows what's described in [this driver documentation page|https://docs.datastax.com/en/developer/python-driver/3.24/security/#ssl-configuration-examples]. > SSLFactory should initialize SSLContext before setting protocols > > > Key: CASSANDRA-16362 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16362 > Project: Cassandra > Issue Type: Bug > Components: Tool/bulk load >Reporter: Erik Merkle >Assignee: Jon Meredith >Priority: Normal > Fix For: 4.0-beta5 > > Time Spent: 50m > Remaining Estimate: 0h > > Trying to use sstableloader from the latest trunk produced the following > Exception: > {quote} > Exception in thread "main" java.lang.RuntimeException: Could not create SSL > Context. > at > org.apache.cassandra.tools.BulkLoader.buildSSLOptions(BulkLoader.java:261) > at org.apache.cassandra.tools.BulkLoader.load(BulkLoader.java:64) > at org.apache.cassandra.tools.BulkLoader.main(BulkLoader.java:49) > Caused by: java.io.IOException: Error creating/initializing the SSL Context > at > org.apache.cassandra.security.SSLFactory.createSSLContext(SSLFactory.java:184) > at > org.apache.cassandra.tools.BulkLoader.buildSSLOptions(BulkLoader.java:257) > ... 
2 more > Caused by: java.lang.IllegalStateException: SSLContext is not initialized > at > sun.security.ssl.SSLContextImpl.engineGetSocketFactory(SSLContextImpl.java:208) > at javax.net.ssl.SSLContextSpi.getDefaultSocket(SSLContextSpi.java:158) > at > javax.net.ssl.SSLContextSpi.engineGetDefaultSSLParameters(SSLContextSpi.java:184) > at javax.net.ssl.SSLContext.getDefaultSSLParameters(SSLContext.java:435) > at > org.apache.cassandra.security.SSLFactory.createSSLContext(SSLFactory.java:178) > ... 3 more > {quote} > I believe this is because of a change to SSLFactory for CASSANDRA-13325 here: > [https://github.com/apache/cassandra/commit/919a8964a83511d96766c3e53ba603e77bca626c#diff-0d569398cfd58566fc56bfb80c971a72afe3f392addc2df731a0b44baf29019eR177-R178] > > I think the solution is to call {{ctx.init()}} before trying to call > {{ctx.getDefaultSSLParameters()}}, essentially swapping the two lines in the > link above. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-16362) SSLFactory should initialize SSLContext before setting protocols
[ https://issues.apache.org/jira/browse/CASSANDRA-16362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17261431#comment-17261431 ] Alexander Dejanovski commented on CASSANDRA-16362: -- Sure thing, I'll run a test ASAP. > SSLFactory should initialize SSLContext before setting protocols > > > Key: CASSANDRA-16362 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16362 > Project: Cassandra > Issue Type: Bug > Components: Tool/bulk load >Reporter: Erik Merkle >Assignee: Jon Meredith >Priority: Normal > Fix For: 4.0-beta5 > > Time Spent: 50m > Remaining Estimate: 0h > > Trying to use sstableloader from the latest trunk produced the following > Exception: > {quote} > Exception in thread "main" java.lang.RuntimeException: Could not create SSL > Context. > at > org.apache.cassandra.tools.BulkLoader.buildSSLOptions(BulkLoader.java:261) > at org.apache.cassandra.tools.BulkLoader.load(BulkLoader.java:64) > at org.apache.cassandra.tools.BulkLoader.main(BulkLoader.java:49) > Caused by: java.io.IOException: Error creating/initializing the SSL Context > at > org.apache.cassandra.security.SSLFactory.createSSLContext(SSLFactory.java:184) > at > org.apache.cassandra.tools.BulkLoader.buildSSLOptions(BulkLoader.java:257) > ... 2 more > Caused by: java.lang.IllegalStateException: SSLContext is not initialized > at > sun.security.ssl.SSLContextImpl.engineGetSocketFactory(SSLContextImpl.java:208) > at javax.net.ssl.SSLContextSpi.getDefaultSocket(SSLContextSpi.java:158) > at > javax.net.ssl.SSLContextSpi.engineGetDefaultSSLParameters(SSLContextSpi.java:184) > at javax.net.ssl.SSLContext.getDefaultSSLParameters(SSLContext.java:435) > at > org.apache.cassandra.security.SSLFactory.createSSLContext(SSLFactory.java:178) > ... 
3 more > {quote} > I believe this is because of a change to SSLFactory for CASSANDRA-13325 here: > [https://github.com/apache/cassandra/commit/919a8964a83511d96766c3e53ba603e77bca626c#diff-0d569398cfd58566fc56bfb80c971a72afe3f392addc2df731a0b44baf29019eR177-R178] > > I think the solution is to call {{ctx.init()}} before trying to call > {{ctx.getDefaultSSLParameters()}}, essentially swapping the two lines in the > link above. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-16362) SSLFactory should initialize SSLContext before setting protocols
[ https://issues.apache.org/jira/browse/CASSANDRA-16362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253347#comment-17253347 ] Alexander Dejanovski commented on CASSANDRA-16362: -- Hi [~jmeredithco], thanks for issuing a patch. I tested it with Medusa's integration tests and now get the following error:

{noformat}
WARN 09:57:44,993 Failed to initialize a channel. Closing: [id: 0x61e6eef5]
java.lang.IllegalArgumentException: TLSv1.3
at sun.security.ssl.ProtocolVersion.valueOf(ProtocolVersion.java:187)
at sun.security.ssl.ProtocolList.convert(ProtocolList.java:84)
at sun.security.ssl.ProtocolList.<init>(ProtocolList.java:52)
at sun.security.ssl.SSLEngineImpl.setEnabledProtocols(SSLEngineImpl.java:2081)
at org.apache.cassandra.tools.BulkLoader$1.newSSLEngine(BulkLoader.java:276)
at com.datastax.driver.core.RemoteEndpointAwareJdkSSLOptions.newSSLHandler(RemoteEndpointAwareJdkSSLOptions.java:62)
at com.datastax.driver.core.Connection$Initializer.initChannel(Connection.java:1700)
at com.datastax.driver.core.Connection$Initializer.initChannel(Connection.java:1644)
at com.datastax.shaded.netty.channel.ChannelInitializer.initChannel(ChannelInitializer.java:113)
at com.datastax.shaded.netty.channel.ChannelInitializer.handlerAdded(ChannelInitializer.java:105)
at com.datastax.shaded.netty.channel.DefaultChannelPipeline.callHandlerAdded0(DefaultChannelPipeline.java:593)
at com.datastax.shaded.netty.channel.DefaultChannelPipeline.access$000(DefaultChannelPipeline.java:44)
at com.datastax.shaded.netty.channel.DefaultChannelPipeline$PendingHandlerAddedTask.execute(DefaultChannelPipeline.java:1357)
at com.datastax.shaded.netty.channel.DefaultChannelPipeline.callHandlerAddedForAllHandlers(DefaultChannelPipeline.java:1092)
at com.datastax.shaded.netty.channel.DefaultChannelPipeline.invokeHandlerAddedIfNeeded(DefaultChannelPipeline.java:642)
at com.datastax.shaded.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:456)
at com.datastax.shaded.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:378)
at com.datastax.shaded.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:428)
at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:464)
at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.TransportException: [localhost/127.0.0.1:9042] Cannot connect))
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.TransportException: [localhost/127.0.0.1:9042] Cannot connect))
at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:268)
at com.datastax.driver.core.ControlConnection.connect(ControlConnection.java:107)
at com.datastax.driver.core.Cluster$Manager.negotiateProtocolVersionAndConnect(Cluster.java:1813)
at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:1726)
at com.datastax.driver.core.Cluster.init(Cluster.java:214)
at com.datastax.driver.core.Cluster.connectAsync(Cluster.java:387)
at com.datastax.driver.core.Cluster.connectAsync(Cluster.java:366)
at com.datastax.driver.core.Cluster.connect(Cluster.java:311)
at org.apache.cassandra.utils.NativeSSTableLoaderClient.init(NativeSSTableLoaderClient.java:75)
at org.apache.cassandra.io.sstable.SSTableLoader.stream(SSTableLoader.java:183)
at org.apache.cassandra.tools.BulkLoader.load(BulkLoader.java:79)
at org.apache.cassandra.tools.BulkLoader.main(BulkLoader.java:51)
{noformat}

Here's the sstableloader command that is being issued:

{noformat}
subprocess.CalledProcessError: Command '['/Users/adejanovski/.ccm/repository/githubCOLONjonmeredithSLASHC16362/bin/sstableloader', '-d', '127.0.0.1', '--conf-path', '/Users/adejanovski/.ccm/scenario11/node1/conf/cassandra.yaml', '--username', 'cassandra', '--password', 'cassandra', '--no-progress', '/tmp/medusa-restore-97ec3e11-426a-4924-8bc0-379e99ff2205/system_distributed/repair_history-759fffad624b318180eefa9a52d1f627', '-ts', '/Users/adejanovski/projets/cassandra/thelastpickle/cassandra-medusa/tests/resources/local_with_ssl/generic-server-truststore.jks', '-tspw', 'truststorePass1',
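The {{IllegalArgumentException: TLSv1.3}} above comes from asking the JDK to enable a protocol name it does not know (TLS 1.3 only reached JDK 8 in later updates). One defensive approach, shown here as a sketch rather than the actual fix under review, is to intersect the requested protocol names with what the runtime's SSLEngine supports before calling {{setEnabledProtocols}}:

```java
import java.util.Arrays;
import java.util.List;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;

public class SupportedProtocolFilter {
    // Keep only the protocol names this runtime's SSLEngine knows about, so
    // setEnabledProtocols() never sees a name the JDK rejects (e.g. TLSv1.3
    // on an older JDK 8).
    static String[] filterSupported(SSLEngine engine, String[] requested) {
        List<String> supported = Arrays.asList(engine.getSupportedProtocols());
        return Arrays.stream(requested)
                     .filter(supported::contains)
                     .toArray(String[]::new);
    }

    public static void main(String[] args) throws Exception {
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, null, null); // JDK default key/trust managers
        SSLEngine engine = ctx.createSSLEngine();
        String[] wanted = { "TLSv1.2", "TLSv1.3" };
        // Passing wanted unfiltered would throw on a JDK without TLSv1.3;
        // the filtered list is always accepted.
        engine.setEnabledProtocols(filterSupported(engine, wanted));
        System.out.println(Arrays.toString(engine.getEnabledProtocols()));
    }
}
```

The trade-off is that a silently dropped protocol can mask a misconfiguration, so logging what was filtered out would be advisable.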
[jira] [Commented] (CASSANDRA-15580) 4.0 quality testing: Repair
[ https://issues.apache.org/jira/browse/CASSANDRA-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247255#comment-17247255 ] Alexander Dejanovski commented on CASSANDRA-15580: -- Thanks for the feedback [~vinaykumarcse]! We made some progress in CASSANDRA-16245 with an implementation of the test scenarios using Cucumber, and ccm to spin up a cluster. We're in the process of wiring it up with tlp-cluster to work on an actual AWS cluster instead for 100 GB per node density testing. Hopefully we'll have a first fully running version next week. > 4.0 quality testing: Repair > --- > > Key: CASSANDRA-15580 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15580 > Project: Cassandra > Issue Type: Task > Components: Test/dtest/python >Reporter: Josh McKenzie >Assignee: Alexander Dejanovski >Priority: Normal > Fix For: 4.0-rc > > > Reference [doc from > NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#] > for context. > *Shepherd: Alexander Dejanovski* > We aim for 4.0 to have the first fully functioning incremental repair > solution (CASSANDRA-9143)! Furthermore we aim to verify that all types of > repair: (full range, sub range, incremental) function as expected as well as > ensuring community tools such as Reaper work. CASSANDRA-3200 adds an > experimental option to reduce the amount of data streamed during repair, we > should write more tests and see how it works with big nodes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-16245) Implement repair quality test scenarios
[ https://issues.apache.org/jira/browse/CASSANDRA-16245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247249#comment-17247249 ] Alexander Dejanovski commented on CASSANDRA-16245: -- Hi [~zvo], Awesome stuff so far! I've pushed a GitHub Actions workflow which spins up/tears down a 3 node cluster in AWS using m5ad.xlarge instances (4 vCPUs, 16G RAM and 150GB of direct attached storage). They provide a 140GB SSD drive which is mounted as {{/var/lib/cassandra}} by tlp-cluster. Let's start with a dataset of 100GB per node for our testing, which should be good enough for now. The test suite needs to be adjusted to target the "real" cluster instead of a ccm one, and tlp-cluster provides environment variables with each node's public IP in the {{env.sh}} file ({{source env.sh}} sets the variables along with the other tlp-cluster aliases). Could you rename the branch you're working on to {{CASSANDRA-16245}}? Let me know if you have what you need to move this forward. > Implement repair quality test scenarios > --- > > Key: CASSANDRA-16245 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16245 > Project: Cassandra > Issue Type: Task > Components: Test/dtest/java >Reporter: Alexander Dejanovski >Assignee: Radovan Zvoncek >Priority: Normal > Fix For: 4.0-rc > > > Implement the following test scenarios in a new test suite for repair > integration testing with significant load: > Generate/restore a workload of ~100GB per node. Medusa should be considered > to create the initial backup which could then be restored from an S3 bucket > to speed up node population. > Data should on purpose require repair and be generated accordingly. > Perform repairs for a 3 nodes cluster with 4 cores each and 16GB-32GB RAM > (m5d.xlarge instances would be the most cost efficient type). > Repaired keyspaces will use RF=3 or RF=2 in some cases (the latter is for > subranges with different sets of replicas). 
> ||Mode||Version||Settings||Checks|| > |Full repair|trunk|Sequential + All token ranges|"No anticompaction > (repairedAt==0) > Out of sync ranges > 0 > Subsequent run must show no out of sync range"| > |Full repair|trunk|Parallel + Primary range|"No anticompaction (repairedAt==0) > Out of sync ranges > 0 > Subsequent run must show no out of sync range"| > |Full repair|trunk|Force terminate repair shortly after it was > triggered|Repair threads must be cleaned up| > |Subrange repair|trunk|Sequential + single token range|"No anticompaction > (repairedAt==0) > Out of sync ranges > 0 > Subsequent run must show no out of sync range"| > |Subrange repair|trunk|Parallel + 10 token ranges which have the same > replicas|"No anticompaction (repairedAt == 0) > Out of sync ranges > 0 > Subsequent run must show no out of sync range > A single repair session will handle all subranges at once"| > |Subrange repair|trunk|Parallel + 10 token ranges which have different > replicas|"No anticompaction (repairedAt==0) > Out of sync ranges > 0 > Subsequent run must show no out of sync range > More than one repair session is triggered to process all subranges"| > |Subrange repair|trunk|"Single token range. 
> Force terminate repair shortly after it was triggered."|Repair threads must > be cleaned up| > |Incremental repair|trunk|"Parallel (mandatory) > No compaction during repair"|"Anticompaction status (repairedAt != 0) on all > SSTables > No pending repair on SSTables after completion (could require to wait a bit > as this will happen asynchronously) > Out of sync ranges > 0 + Subsequent run must show no out of sync range"| > |Incremental repair|trunk|"Parallel (mandatory) > Major compaction triggered during repair"|"Anticompaction status (repairedAt > != 0) on all SSTables > No pending repair on SSTables after completion (could require to wait a bit > as this will happen asynchronously) > Out of sync ranges > 0 + Subsequent run must show no out of sync range"| > |Incremental repair|trunk|Force terminate repair shortly after it was > triggered.|Repair threads must be cleaned up| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15580) 4.0 quality testing: Repair
[ https://issues.apache.org/jira/browse/CASSANDRA-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17234403#comment-17234403 ] Alexander Dejanovski commented on CASSANDRA-15580: -- Sounds good [~marcuse], thanks for the notice. Work started on CASSANDRA-16245 to implement the new test suite. If anyone's interested in picking up CASSANDRA-16244 it would be greatly appreciated! > 4.0 quality testing: Repair > --- > > Key: CASSANDRA-15580 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15580 > Project: Cassandra > Issue Type: Task > Components: Test/dtest/python >Reporter: Josh McKenzie >Assignee: Alexander Dejanovski >Priority: Normal > Fix For: 4.0-rc > > > Reference [doc from > NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#] > for context. > *Shepherd: Alexander Dejanovski* > We aim for 4.0 to have the first fully functioning incremental repair > solution (CASSANDRA-9143)! Furthermore we aim to verify that all types of > repair: (full range, sub range, incremental) function as expected as well as > ensuring community tools such as Reaper work. CASSANDRA-3200 adds an > experimental option to reduce the amount of data streamed during repair, we > should write more tests and see how it works with big nodes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-16245) Implement repair quality test scenarios
[ https://issues.apache.org/jira/browse/CASSANDRA-16245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17234388#comment-17234388 ] Alexander Dejanovski commented on CASSANDRA-16245: -- Dev repo was created here for anyone interested: https://github.com/riptano/cassandra-rtest > Implement repair quality test scenarios > --- > > Key: CASSANDRA-16245 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16245 > Project: Cassandra > Issue Type: Task > Components: Test/dtest/java >Reporter: Alexander Dejanovski >Assignee: Radovan Zvoncek >Priority: Normal > Fix For: 4.0-rc > > > Implement the following test scenarios in a new test suite for repair > integration testing with significant load: > Generate/restore a workload of ~100GB per node. Medusa should be considered > to create the initial backup which could then be restored from an S3 bucket > to speed up node population. > Data should on purpose require repair and be generated accordingly. > Perform repairs for a 3 nodes cluster with 4 cores each and 16GB-32GB RAM > (m5d.xlarge instances would be the most cost efficient type). > Repaired keyspaces will use RF=3 or RF=2 in some cases (the latter is for > subranges with different sets of replicas). 
> ||Mode||Version||Settings||Checks|| > |Full repair|trunk|Sequential + All token ranges|"No anticompaction > (repairedAt==0) > Out of sync ranges > 0 > Subsequent run must show no out of sync range"| > |Full repair|trunk|Parallel + Primary range|"No anticompaction (repairedAt==0) > Out of sync ranges > 0 > Subsequent run must show no out of sync range"| > |Full repair|trunk|Force terminate repair shortly after it was > triggered|Repair threads must be cleaned up| > |Subrange repair|trunk|Sequential + single token range|"No anticompaction > (repairedAt==0) > Out of sync ranges > 0 > Subsequent run must show no out of sync range"| > |Subrange repair|trunk|Parallel + 10 token ranges which have the same > replicas|"No anticompaction (repairedAt == 0) > Out of sync ranges > 0 > Subsequent run must show no out of sync range > A single repair session will handle all subranges at once"| > |Subrange repair|trunk|Parallel + 10 token ranges which have different > replicas|"No anticompaction (repairedAt==0) > Out of sync ranges > 0 > Subsequent run must show no out of sync range > More than one repair session is triggered to process all subranges"| > |Subrange repair|trunk|"Single token range. 
> Force terminate repair shortly after it was triggered."|Repair threads must > be cleaned up| > |Incremental repair|trunk|"Parallel (mandatory) > No compaction during repair"|"Anticompaction status (repairedAt != 0) on all > SSTables > No pending repair on SSTables after completion (could require to wait a bit > as this will happen asynchronously) > Out of sync ranges > 0 + Subsequent run must show no out of sync range"| > |Incremental repair|trunk|"Parallel (mandatory) > Major compaction triggered during repair"|"Anticompaction status (repairedAt > != 0) on all > SSTables > No pending repair on SSTables after completion (could require to wait a bit > as this will happen asynchronously) > Out of sync ranges > 0 + Subsequent run must show no out of sync range"| > |Incremental repair|trunk|Force terminate repair shortly after it was > triggered.|Repair threads must be cleaned up|
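The checks repeated across the table above reduce to three predicates. The sketch below only illustrates that logic; the class and method names are hypothetical (not part of the actual test suite), and it assumes the `repairedAt` values have already been collected from SSTable metadata:

```java
import java.util.List;

// Illustrative predicates for the table's checks (hypothetical names,
// not the real test-suite API).
final class RepairChecks {
    // Full and subrange repair must not anticompact: repairedAt stays 0.
    static boolean noAnticompaction(List<Long> repairedAtPerSSTable) {
        return repairedAtPerSSTable.stream().allMatch(r -> r == 0L);
    }

    // Incremental repair must leave every SSTable marked as repaired.
    static boolean allAnticompacted(List<Long> repairedAtPerSSTable) {
        return repairedAtPerSSTable.stream().allMatch(r -> r != 0L);
    }

    // The first run must find mismatches (the data is generated to need
    // repair); a subsequent run must find none.
    static boolean outOfSyncConverges(long firstRunRanges, long secondRunRanges) {
        return firstRunRanges > 0 && secondRunRanges == 0;
    }
}
```

The termination scenarios ("repair threads must be cleaned up") are not modeled here, as they depend on inspecting the node's thread state rather than SSTable metadata.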
[jira] [Commented] (CASSANDRA-16245) Implement repair quality test scenarios
[ https://issues.apache.org/jira/browse/CASSANDRA-16245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17234384#comment-17234384 ] Alexander Dejanovski commented on CASSANDRA-16245: -- [~zvo], I'll write up the Gherkin files with the test scenarios so that you can implement the test steps. As agreed upon, we can work in a separate repo for initial development and integrate the code into the Cassandra repo once we have something to show.
[jira] [Commented] (CASSANDRA-15584) 4.0 quality testing: Tooling - External Ecosystem
[ https://issues.apache.org/jira/browse/CASSANDRA-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17234355#comment-17234355 ] Alexander Dejanovski commented on CASSANDRA-15584: -- CASSANDRA-16280 was committed to fix the sstableloader issues in Cassandra and the Medusa PR fixing the tests was merged. We're done with Medusa (table updated). > 4.0 quality testing: Tooling - External Ecosystem > - > > Key: CASSANDRA-15584 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15584 > Project: Cassandra > Issue Type: Task > Components: Tool/external >Reporter: Josh McKenzie >Assignee: Benjamin Lerer >Priority: Normal > Fix For: 4.0-rc > > > Reference [doc from > NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#] > for context. > *Shepherd: Benjamin Lerer* > Many users of Apache Cassandra employ open source tooling to automate > Cassandra configuration, runtime management, and repair scheduling. Prior to > release, we need to confirm that popular third-party tools function properly. 
> Current list of tools: > || Name || Status || Contact || > | [Priam|http://netflix.github.io/Priam/] |{color:#00875A} *DONE WITH > ALPHA*{color} (need to be tested with beta) | [~sumanth.pasupuleti]| > | [sstabletools|https://github.com/instaclustr/cassandra-sstable-tools] | > *NOT STARTED* | [~stefan.miklosovic]| > | [cassandra-exporter|https://github.com/instaclustr/cassandra-exporter]| > *NOT STARTED* | [~stefan.miklosovic]| > | [Instaclustr Cassandra > operator|https://github.com/instaclustr/cassandra-operator]| > {color:#00875A}*DONE*{color} | [~stefan.miklosovic]| > | [Instaclustr Esop | > https://github.com/instaclustr/instaclustr-esop]|{color:#00875A}*DONE*{color} > | [~stefan.miklosovic]| > | [Instaclustr Icarus | > https://github.com/instaclustr/instaclustr-icarus]|{color:#00875A}*DONE*{color} > | [~stefan.miklosovic]| > | [Cassandra SSTable generator | > https://github.com/instaclustr/cassandra-sstable-generator]|{color:#00875A}*DONE*{color}| > [~stefan.miklosovic]| > | [Cassandra TTL Remover | https://github.com/instaclustr/TTLRemover] | > {color:#00875A}*DONE*{color} | [~stefan.miklosovic]| > | [Cassandra Everywhere Strategy | > https://github.com/instaclustr/cassandra-everywhere-strategy] | > {color:#00875A}*DONE*{color} | [~stefan.miklosovic]| > | [Cassandra LDAP Authenticator | > https://github.com/instaclustr/cassandra-ldap] | {color:#00875A}*DONE*{color} > | [~stefan.miklosovic]| > | [Instaclustr Minotaur | > https://github.com/instaclustr/instaclustr-minotaur] | > {color:#00875A}*DONE*{color} | [~stefan.miklosovic]| > | [Reaper|http://cassandra-reaper.io/]| {color:#00875A}*AUTOMATIC*{color} | > [~adejanovski]| > | [Medusa|https://github.com/thelastpickle/cassandra-medusa]| > {color:#00875A}*DONE*{color}| [~adejanovski]| > | [Casskop|https://orange-opensource.github.io/casskop/]| *NOT STARTED*| > Franck Dehay| > | > [spark-cassandra-connector|https://github.com/datastax/spark-cassandra-connector]| > {color:#00875A}*DONE*{color}| [~jtgrabowski]| 
> | [cass operator|https://github.com/datastax/cass-operator]| > {color:#00875A}*DONE*{color}| [~jimdickinson]| > | [metric > collector|https://github.com/datastax/metric-collector-for-apache-cassandra]| > {color:#00875A}*DONE*{color}| [~tjake]| > | [management > API|https://github.com/datastax/management-api-for-apache-cassandra]| > {color:#00875A}*DONE*{color}| [~tjake]| > Column descriptions: > * *Name*: Name and link to the tool's official page > * *Status*: {{NOT STARTED}}, {{IN PROGRESS}}, {{BLOCKED}} if you hit any > issue and have to wait for it to be solved, {{DONE}}, {{AUTOMATIC}} if > testing 4.0 is part of your CI process. > * *Contact*: The person acting as the contact point for that tool.
[jira] [Updated] (CASSANDRA-15584) 4.0 quality testing: Tooling - External Ecosystem
[ https://issues.apache.org/jira/browse/CASSANDRA-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski updated CASSANDRA-15584: - Description: (tool status table updated: Medusa moved from *IN PROGRESS* to {color:#00875A}*DONE*{color}; the full current table appears in the comment above)
[jira] [Commented] (CASSANDRA-15584) 4.0 quality testing: Tooling - External Ecosystem
[ https://issues.apache.org/jira/browse/CASSANDRA-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233676#comment-17233676 ] Alexander Dejanovski commented on CASSANDRA-15584: -- I found two problems when investigating the failing tests with Medusa: * pre-4.0 Cassandra was apparently more permissive about missing ciphers when sstableloader was invoked. I've added the path to cassandra.yaml in the sstableloader call, which fixed the issue (PR pending merge). * [CASSANDRA-16144|https://issues.apache.org/jira/browse/CASSANDRA-16144] recently introduced a bug in parsing loader options with encryption args. I've created [CASSANDRA-16280|https://issues.apache.org/jira/browse/CASSANDRA-16280] to track the issue along with a fix. I'll update the table in this ticket's description once trunk is fixed and the Medusa PR gets merged.
[jira] [Updated] (CASSANDRA-16280) SSTableLoader will fail if encryption parameters are used due to CASSANDRA-16144
[ https://issues.apache.org/jira/browse/CASSANDRA-16280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski updated CASSANDRA-16280: - Test and Documentation Plan: Regression test added under {{LoaderOptionsTest.testEncryptionSettings}}, invoking {{LoaderOptions.builder().parseArgs()}} with all the encryption options. Failure with the current trunk: {code:java} test: [echo] Number of test runners: 3 [mkdir] Created dir: /Users/adejanovski/projets/cassandra/thelastpickle/cassandra/build/test/cassandra [mkdir] Created dir: /Users/adejanovski/projets/cassandra/thelastpickle/cassandra/build/test/output [junit-timeout] Testsuite: org.apache.cassandra.tools.LoaderOptionsTest [junit-timeout] Testsuite: org.apache.cassandra.tools.LoaderOptionsTest Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0,824 sec [junit-timeout] [junit-timeout] Testcase: testEncryptionSettings(org.apache.cassandra.tools.LoaderOptionsTest): Caused an ERROR [junit-timeout] EncryptionOptions cannot be changed after configuration applied [junit-timeout] java.lang.IllegalStateException: EncryptionOptions cannot be changed after configuration applied [junit-timeout] at org.apache.cassandra.config.EncryptionOptions.ensureConfigNotApplied(EncryptionOptions.java:162) [junit-timeout] at org.apache.cassandra.config.EncryptionOptions.applyConfig(EncryptionOptions.java:130) [junit-timeout] at org.apache.cassandra.tools.LoaderOptions$Builder.parseArgs(LoaderOptions.java:478) [junit-timeout] at org.apache.cassandra.tools.LoaderOptionsTest.testEncryptionSettings(LoaderOptionsTest.java:55) [junit-timeout] [junit-timeout] [junit-timeout] Test org.apache.cassandra.tools.LoaderOptionsTest FAILED {code} The test passes with the patch: {code:java} test: [echo] Number of test runners: 3 [junit-timeout] Testsuite: org.apache.cassandra.tools.LoaderOptionsTest [junit-timeout] Testsuite: org.apache.cassandra.tools.LoaderOptionsTest Tests run: 2, Failures: 0, Errors: 0, 
Skipped: 0, Time elapsed: 0,5 sec BUILD SUCCESSFUL {code} Status: Patch Available (was: In Progress) Here's the patch: * [branch|https://github.com/thelastpickle/cassandra/tree/CASSANDRA-16280] * [commit|https://github.com/thelastpickle/cassandra/commit/dbce40a06d89c415cbe172e4726b6c4bb38fe4c9] I'm waiting for the build to go through in CircleCI. > SSTableLoader will fail if encryption parameters are used due to > CASSANDRA-16144 > > > Key: CASSANDRA-16280 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16280 > Project: Cassandra > Issue Type: Bug > Components: Tool/bulk load >Reporter: Alexander Dejanovski >Assignee: Alexander Dejanovski >Priority: Normal > Fix For: 4.0-beta > > > CASSANDRA-16144 recently introduced [repeated calls > |https://github.com/apache/cassandra/compare/trunk...dcapwell:commit_remote_branch/CASSANDRA-16144-trunk-209E2350-3A50-457E-A466-F2661CD0D4D1#diff-b87acacbdc34464d327446f7a7e64718dbf843d70f5fbc9e5ddcd1bafca0f441R478] to > _clientEncOptions.applyConfig()_ for each encryption parameter passed to the > sstableloader command line. > This consistently fails because _applyConfig()_ can be called only once due > to the _ensureConfigNotApplied()_ check at the beginning of the method. > This call is not necessary since the _with...()_ methods will invoke > _applyConfig()_ each time: > {code:java} > public EncryptionOptions withTrustStore(String truststore) > { > return new EncryptionOptions(keystore, keystore_password, truststore, > truststore_password, cipher_suites, > protocol, algorithm, store_type, > require_client_auth, require_endpoint_verification, > enabled, optional).applyConfig(); > } > {code} > I'll build a patch for this with the appropriate unit test.
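The failure mode can be modeled with a stripped-down sketch (my own illustrative class, not Cassandra's actual `EncryptionOptions`): each `with...()` builder returns a fresh instance and applies the config itself, so a second `applyConfig()` on the same instance throws, which is what the extra per-option calls introduced by CASSANDRA-16144 triggered.

```java
// Minimal model of the CASSANDRA-16280 failure mode (illustrative only,
// not the real org.apache.cassandra.config.EncryptionOptions).
final class ImmutableOptions {
    final String truststore;
    private boolean applied = false;

    ImmutableOptions(String truststore) { this.truststore = truststore; }

    // May only run once per instance, mirroring ensureConfigNotApplied().
    ImmutableOptions applyConfig() {
        if (applied)
            throw new IllegalStateException("cannot be changed after configuration applied");
        applied = true;
        return this;
    }

    // Each wither returns a *new*, already-applied instance, so callers
    // must not call applyConfig() again on the result.
    ImmutableOptions withTrustStore(String ts) {
        return new ImmutableOptions(ts).applyConfig();
    }
}
```

Chaining `withTrustStore(...)` alone is safe because every call builds a fresh instance; the bug came from additionally calling `applyConfig()` on the returned instance once per encryption flag, so any command line with more than one such flag failed.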
[jira] [Updated] (CASSANDRA-16280) SSTableLoader will fail if encryption parameters are used due to CASSANDRA-16144
[ https://issues.apache.org/jira/browse/CASSANDRA-16280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski updated CASSANDRA-16280: - Bug Category: Parent values: Availability(12983)Level 1 values: Process Crash(12992) Complexity: Normal Discovered By: User Report Severity: Critical Status: Open (was: Triage Needed)
[jira] [Updated] (CASSANDRA-16280) SSTableLoader will fail if encryption parameters are used due to CASSANDRA-16144
[ https://issues.apache.org/jira/browse/CASSANDRA-16280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski updated CASSANDRA-16280: - Fix Version/s: 4.0-beta
[jira] [Updated] (CASSANDRA-16280) SSTableLoader will fail if encryption parameters are used due to CASSANDRA-16144
[ https://issues.apache.org/jira/browse/CASSANDRA-16280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski updated CASSANDRA-16280: - Component/s: Tool/bulk load
[jira] [Created] (CASSANDRA-16280) SSTableLoader will fail if encryption parameters are used due to CASSANDRA-16144
Alexander Dejanovski created CASSANDRA-16280: Summary: SSTableLoader will fail if encryption parameters are used due to CASSANDRA-16144 Key: CASSANDRA-16280 URL: https://issues.apache.org/jira/browse/CASSANDRA-16280 Project: Cassandra Issue Type: Bug Reporter: Alexander Dejanovski Assignee: Alexander Dejanovski
[jira] [Assigned] (CASSANDRA-16245) Implement repair quality test scenarios
[ https://issues.apache.org/jira/browse/CASSANDRA-16245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski reassigned CASSANDRA-16245: Assignee: Radovan Zvoncek
[jira] [Updated] (CASSANDRA-16245) Implement repair quality test scenarios
[ https://issues.apache.org/jira/browse/CASSANDRA-16245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski updated CASSANDRA-16245: - Fix Version/s: 4.0-rc > Implement repair quality test scenarios > --- > > Key: CASSANDRA-16245 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16245 > Project: Cassandra > Issue Type: Task > Components: Test/dtest/java >Reporter: Alexander Dejanovski >Priority: Normal > Fix For: 4.0-rc > > > Implement the following test scenarios in a new test suite for repair > integration testing with significant load: > Generate/restore a workload of ~100GB per node. Medusa should be considered > to create the initial backup which could then be restored from an S3 bucket > to speed up node population. > Data should on purpose require repair and be generated accordingly. > Perform repairs for a 3 nodes cluster with 4 cores each and 16GB-32GB RAM > (m5d.xlarge instances would be the most cost efficient type). > Repaired keyspaces will use RF=3 or RF=2 in some cases (the latter is for > subranges with different sets of replicas). 
> ||Mode||Version||Settings||Checks|| > |Full repair|trunk|Sequential + All token ranges|"No anticompaction > (repairedAt==0) > Out of sync ranges > 0 > Subsequent run must show no out of sync range"| > |Full repair|trunk|Parallel + Primary range|"No anticompaction (repairedAt==0) > Out of sync ranges > 0 > Subsequent run must show no out of sync range"| > |Full repair|trunk|Force terminate repair shortly after it was > triggered|Repair threads must be cleaned up| > |Subrange repair|trunk|Sequential + single token range|"No anticompaction > (repairedAt==0) > Out of sync ranges > 0 > Subsequent run must show no out of sync range"| > |Subrange repair|trunk|Parallel + 10 token ranges which have the same > replicas|"No anticompaction (repairedAt == 0) > Out of sync ranges > 0 > Subsequent run must show no out of sync range > A single repair session will handle all subranges at once"| > |Subrange repair|trunk|Parallel + 10 token ranges which have different > replicas|"No anticompaction (repairedAt==0) > Out of sync ranges > 0 > Subsequent run must show no out of sync range > More than one repair session is triggered to process all subranges"| > |Subrange repair|trunk|"Single token range. 
> Force terminate repair shortly after it was triggered."|Repair threads must > be cleaned up| > |Incremental repair|trunk|"Parallel (mandatory) > No compaction during repair"|"Anticompaction status (repairedAt != 0) on all > SSTables > No pending repair on SSTables after completion (could require to wait a bit > as this will happen asynchronously) > Out of sync ranges > 0 + Subsequent run must show no out of sync range"| > |Incremental repair|trunk|"Parallel (mandatory) > Major compaction triggered during repair"|"Anticompaction status (repairedAt > != 0) on all SSTables > No pending repair on SSTables after completion (could require to wait a bit > as this will happen asynchronously) > Out of sync ranges > 0 + Subsequent run must show no out of sync range"| > |Incremental repair|trunk|Force terminate repair shortly after it was > triggered.|Repair threads must be cleaned up| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-16245) Implement repair quality test scenarios
Alexander Dejanovski created CASSANDRA-16245: Summary: Implement repair quality test scenarios Key: CASSANDRA-16245 URL: https://issues.apache.org/jira/browse/CASSANDRA-16245 Project: Cassandra Issue Type: Task Components: Test/dtest/java Reporter: Alexander Dejanovski Implement the following test scenarios in a new test suite for repair integration testing with significant load: Generate/restore a workload of ~100GB per node. Medusa should be considered to create the initial backup which could then be restored from an S3 bucket to speed up node population. Data should on purpose require repair and be generated accordingly. Perform repairs for a 3 nodes cluster with 4 cores each and 16GB-32GB RAM (m5d.xlarge instances would be the most cost efficient type). Repaired keyspaces will use RF=3 or RF=2 in some cases (the latter is for subranges with different sets of replicas). ||Mode||Version||Settings||Checks|| |Full repair|trunk|Sequential + All token ranges|"No anticompaction (repairedAt==0) Out of sync ranges > 0 Subsequent run must show no out of sync range"| |Full repair|trunk|Parallel + Primary range|"No anticompaction (repairedAt==0) Out of sync ranges > 0 Subsequent run must show no out of sync range"| |Full repair|trunk|Force terminate repair shortly after it was triggered|Repair threads must be cleaned up| |Subrange repair|trunk|Sequential + single token range|"No anticompaction (repairedAt==0) Out of sync ranges > 0 Subsequent run must show no out of sync range"| |Subrange repair|trunk|Parallel + 10 token ranges which have the same replicas|"No anticompaction (repairedAt == 0) Out of sync ranges > 0 Subsequent run must show no out of sync range A single repair session will handle all subranges at once"| |Subrange repair|trunk|Parallel + 10 token ranges which have different replicas|"No anticompaction (repairedAt==0) Out of sync ranges > 0 Subsequent run must show no out of sync range More than one repair session is triggered to process all subranges"| 
|Subrange repair|trunk|"Single token range. Force terminate repair shortly after it was triggered."|Repair threads must be cleaned up| |Incremental repair|trunk|"Parallel (mandatory) No compaction during repair"|"Anticompaction status (repairedAt != 0) on all SSTables No pending repair on SSTables after completion (could require to wait a bit as this will happen asynchronously) Out of sync ranges > 0 + Subsequent run must show no out of sync range"| |Incremental repair|trunk|"Parallel (mandatory) Major compaction triggered during repair"|"Anticompaction status (repairedAt != 0) on all SSTables No pending repair on SSTables after completion (could require to wait a bit as this will happen asynchronously) Out of sync ranges > 0 + Subsequent run must show no out of sync range"| |Incremental repair|trunk|Force terminate repair shortly after it was triggered.|Repair threads must be cleaned up| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-16244) Create a jvm upgrade dtest for mixed versions repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-16244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski updated CASSANDRA-16244: - Fix Version/s: 4.0-rc > Create a jvm upgrade dtest for mixed versions repairs > - > > Key: CASSANDRA-16244 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16244 > Project: Cassandra > Issue Type: Task >Reporter: Alexander Dejanovski >Priority: Normal > Fix For: 4.0-rc > > > Repair during upgrades should fail on mixed version clusters. > We'd need an in-jvm upgrade dtest to check that repair indeed fails as > expected with mixed current version+previous major version clusters. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-16244) Create a jvm upgrade dtest for mixed versions repairs
Alexander Dejanovski created CASSANDRA-16244: Summary: Create a jvm upgrade dtest for mixed versions repairs Key: CASSANDRA-16244 URL: https://issues.apache.org/jira/browse/CASSANDRA-16244 Project: Cassandra Issue Type: Task Reporter: Alexander Dejanovski Repair during upgrades should fail on mixed version clusters. We'd need an in-jvm upgrade dtest to check that repair indeed fails as expected with mixed current version+previous major version clusters. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15580) 4.0 quality testing: Repair
[ https://issues.apache.org/jira/browse/CASSANDRA-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17224834#comment-17224834 ] Alexander Dejanovski commented on CASSANDRA-15580: -- Thanks for the feedback [~marcuse] and [~jmckenzie]! Good point on using in-jvm upgrade dtests for testing mixed-version repairs (y) I'll create a subticket for this. I'll add the flag from CASSANDRA-3200 to the test plan for sure, but I need to think a little about what to test precisely in this scenario. We'd need to figure out the following things:
* How/where do we provision the nodes? AFAIK we don't have a test suite such as the one we're planning to build here, which will require spinning up actual clusters on external VMs (no ccm). Spinning up AWS instances is a low-friction path with a tool such as tlp-cluster (we'll need a sponsor for hosting the instances). k8s is probably down the road, but it would be good to have the community operator before we use it. Are there any other obvious tools/ways to spin up multi-instance clusters?
* Which testing framework to use? I personally like using Gherkin-syntax-based frameworks such as Cucumber, but we'd need to get a feel for the community's appetite to introduce such a framework. Otherwise we'd probably fall back to JUnit, but my take is that while it's really good for unit tests, it's not a good fit for integration tests. Any input/opinion on testing frameworks is appreciated.
* Where is that test suite stored? It would be good to have it stored directly in the Cassandra repo, but we could store it in a side project as was done for dtests. 
> 4.0 quality testing: Repair > --- > > Key: CASSANDRA-15580 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15580 > Project: Cassandra > Issue Type: Task > Components: Test/dtest/python >Reporter: Josh McKenzie >Assignee: Alexander Dejanovski >Priority: Normal > Fix For: 4.0-rc > > > Reference [doc from > NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#] > for context. > *Shepherd: Alexander Dejanovski* > We aim for 4.0 to have the first fully functioning incremental repair > solution (CASSANDRA-9143)! Furthermore we aim to verify that all types of > repair: (full range, sub range, incremental) function as expected as well as > ensuring community tools such as Reaper work. CASSANDRA-3200 adds an > experimental option to reduce the amount of data streamed during repair, we > should write more tests and see how it works with big nodes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-16161) Validation Compactions causing Java GC pressure
[ https://issues.apache.org/jira/browse/CASSANDRA-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17224671#comment-17224671 ] Alexander Dejanovski commented on CASSANDRA-16161: -- My initial preference while reading this ticket was to use different throttles as well. As mentioned by [~mck], validation compactions put a slightly different type of pressure on nodes, and folks might want to unthrottle validation compactions (the current behavior) while keeping a throttle on compactions. On the other hand, disks are now far faster than they were when the compaction throttle was introduced, and heap pressure is mostly what we're protecting clusters from. At TLP we used to consider that Cassandra couldn't sustain compaction throughput above 45MB/s anyway because of heap pressure (it would only allow bursts). This will indeed change in 4.0 with all the improvements made to lower compaction heap pressure. I also think that repair should be throttled in order to lighten its impact on clusters and force folks to investigate if it's not going fast enough, rather than harm clusters with defaults tuned to make it go fast. Also, adding a new hidden configuration setting just for 3.0/3.x this close to 4.0 going GA doesn't seem like the best thing to do. TL;DR: +1 on using {{compaction_throughput_mb_per_sec}} to throttle validation compactions as well as standard compactions. > Validation Compactions causing Java GC pressure > --- > > Key: CASSANDRA-16161 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16161 > Project: Cassandra > Issue Type: Improvement > Components: Local/Compaction, Local/Config, Tool/nodetool >Reporter: Cameron Zemek >Assignee: Stefan Miklosovic >Priority: Normal > Fix For: 3.11.x, 3.11.8 > > Attachments: 16161.patch > > > Validation Compactions are not rate limited which can cause Java GC pressure > and result in spikes in latency. 
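To make the shared-throttle idea concrete, here is a minimal token-bucket sketch (hypothetical demo code, not Cassandra's actual rate limiter): compaction and validation reads draw bytes from the same per-second budget, so their combined throughput — and the object churn behind the GC pressure — can never exceed the single configured ceiling.

```java
// Hypothetical token-bucket sketch of ONE shared throughput throttle for
// both compaction and validation I/O; NOT Cassandra's implementation.
public class SharedThrottleDemo {
    static final class Throttle {
        private final long bytesPerSecond;
        private long availableBytes;

        Throttle(long bytesPerSecond) {
            this.bytesPerSecond = bytesPerSecond;
            this.availableBytes = bytesPerSecond; // one second's budget
        }

        // Grants as much of the request as the current budget allows.
        synchronized long acquire(long requestedBytes) {
            long granted = Math.min(requestedBytes, availableBytes);
            availableBytes -= granted;
            return granted;
        }

        // In a real system a timer would call this once per second.
        synchronized void refill() {
            availableBytes = bytesPerSecond;
        }
    }

    public static void main(String[] args) {
        // 16 MB/s shared between both paths, as with a single
        // compaction_throughput_mb_per_sec knob covering validation too.
        Throttle shared = new Throttle(16L * 1024 * 1024);
        long compaction = shared.acquire(12L * 1024 * 1024); // gets 12 MB
        long validation = shared.acquire(12L * 1024 * 1024); // only 4 MB left
        System.out.println("combined MB this second: "
                + (compaction + validation) / (1024 * 1024));
    }
}
```

With separate throttles each path would own its own bucket and the combined load could reach their sum; the single shared bucket is the trade-off the +1 above argues for.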
[jira] [Updated] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair
[ https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski updated CASSANDRA-15902: - Status: Ready to Commit (was: Review In Progress) > OOM because repair session thread not closed when terminating repair > > > Key: CASSANDRA-15902 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15902 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Repair >Reporter: Swen Fuhrmann >Assignee: Swen Fuhrmann >Priority: Normal > Fix For: 3.0.x, 3.11.x > > Attachments: heap-mem-histo.txt, repair-terminated.txt > > > In our cluster, after a while some nodes running slowly out of memory. On > that nodes we observed that Cassandra Reaper terminate repairs with a JMX > call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} because > reaching timeout of 30 min. > In the memory heap dump we see lot of instances of > {{io.netty.util.concurrent.FastThreadLocalThread}} occupy most of the memory: > {noformat} > 119 instances of "io.netty.util.concurrent.FastThreadLocalThread", loaded by > "sun.misc.Launcher$AppClassLoader @ 0x51a80" occupy 8.445.684.480 (93,96 > %) bytes. 
{noformat} > In the thread dump we see lot of repair threads: > {noformat} > grep "Repair#" threaddump.txt | wc -l > 50 {noformat} > > The repair jobs are waiting for the validation to finish: > {noformat} > "Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x12fc5000 > nid=0x542a waiting on condition [0x7f81ee414000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x0007939bcfc8> (a > com.google.common.util.concurrent.AbstractFuture$Sync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) > at > com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285) > at > com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) > at > com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137) > at > com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1509) > at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:160) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at > org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) > at > org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$13/480490520.run(Unknown > Source) > at java.lang.Thread.run(Thread.java:748) {noformat} > > Thats the line where the threads stuck: > {noformat} > // Wait for validation to complete > Futures.getUnchecked(validations); {noformat} > > The call to 
{{StorageServiceMBean.forceTerminateAllRepairSessions()}} stops > the thread pool executor. It looks like futures which are in progress > will therefore never be completed, and the repair thread waits forever and > won't be finished. > > Environment: > Cassandra version: 3.11.4 and 3.11.6 > Cassandra Reaper: 1.4.0 > JVM memory settings: > {noformat} > -Xms11771M -Xmx11771M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 > -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat} > on another cluster with same issue: > {noformat} > -Xms31744M -Xmx31744M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 > -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat} > Java Runtime: > {noformat} > openjdk version "1.8.0_212" > OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b03) > OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b03, mixed mode) > {noformat} > > The same issue is described in this comment: > https://issues.apache.org/jira/browse/CASSANDRA-14355?focusedCommentId=16992973=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16992973 > As suggested in the comments I created this new specific ticket.
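The hang described above can be reproduced with a plain executor, outside Cassandra. Below is a minimal hypothetical sketch (demo names, not Cassandra or Reaper code): a task queued behind a busy worker is drained by shutdownNow() without ever being completed, so an unbounded Futures.getUnchecked()-style wait on its Future would park forever; the demo uses a bounded get() so it returns instead of hanging.

```java
import java.util.concurrent.*;

// Hypothetical demo (NOT Cassandra code) of why a thread waiting on a
// Future can hang after shutdownNow(): tasks drained from the queue are
// returned, never run, and never completed.
public class StuckFutureDemo {
    // True if the queued task's Future is still incomplete after the pool
    // has been shut down -- an unbounded get() on it would never return.
    public static boolean queuedFutureNeverCompletes() throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        // Occupy the single worker so the next submission stays queued,
        // like a repair job waiting behind a long-running validation.
        pool.submit(() -> {
            started.countDown();
            try { Thread.sleep(60_000); } catch (InterruptedException ok) {}
        });
        started.await();
        Future<?> queued = pool.submit(() -> {});
        // shutdownNow() interrupts the running task and drains the queue;
        // the drained FutureTask is neither cancelled nor completed.
        pool.shutdownNow();
        try {
            queued.get(200, TimeUnit.MILLISECONDS); // bounded wait, demo only
            return false;
        } catch (TimeoutException stillIncomplete) {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("future never completes: " + queuedFutureNeverCompletes());
    }
}
```

This mirrors the parked {{Repair#}} threads in the thread dump: each is blocked in an unbounded wait on a validation Future that the terminated executor will never complete.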
[jira] [Updated] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair
[ https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski updated CASSANDRA-15902: - Status: Patch Available (was: Review In Progress) LGTM > OOM because repair session thread not closed when terminating repair > > > Key: CASSANDRA-15902 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15902 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Repair >Reporter: Swen Fuhrmann >Assignee: Swen Fuhrmann >Priority: Normal > Fix For: 3.0.x, 3.11.x > > Attachments: heap-mem-histo.txt, repair-terminated.txt > > > In our cluster, after a while some nodes running slowly out of memory. On > that nodes we observed that Cassandra Reaper terminate repairs with a JMX > call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} because > reaching timeout of 30 min. > In the memory heap dump we see lot of instances of > {{io.netty.util.concurrent.FastThreadLocalThread}} occupy most of the memory: > {noformat} > 119 instances of "io.netty.util.concurrent.FastThreadLocalThread", loaded by > "sun.misc.Launcher$AppClassLoader @ 0x51a80" occupy 8.445.684.480 (93,96 > %) bytes. 
{noformat} > In the thread dump we see lot of repair threads: > {noformat} > grep "Repair#" threaddump.txt | wc -l > 50 {noformat} > > The repair jobs are waiting for the validation to finish: > {noformat} > "Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x12fc5000 > nid=0x542a waiting on condition [0x7f81ee414000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x0007939bcfc8> (a > com.google.common.util.concurrent.AbstractFuture$Sync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) > at > com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285) > at > com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) > at > com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137) > at > com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1509) > at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:160) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at > org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) > at > org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$13/480490520.run(Unknown > Source) > at java.lang.Thread.run(Thread.java:748) {noformat} > > Thats the line where the threads stuck: > {noformat} > // Wait for validation to complete > Futures.getUnchecked(validations); {noformat} > > The call to 
{{StorageServiceMBean.forceTerminateAllRepairSessions()}} stops > the thread pool executor. It looks like futures which are in progress > will therefore never be completed, and the repair thread waits forever and > won't be finished. > > Environment: > Cassandra version: 3.11.4 and 3.11.6 > Cassandra Reaper: 1.4.0 > JVM memory settings: > {noformat} > -Xms11771M -Xmx11771M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 > -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat} > on another cluster with same issue: > {noformat} > -Xms31744M -Xmx31744M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 > -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat} > Java Runtime: > {noformat} > openjdk version "1.8.0_212" > OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b03) > OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b03, mixed mode) > {noformat} > > The same issue is described in this comment: > https://issues.apache.org/jira/browse/CASSANDRA-14355?focusedCommentId=16992973=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16992973 > As suggested in the comments I created this new specific ticket.
[jira] [Updated] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair
[ https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski updated CASSANDRA-15902: - Reviewers: Alexander Dejanovski (was: Alexander Dejanovski) Status: Review In Progress (was: Patch Available) > OOM because repair session thread not closed when terminating repair > > > Key: CASSANDRA-15902 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15902 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Repair >Reporter: Swen Fuhrmann >Assignee: Swen Fuhrmann >Priority: Normal > Fix For: 3.0.x, 3.11.x > > Attachments: heap-mem-histo.txt, repair-terminated.txt > > > In our cluster, after a while some nodes running slowly out of memory. On > that nodes we observed that Cassandra Reaper terminate repairs with a JMX > call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} because > reaching timeout of 30 min. > In the memory heap dump we see lot of instances of > {{io.netty.util.concurrent.FastThreadLocalThread}} occupy most of the memory: > {noformat} > 119 instances of "io.netty.util.concurrent.FastThreadLocalThread", loaded by > "sun.misc.Launcher$AppClassLoader @ 0x51a80" occupy 8.445.684.480 (93,96 > %) bytes. 
{noformat} > In the thread dump we see lot of repair threads: > {noformat} > grep "Repair#" threaddump.txt | wc -l > 50 {noformat} > > The repair jobs are waiting for the validation to finish: > {noformat} > "Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x12fc5000 > nid=0x542a waiting on condition [0x7f81ee414000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x0007939bcfc8> (a > com.google.common.util.concurrent.AbstractFuture$Sync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) > at > com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285) > at > com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) > at > com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137) > at > com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1509) > at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:160) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at > org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) > at > org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$13/480490520.run(Unknown > Source) > at java.lang.Thread.run(Thread.java:748) {noformat} > > Thats the line where the threads stuck: > {noformat} > // Wait for validation to complete > Futures.getUnchecked(validations); {noformat} > > The call to 
{{StorageServiceMBean.forceTerminateAllRepairSessions()}} stops > the thread pool executor. It looks like futures which are in progress > will therefore never be completed, and the repair thread waits forever and > won't be finished. > > Environment: > Cassandra version: 3.11.4 and 3.11.6 > Cassandra Reaper: 1.4.0 > JVM memory settings: > {noformat} > -Xms11771M -Xmx11771M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 > -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat} > on another cluster with same issue: > {noformat} > -Xms31744M -Xmx31744M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 > -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat} > Java Runtime: > {noformat} > openjdk version "1.8.0_212" > OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b03) > OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b03, mixed mode) > {noformat} > > The same issue is described in this comment: > https://issues.apache.org/jira/browse/CASSANDRA-14355?focusedCommentId=16992973=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16992973 > As suggested in the comments I created this new specific ticket.
[jira] [Commented] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair
[ https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17222163#comment-17222163 ] Alexander Dejanovski commented on CASSANDRA-15902: -- The code looks good to me. The patch works as expected and changes to the non-testing code are minimal. Unit tests for repairs are all passing. > OOM because repair session thread not closed when terminating repair > > > Key: CASSANDRA-15902 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15902 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Repair >Reporter: Swen Fuhrmann >Assignee: Swen Fuhrmann >Priority: Normal > Fix For: 3.0.x, 3.11.x > > Attachments: heap-mem-histo.txt, repair-terminated.txt > > > In our cluster, after a while some nodes running slowly out of memory. On > that nodes we observed that Cassandra Reaper terminate repairs with a JMX > call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} because > reaching timeout of 30 min. > In the memory heap dump we see lot of instances of > {{io.netty.util.concurrent.FastThreadLocalThread}} occupy most of the memory: > {noformat} > 119 instances of "io.netty.util.concurrent.FastThreadLocalThread", loaded by > "sun.misc.Launcher$AppClassLoader @ 0x51a80" occupy 8.445.684.480 (93,96 > %) bytes. 
{noformat} > In the thread dump we see lot of repair threads: > {noformat} > grep "Repair#" threaddump.txt | wc -l > 50 {noformat} > > The repair jobs are waiting for the validation to finish: > {noformat} > "Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x12fc5000 > nid=0x542a waiting on condition [0x7f81ee414000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x0007939bcfc8> (a > com.google.common.util.concurrent.AbstractFuture$Sync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) > at > com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285) > at > com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) > at > com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137) > at > com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1509) > at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:160) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at > org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) > at > org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$13/480490520.run(Unknown > Source) > at java.lang.Thread.run(Thread.java:748) {noformat} > > Thats the line where the threads stuck: > {noformat} > // Wait for validation to complete > Futures.getUnchecked(validations); {noformat} > > The call to 
{{StorageServiceMBean.forceTerminateAllRepairSessions()}} stops > the thread pool executor. It looks like that futures which are in progress > will therefor never be completed and the repair thread waits forever and > won't be finished. > > Environment: > Cassandra version: 3.11.4 and 3.11.6 > Cassandra Reaper: 1.4.0 > JVM memory settings: > {noformat} > -Xms11771M -Xmx11771M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 > -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat} > on another cluster with same issue: > {noformat} > -Xms31744M -Xmx31744M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 > -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat} > Java Runtime: > {noformat} > openjdk version "1.8.0_212" > OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b03) > OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b03, mixed mode) > {noformat} > > The same issue described in this comment: > https://issues.apache.org/jira/browse/CASSANDRA-14355?focusedCommentId=16992973=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16992973 > As suggested in the comments I created this new specific ticket. -- This message was sent by Atlassian Jira (v8.3.4#803005)
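The deadlock the ticket describes — a repair worker parked on a validation future that the terminated executor will never complete — can be illustrated with a minimal Python analogue (this is not Cassandra code; the names and the timeout are mine, added only so the sketch terminates):

```python
import threading
from concurrent.futures import Future, TimeoutError

# A "validation" future that nothing will ever complete, mimicking an
# in-flight validation whose executor was stopped by
# forceTerminateAllRepairSessions() before the future could resolve.
validation = Future()

outcome = {}

def repair_job():
    # Analogue of Futures.getUnchecked(validations): without a timeout this
    # call would park forever -- exactly the leaked Repair# thread.
    try:
        validation.result(timeout=0.2)
        outcome["state"] = "completed"
    except TimeoutError:
        outcome["state"] = "still waiting"

t = threading.Thread(target=repair_job)
t.start()
t.join()
print(outcome["state"])  # prints "still waiting": the waiter never returns on its own
```

The fix direction is the converse: on termination, complete or cancel the pending validation futures so the waiting repair threads can return and be cleaned up.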
[jira] [Commented] (CASSANDRA-15580) 4.0 quality testing: Repair
[ https://issues.apache.org/jira/browse/CASSANDRA-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212949#comment-17212949 ] Alexander Dejanovski commented on CASSANDRA-15580: -- Here's a test plan proposal: Generate/restore a workload of ~100GB to 200GB per node. Some SSTables will have to be deleted (in a random fashion?) to make repair go through streaming sessions. Perform repairs on a 3-node cluster with 4 cores each and 16GB RAM. Repaired keyspaces will use RF=3 or RF=2 in some cases (the latter is for subranges with different sets of replicas).
||Mode||Version||Settings||Checks||
|Full repair|trunk|Sequential + All token ranges|"No anticompaction (repairedAt==0) Out of sync ranges > 0 Subsequent run must show no out of sync range"|
|Full repair|trunk|Parallel + Primary range|"No anticompaction (repairedAt==0) Out of sync ranges > 0 Subsequent run must show no out of sync range"|
|Full repair|trunk|Force terminate repair shortly after it was triggered|Repair threads must be cleaned up|
|Full repair|Mixed trunk + latest 3.11.x|Sequential + All token ranges|Repair should fail|
|Subrange repair|trunk|Sequential + single token range|"No anticompaction (repairedAt==0) Out of sync ranges > 0 Subsequent run must show no out of sync range"|
|Subrange repair|trunk|Parallel + 10 token ranges which have the same replicas|"No anticompaction (repairedAt == 0) Out of sync ranges > 0 Subsequent run must show no out of sync range + Check that repair sessions are cleaned up after a force terminate"|
|Subrange repair|trunk|Parallel + 10 token ranges which have different replicas|"No anticompaction (repairedAt==0) Out of sync ranges > 0 Subsequent run must show no out of sync range + Check that repair sessions are cleaned up after a force terminate"|
|Subrange repair|trunk|"Single token range. Force terminate repair shortly after it was triggered."|Repair threads must be cleaned up|
|Subrange repair|Mixed trunk + latest 3.11.x|Sequential + single token range|Repair should fail|
|Incremental repair|trunk|"Parallel (mandatory) No compaction during repair"|"Anticompaction status (repairedAt != 0) on all SSTables No pending repair on SSTables after completion Out of sync ranges > 0 + Subsequent run must show no out of sync range"|
|Incremental repair|trunk|"Parallel (mandatory) Major compaction triggered during repair"|"Anticompaction status (repairedAt != 0) on all SSTables No pending repair on SSTables after completion Out of sync ranges > 0 + Subsequent run must show no out of sync range"|
|Incremental repair|trunk|Force terminate repair shortly after it was triggered.|Repair threads must be cleaned up|
|Incremental repair|Mixed trunk + latest 3.11.x|Parallel|Repair should fail|
I'm not sure about fuzz testing repair though. It's not a resilient process and isn't designed as such; resiliency is obtained through third party tools that will reschedule failed repairs. If a node that should be part of a repair session is down or goes down, the repair session will simply fail AFAIK. The mixed version tests could be challenging to set up as we probably don't want to pin a specific version as being the "previous" one. Should this test be performed consistently between trunk and the previous major version? On a major version bump (when trunk moves to 5.0), I'd expect the test to pass as repair will probably work for a bit, unless there's a check on version numbers during repair/streaming?
> 4.0 quality testing: Repair
> ---
>
> Key: CASSANDRA-15580
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15580
> Project: Cassandra
> Issue Type: Task
> Components: Test/dtest/python
> Reporter: Josh McKenzie
> Assignee: Alexander Dejanovski
> Priority: Normal
> Fix For: 4.0-rc
>
> Reference [doc from NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#] for context.
> *Shepherd: Alexander Dejanovski*
> We aim for 4.0 to have the first fully functioning incremental repair solution (CASSANDRA-9143)! Furthermore we aim to verify that all types of repair (full range, sub range, incremental) function as expected, as well as ensuring community tools such as Reaper work. CASSANDRA-3200 adds an experimental option to reduce the amount of data streamed during repair; we should write more tests and see how it works with big nodes.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
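Several checks in the test plan table boil down to asserting the `repairedAt` value on SSTable metadata. As a sketch of how a dtest-style helper could assert "no anticompaction (repairedAt==0)" versus the incremental-repair case, assuming `sstablemetadata`-style text output containing a "Repaired at:" line (the helper names are mine, not existing dtest APIs):

```python
import re

def parse_repaired_at(metadata_output):
    # Pull the "Repaired at" value out of sstablemetadata-style output.
    m = re.search(r"Repaired at:\s*(\d+)", metadata_output)
    if m is None:
        raise ValueError("no 'Repaired at' line found")
    return int(m.group(1))

def assert_no_anticompaction(metadata_outputs):
    # Full/subrange repair must not mark SSTables repaired: repairedAt == 0.
    marked = [o for o in metadata_outputs if parse_repaired_at(o) != 0]
    assert not marked, "%d SSTables were anticompacted" % len(marked)

def assert_anticompaction(metadata_outputs):
    # Incremental repair must mark every SSTable: repairedAt != 0.
    unmarked = [o for o in metadata_outputs if parse_repaired_at(o) == 0]
    assert not unmarked, "%d SSTables missing repairedAt" % len(unmarked)

full_repair_sstable = "SSTable: nb-1-big\nRepaired at: 0\n"
incremental_sstable = "SSTable: nb-2-big\nRepaired at: 1602676825000\n"
assert_no_anticompaction([full_repair_sstable])
assert_anticompaction([incremental_sstable])
```

In a real test the metadata outputs would come from running `sstablemetadata` against each node's data directories after the repair completes (and, for incremental repair, after waiting for the asynchronous pending-repair cleanup).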
[jira] [Updated] (CASSANDRA-15580) 4.0 quality testing: Repair
[ https://issues.apache.org/jira/browse/CASSANDRA-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski updated CASSANDRA-15580: - Description: Reference [doc from NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#] for context. *Shepherd: Alexander Dejanovski* We aim for 4.0 to have the first fully functioning incremental repair solution (CASSANDRA-9143)! Furthermore we aim to verify that all types of repair: (full range, sub range, incremental) function as expected as well as ensuring community tools such as Reaper work. CASSANDRA-3200 adds an experimental option to reduce the amount of data streamed during repair, we should write more tests and see how it works with big nodes. (was: the same description with *Shepherd: None*)
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15580) 4.0 quality testing: Repair
[ https://issues.apache.org/jira/browse/CASSANDRA-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211599#comment-17211599 ] Alexander Dejanovski commented on CASSANDRA-15580: -- I'll shepherd this ticket and start designing the test scenarios. [~jmckenzie], regarding the use of Fallout, isn't that conversation supposed to take place in CASSANDRA-15585? If we go down that path (and Fallout looks like a great tool for the job), it means the reviewers and contributors need to get up to speed with it before anything can happen. Time-wise, that could work against fast completion. My understanding also is that the OSS version of Fallout works with k8s exclusively, which would require k8s clusters to be available from CI (just mentioning it as I'm not sure that's something we have yet). Also, where should these tests live in the project? The natural fit for them would be dtests (which would mean using ccm), but running tests with big nodes could be challenging in this environment. Was there a plan to create a new repo or just a new set of dtests? I'd love for us to test repair on nodes with >= 100GB, but generating the data could take quite some time. Using backups could make that part faster if we can get an S3 bucket (or similar) to store the data on. [~mck], you've been spending quite some time on the project CI lately, so your input on what can/cannot be done there would be much appreciated. [~marcuse] [~vinaychella], are you still willing to review the deliverables here? What's your take on tooling and where the tests should live?
[jira] [Updated] (CASSANDRA-15580) 4.0 quality testing: Repair
[ https://issues.apache.org/jira/browse/CASSANDRA-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski updated CASSANDRA-15580: - Fix Version/s: (was: 4.0-beta) 4.0-rc
[jira] [Assigned] (CASSANDRA-15580) 4.0 quality testing: Repair
[ https://issues.apache.org/jira/browse/CASSANDRA-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski reassigned CASSANDRA-15580: Assignee: Alexander Dejanovski
[jira] [Commented] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair
[ https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17210644#comment-17210644 ] Alexander Dejanovski commented on CASSANDRA-15902: -- 3.0 works as expected as well once patched. I'll proceed with code review now.
[jira] [Updated] (CASSANDRA-15584) 4.0 quality testing: Tooling - External Ecosystem
[ https://issues.apache.org/jira/browse/CASSANDRA-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski updated CASSANDRA-15584: - Description: Reference [doc from NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#] for context. *Shepherd: Benjamin Lerer* Many users of Apache Cassandra employ open source tooling to automate Cassandra configuration, runtime management, and repair scheduling. Prior to release, we need to confirm that popular third-party tools function properly. Current list of tools:
||Name||Status||Contact||
|[Priam|http://netflix.github.io/Priam/]|{color:#00875A}*DONE WITH ALPHA*{color} (needs to be tested with beta)|[~sumanth.pasupuleti]|
|[sstabletools|https://github.com/instaclustr/cassandra-sstable-tools]|*NOT STARTED*|[~stefan.miklosovic]|
|[cassandra-exporter|https://github.com/instaclustr/cassandra-exporter]|*NOT STARTED*|[~stefan.miklosovic]|
|[Instaclustr Cassandra operator|https://github.com/instaclustr/cassandra-operator]|{color:#00875A}*DONE*{color}|[~stefan.miklosovic]|
|[Instaclustr Cassandra Backup Restore|https://github.com/instaclustr/cassandra-backup]|{color:#00875A}*DONE*{color}|[~stefan.miklosovic]|
|[Instaclustr Cassandra Sidecar|https://github.com/instaclustr/cassandra-sidecar]|{color:#00875A}*DONE*{color}|[~stefan.miklosovic]|
|[Cassandra SSTable generator|https://github.com/instaclustr/cassandra-sstable-generator]|{color:#00875A}*DONE*{color}|[~stefan.miklosovic]|
|[Cassandra TTL Remover|https://github.com/instaclustr/TTLRemover]|{color:#00875A}*DONE*{color}|[~stefan.miklosovic]|
|[Cassandra Everywhere Strategy|https://github.com/instaclustr/cassandra-everywhere-strategy]|{color:#00875A}*DONE*{color}|[~stefan.miklosovic]|
|[Reaper|http://cassandra-reaper.io/]|{color:#00875A}*AUTOMATIC*{color}|[~adejanovski]|
|[Medusa|https://github.com/thelastpickle/cassandra-medusa]|*IN PROGRESS*|[~adejanovski]|
|[Casskop|https://orange-opensource.github.io/casskop/]|*NOT STARTED*|Franck Dehay|
|[spark-cassandra-connector|https://github.com/datastax/spark-cassandra-connector]|{color:#00875A}*DONE*{color}|[~jtgrabowski]|
|[cass operator|https://github.com/datastax/cass-operator]|{color:#00875A}*DONE*{color}|[~jimdickinson]|
|[metric collector|https://github.com/datastax/metric-collector-for-apache-cassandra]|{color:#00875A}*DONE*{color}|[~tjake]|
|[management API|https://github.com/datastax/management-api-for-apache-cassandra]|{color:#00875A}*DONE*{color}|[~tjake]|
Column descriptions:
* *Name*: Name and link to the tool's official page
* *Status*: {{NOT STARTED}}, {{IN PROGRESS}}, {{BLOCKED}} if you hit any issue and have to wait for it to be solved, {{DONE}}, {{AUTOMATIC}} if testing 4.0 is part of your CI process.
* *Contact*: The person acting as the contact point for that tool.
(was: the same description with Medusa's status at *NOT STARTED*)
[jira] [Commented] (CASSANDRA-15584) 4.0 quality testing: Tooling - External Ecosystem
[ https://issues.apache.org/jira/browse/CASSANDRA-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17210106#comment-17210106 ] Alexander Dejanovski commented on CASSANDRA-15584: -- [~blerer], I've added integration tests for Medusa against trunk, which are now breaking when the sstableloader is used on a cluster with client-to-server encryption: https://github.com/thelastpickle/cassandra-medusa/actions/runs/291725292 I need to investigate this issue more closely and maybe open a JIRA if there's indeed a problem in the sstableloader.
[jira] [Commented] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair
[ https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17206250#comment-17206250 ] Alexander Dejanovski commented on CASSANDRA-15902: -- So far, so good. I've reproduced the issue in 3.11 using a low timeout in Reaper, and repair sessions started to pile up indefinitely:
{code:java}
% x_all "sudo su -s /bin/bash -c \"jstack \$(ps -ef |grep CassandraDaemon |grep -v grep| cut -d' ' -f3) |grep 'Repair#'\" cassandra"
"Repair#11:1" #2193 daemon prio=5 os_prio=0 tid=0x7fe15b19f530 nid=0x74d8 waiting on condition [0x7fe145968000]
"Repair#10:1" #2154 daemon prio=5 os_prio=0 tid=0x7fe16d7eceb0 nid=0x7471 waiting on condition [0x7fe12bf12000]
"Repair#8:1" #2116 daemon prio=5 os_prio=0 tid=0x7fe150316b40 nid=0x73f1 waiting on condition [0x7fe12ce09000]
"Repair#7:1" #2084 daemon prio=5 os_prio=0 tid=0x7fe150162f80 nid=0x73a9 waiting on condition [0x7fe137894000]
"Repair#3:1" #1704 daemon prio=5 os_prio=0 tid=0x7fe10f1b98d0 nid=0x6b9a waiting on condition [0x7fe1428fc000]
"Repair#14:1" #1778 daemon prio=5 os_prio=0 tid=0x565030775bb0 nid=0x6d58 waiting on condition [0x7f8d08659000]
"Repair#9:1" #1573 daemon prio=5 os_prio=0 tid=0x7f8d28770af0 nid=0x6b88 waiting on condition [0x7f8d1ff39000]
"Repair#2:1" #1397 daemon prio=5 os_prio=0 tid=0x7f8d2815eb70 nid=0x6851 waiting on condition [0x7f8d1f9a]
"Repair#1:1" #1375 daemon prio=5 os_prio=0 tid=0x7f8c67dcee40 nid=0x66a8 waiting on condition [0x7f8d1cc6f000]
"Repair#1:1" #2412 daemon prio=5 os_prio=0 tid=0x7fc61d2a38f0 nid=0x6ed9 waiting on condition [0x7fc60736d000]
{code}
Then I built the patched version and waited again for repairs to time out for a little while.
I never got more than one repair thread:
{code:java}
% x_all "sudo su -s /bin/bash -c \"jstack \$(ps -ef |grep CassandraDaemon |grep -v grep| cut -d' ' -f2) |grep 'Repair#'\" cassandra"
"Repair#21:1" #682 daemon prio=5 os_prio=0 tid=0x7f249854cc10 nid=0x7ced waiting on condition [0x7f246f779000]
{code}
I'm currently checking that repairs still go through as expected with a regular timeout; that is still running. Once that's done, I'll check again against 3.0 and then perform a code review.
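The jstack/grep/wc thread-count check used in the comments above can also be scripted so a test can assert on it directly; a minimal sketch (assuming a captured thread dump as plain text; the helper name is mine):

```python
def count_repair_threads(thread_dump):
    # Equivalent of: grep "Repair#" threaddump.txt | wc -l
    return sum(1 for line in thread_dump.splitlines() if "Repair#" in line)

dump = (
    '"Repair#11:1" #2193 daemon prio=5 os_prio=0 waiting on condition\n'
    '"ReadStage-1" #42 daemon prio=5 os_prio=0 runnable\n'
    '"Repair#10:1" #2154 daemon prio=5 os_prio=0 waiting on condition\n'
)
# After the patch, this count should stay at (or quickly return to) the
# number of repairs actually running instead of growing without bound
# after each force terminate.
print(count_repair_threads(dump))  # → 2
```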
[jira] [Updated] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair
[ https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski updated CASSANDRA-15902: - Reviewers: Alexander Dejanovski, Alexander Dejanovski (was: Alexander Dejanovski) Alexander Dejanovski, Alexander Dejanovski Status: Review In Progress (was: Patch Available) Starting testing and review. > OOM because repair session thread not closed when terminating repair > > > Key: CASSANDRA-15902 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15902 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Repair >Reporter: Swen Fuhrmann >Assignee: Swen Fuhrmann >Priority: Normal > Fix For: 3.0.x, 3.11.x > > Attachments: heap-mem-histo.txt, repair-terminated.txt > > > In our cluster, after a while some nodes running slowly out of memory. On > that nodes we observed that Cassandra Reaper terminate repairs with a JMX > call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} because > reaching timeout of 30 min. > In the memory heap dump we see lot of instances of > {{io.netty.util.concurrent.FastThreadLocalThread}} occupy most of the memory: > {noformat} > 119 instances of "io.netty.util.concurrent.FastThreadLocalThread", loaded by > "sun.misc.Launcher$AppClassLoader @ 0x51a80" occupy 8.445.684.480 (93,96 > %) bytes. 
{noformat} > In the thread dump we see a lot of repair threads: > {noformat} > grep "Repair#" threaddump.txt | wc -l > 50 {noformat} > > The repair jobs are waiting for the validation to finish: > {noformat} > "Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x12fc5000 > nid=0x542a waiting on condition [0x7f81ee414000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x0007939bcfc8> (a > com.google.common.util.concurrent.AbstractFuture$Sync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) > at > com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285) > at > com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) > at > com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137) > at > com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1509) > at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:160) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at > org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) > at > org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$13/480490520.run(Unknown > Source) > at java.lang.Thread.run(Thread.java:748) {noformat} > > That's the line where the threads get stuck: > {noformat} > // Wait for validation to complete > Futures.getUnchecked(validations); {noformat} > > The call to 
{{StorageServiceMBean.forceTerminateAllRepairSessions()}} stops > the thread pool executor. It looks like futures which are in progress > will therefore never be completed, so the repair thread waits forever and > is never finished. > > Environment: > Cassandra version: 3.11.4 and 3.11.6 > Cassandra Reaper: 1.4.0 > JVM memory settings: > {noformat} > -Xms11771M -Xmx11771M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 > -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat} > on another cluster with the same issue: > {noformat} > -Xms31744M -Xmx31744M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 > -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat} > Java Runtime: > {noformat} > openjdk version "1.8.0_212" > OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b03) > OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b03, mixed mode) > {noformat} > > The same issue is described in this comment: > https://issues.apache.org/jira/browse/CASSANDRA-14355?focusedCommentId=16992973&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16992973 > As suggested in the comments I created this new specific ticket. -- This message was sent by Atlassian Jira (v8.3.4#803005)
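The failure mode described above — a repair thread parked forever on a validation future whose executor was torn down — can be illustrated outside Cassandra. The following is a minimal Python sketch (the real code blocks in Java on Guava's `Futures.getUnchecked(validations)`, not on `concurrent.futures`): shutting an executor down does not complete its in-flight futures, so any thread waiting on them stays blocked.

```python
import concurrent.futures as cf
import threading

executor = cf.ThreadPoolExecutor(max_workers=1)
release = threading.Event()

# A "validation" task that never finishes on its own, standing in for
# a validation still in flight when the repair is force-terminated.
validation = executor.submit(release.wait)

# forceTerminateAllRepairSessions() effectively does this: stop the
# executor without completing or failing the pending work.
executor.shutdown(wait=False)

# The repair thread would block forever on validation.result(); using a
# timeout lets us observe that the future is never completed.
try:
    validation.result(timeout=0.5)
    outcome = "completed"
except cf.TimeoutError:
    outcome = "still blocked after terminate"

print(outcome)   # -> still blocked after terminate
release.set()    # unblock the worker so the interpreter can exit cleanly
```

One natural remedy (sketched here as a direction, not necessarily the committed patch) is to also cancel or fail the outstanding validation futures on termination, so that waiters wake up instead of leaking.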
[jira] [Commented] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair
[ https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203277#comment-17203277 ] Alexander Dejanovski commented on CASSANDRA-15902: -- Hi [~moczarski], I'm aware of similar reports regarding repair sessions not being cleaned up correctly. I'll happily test this patch and perform a review. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (CASSANDRA-13701) Lower default num_tokens
[ https://issues.apache.org/jira/browse/CASSANDRA-13701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179700#comment-17179700 ] Alexander Dejanovski commented on CASSANDRA-13701: -- New CI run with some additional adjustments in timings [here|https://ci-cassandra.apache.org/blue/organizations/jenkins/Cassandra-devbranch/detail/Cassandra-devbranch/248/tests]. The last failing test is fixed in trunk by [this commit|https://github.com/apache/cassandra/commit/c94ececec0fcd87459858370396d6cd586853787]. It's unrelated to this ticket. I've squashed the commits in [my cassandra-dtest branch|https://github.com/adejanovski/cassandra-dtest/tree/CASSANDRA-13701], but I still need to drop the commit that points to [the patched version of ccm|https://github.com/adejanovski/ccm/tree/CASSANDRA-13701]. Let's wait for the conversation to settle in the ASF Slack before moving on here. Maybe we should re-run CI again to see if we have some flaky tests that would be related to this ticket? > Lower default num_tokens > > > Key: CASSANDRA-13701 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13701 > Project: Cassandra > Issue Type: Improvement > Components: Local/Config >Reporter: Chris Lohfink >Assignee: Alexander Dejanovski >Priority: Low > Fix For: 4.0-alpha > > > For reasons highlighted in CASSANDRA-7032, the high number of vnodes is not > necessary. It is very expensive for operational processes and scanning. This > comes up a lot, and it is now standard practice in the community to reduce > num_tokens. We should just lower the defaults. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
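For context, lowering the default means changing two related settings in cassandra.yaml. A hypothetical example follows — the value 16 is illustrative only, not necessarily the default this ticket settles on:

```yaml
# cassandra.yaml (illustrative values, not the committed defaults)
# Fewer vnodes per node; the old default was 256.
num_tokens: 16
# With few tokens, random allocation can produce unbalanced ownership,
# so the change is paired with the RF-aware token allocation algorithm
# discussed elsewhere in this ticket:
allocate_tokens_for_local_replication_factor: 3
```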
[jira] [Commented] (CASSANDRA-13701) Lower default num_tokens
[ https://issues.apache.org/jira/browse/CASSANDRA-13701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177965#comment-17177965 ] Alexander Dejanovski commented on CASSANDRA-13701: -- I've identified several issues today: * ccm uses a hardcoded 30s timeout when waiting for events (like nodes starting), which doesn't work with the additional wait times that come with the new token allocation algorithm. Fix is [here|https://github.com/riptano/ccm/commit/8a91a5aa49473211863a1fb7a980206e5222ce5d]. * ccm starts all nodes at the same time when cluster.start() is invoked, which creates clashes when the new token allocation algorithm is used and makes some tests flaky. Starting them sequentially using [this fix|https://github.com/riptano/ccm/commit/e6e4abcff375debde8195104c5cffd1cecb8d6cf] allowed all the bootstrap dtests to pass. * [~jeromatron]'s branch is missing some commits from the current trunk that fix other failing dtests. Rebasing it over trunk is necessary to get them all to pass. * Adding a few seconds of sleep in [bootstrap_test.py::TestBootstrap::test_simultaneous_bootstrap|https://github.com/adejanovski/cassandra-dtest/blob/master/bootstrap_test.py#L769-L771] allows the test to pass. I'm currently rerunning all dtests with the various fixes to see if I still get failures. I'll follow up on Monday and hopefully push PRs to ccm and cassandra-dtest that will allow the patch to be applied (there are conflicts, though, so a rebase will be necessary). A follow-up discussion and ticket will probably be necessary because the new token allocation algorithm and concurrent bootstraps aren't working nicely together. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13701) Lower default num_tokens
[ https://issues.apache.org/jira/browse/CASSANDRA-13701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176806#comment-17176806 ] Alexander Dejanovski commented on CASSANDRA-13701: -- Thanks [~brandon.williams], that's valuable information and I can move on to fixing the other tests. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13701) Lower default num_tokens
[ https://issues.apache.org/jira/browse/CASSANDRA-13701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176483#comment-17176483 ] Alexander Dejanovski commented on CASSANDRA-13701: -- Quick update: I was able to make *bootstrap_test.py::TestBootstrap::test_simultaneous_bootstrap* pass with this branch. The test assumes that both starting nodes will see each other when they check for endpoint collision. But if the nodes start at exactly (or roughly) the same time, they can both perform the check while neither of them is gossiping yet, meaning only node1 is part of the ring, which allows both of them to get tokens and start bootstrapping. Since there's a 30s pause waiting for gossip to settle, adding a 10s pause between node2 and node3 startup allows us to "luckily" avoid the race condition. The code is not bulletproof against this scenario though. I still wonder why this only happens with the new token allocation algorithm. Furthermore, tests are executed with num_tokens = 1, which makes it fairly fast to pick a token. It seems the orchestration differs between random token allocation and RF-based allocation, which makes the race condition more obvious. I'll check the other failing tests tomorrow to see if we're dealing with the same problems. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
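The race described in the comment above reduces to a toy model: each joining node takes a snapshot of the ring for its endpoint-collision check, and if both snapshots are taken before either node starts gossiping, both checks pass. The sketch below is an illustration of that timing issue only — it is not the actual dtest or Cassandra code:

```python
def collision_check(ring_snapshot):
    """A node may bootstrap only if no other node is already joining."""
    return not any(member.startswith("joining:") for member in ring_snapshot)

# Sequential start (what the added 10s pause approximates):
# node2 checks, passes, and is gossiping before node3 checks.
ring = {"node1"}
assert collision_check(ring)            # node2 passes
ring.add("joining:node2")               # node2 is now visible via gossip
assert not collision_check(ring)        # node3 is correctly rejected

# Simultaneous start: both nodes snapshot the ring before either
# gossips, so both pass the check and both begin bootstrapping.
ring = {"node1"}
snapshot_node2 = set(ring)
snapshot_node3 = set(ring)
assert collision_check(snapshot_node2)  # node2 passes
assert collision_check(snapshot_node3)  # node3 also passes -> race
print("both nodes passed the collision check")
```

The sleep only narrows the window; as the comment notes, the check itself is not bulletproof against this interleaving.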
[jira] [Commented] (CASSANDRA-13701) Lower default num_tokens
[ https://issues.apache.org/jira/browse/CASSANDRA-13701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176340#comment-17176340 ] Alexander Dejanovski commented on CASSANDRA-13701: -- [~jeromatron], I'm picking this up. Initial observation is that on the test_simultaneous_bootstrap test, node2 manages to bootstrap before node3 gets a chance to get kicked off. I'll go through the Cassandra code paths of bootstrap in order to understand how the new token allocation algorithm impacts us here. Will send an update soon. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Assigned] (CASSANDRA-13701) Lower default num_tokens
[ https://issues.apache.org/jira/browse/CASSANDRA-13701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski reassigned CASSANDRA-13701: Assignee: Alexander Dejanovski -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15878) Ec2Snitch fails on upgrade in legacy mode
[ https://issues.apache.org/jira/browse/CASSANDRA-15878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski updated CASSANDRA-15878: - Test and Documentation Plan: unit tests added Status: Patch Available (was: In Progress) > Ec2Snitch fails on upgrade in legacy mode > - > > Key: CASSANDRA-15878 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15878 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Distributed Metadata >Reporter: Alexander Dejanovski >Assignee: Alexander Dejanovski >Priority: Normal > Fix For: 4.0-beta > > > CASSANDRA-7839 changed the way the EC2 DC/Rack naming was handled in the > Ec2Snitch to match AWS conventions. > The "legacy" mode was introduced to allow upgrades from Cassandra 3.0/3.x and > keep the same naming as before (while the "standard" mode uses the new naming > convention). > When performing an upgrade in the us-west-2 region, the second node failed to > start with the following exception: > > {code:java} > ERROR [main] 2020-06-16 09:14:42,218 Ec2Snitch.java:210 - This ec2-enabled > snitch appears to be using the legacy naming scheme for regions, but existing > nodes in cluster are using the opposite: region(s) = [us-west-2], > availability zone(s) = [2a]. Please check the ec2_naming_scheme property in > the cassandra-rackdc.properties configuration file for more details. 
> ERROR [main] 2020-06-16 09:14:42,219 CassandraDaemon.java:789 - Exception > encountered during startup > java.lang.IllegalStateException: null > at > org.apache.cassandra.service.StorageService.validateEndpointSnitch(StorageService.java:573) > at > org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:530) > at > org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:800) > at > org.apache.cassandra.service.StorageService.initServer(StorageService.java:659) > at > org.apache.cassandra.service.StorageService.initServer(StorageService.java:610) > at > org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:373) > at > org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:650) > at > org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:767) > {code} > > The exception leads back to [this piece of > code|https://github.com/apache/cassandra/blob/cassandra-4.0-alpha4/src/java/org/apache/cassandra/locator/Ec2Snitch.java#L183-L185]. > After adding some logging, it turned out the DC name of the first upgraded > node was considered invalid as a legacy one: > {code:java} > INFO [main] 2020-06-16 09:14:42,216 Ec2Snitch.java:183 - Detected DC > us-west-2 > INFO [main] 2020-06-16 09:14:42,217 Ec2Snitch.java:185 - > dcUsesLegacyFormat=false / usingLegacyNaming=true > ERROR [main] 2020-06-16 09:14:42,217 Ec2Snitch.java:188 - Invalid DC name > us-west-2 > {code} > > The problem is that the regex that's used to identify legacy dc names will > match both old and new names : > {code:java} > boolean dcUsesLegacyFormat = !dc.matches("[a-z]+-[a-z].+-[\\d].*"); > {code} > Knowing that some dc names didn't change between the two modes (us-west-2 for > example), I don't see how we can use the dc names to detect if the legacy > mode is being used by other nodes in the cluster. 
> > The rack names on the other hand are totally different in the legacy and > standard modes and can be used to detect mismatching settings. > > My go-to fix would be to drop the check on datacenters by removing the > following lines: > [https://github.com/apache/cassandra/blob/cassandra-4.0-alpha4/src/java/org/apache/cassandra/locator/Ec2Snitch.java#L172-L186] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
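The description's point about the regex can be checked directly. Java's `String.matches` anchors the whole pattern, so the equivalent Python check uses `re.fullmatch`. This sketch shows that `us-west-2` satisfies the "new-style" pattern even though it is also a valid legacy DC name for that region:

```python
import re

# Java: boolean dcUsesLegacyFormat = !dc.matches("[a-z]+-[a-z].+-[\\d].*");
def dc_uses_legacy_format(dc):
    return re.fullmatch(r"[a-z]+-[a-z].+-[\d].*", dc) is None

print(dc_uses_legacy_format("us-east"))    # True: unambiguously legacy
print(dc_uses_legacy_format("us-west-2"))  # False: classified as "new",
# yet us-west-2 is unchanged between the two naming schemes, so a legacy
# cluster in that region is misclassified and fails the startup check.
```

This is why the DC-name heuristic cannot distinguish the schemes on its own, while rack names (e.g. `2a` vs `us-west-2a`) can.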
[jira] [Commented] (CASSANDRA-15878) Ec2Snitch fails on upgrade in legacy mode
[ https://issues.apache.org/jira/browse/CASSANDRA-15878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147566#comment-17147566 ] Alexander Dejanovski commented on CASSANDRA-15878: -- Thanks for the feedback [~jolynch]. I've added back the DC name check and adjusted it as suggested. I provided accurate information on which cases we're actually covering now with this check. [~mck], I've reintroduced the unit tests that I had deleted and changed the assertions where needed. You can check the changes here: [https://github.com/apache/cassandra/compare/trunk...thelastpickle:CASSANDRA-15878] Let me know what you think. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15878) Ec2Snitch fails on upgrade in legacy mode
[ https://issues.apache.org/jira/browse/CASSANDRA-15878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17146442#comment-17146442 ] Alexander Dejanovski commented on CASSANDRA-15878: -- I've pushed a commit with a potential fix and updated unit tests: [https://github.com/apache/cassandra/pull/653/commits/7a53846a217102143ae56416ebcf534c59de93e6] [~jolynch], I'd love to have your input on this since you reviewed the original ticket that brought this change. Are there cases I'm not seeing where the dc name would be useful to check? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Assigned] (CASSANDRA-15878) Ec2Snitch fails on upgrade in legacy mode
[ https://issues.apache.org/jira/browse/CASSANDRA-15878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski reassigned CASSANDRA-15878: Assignee: Alexander Dejanovski > Ec2Snitch fails on upgrade in legacy mode > - > > Key: CASSANDRA-15878 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15878 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Distributed Metadata >Reporter: Alexander Dejanovski >Assignee: Alexander Dejanovski >Priority: Normal > Fix For: 4.0-beta > > > CASSANDRA-7839 changed the way the EC2 DC/Rack naming was handled in the > Ec2Snitch to match AWS conventions. > The "legacy" mode was introduced to allow upgrades from Cassandra 3.0/3.x and > keep the same naming as before (while the "standard" mode uses the new naming > convention). > When performing an upgrade in the us-west-2 region, the second node failed to > start with the following exception: > > {code:java} > ERROR [main] 2020-06-16 09:14:42,218 Ec2Snitch.java:210 - This ec2-enabled > snitch appears to be using the legacy naming scheme for regions, but existing > nodes in cluster are using the opposite: region(s) = [us-west-2], > availability zone(s) = [2a]. Please check the ec2_naming_scheme property in > the cassandra-rackdc.properties configuration file for more details. 
> ERROR [main] 2020-06-16 09:14:42,219 CassandraDaemon.java:789 - Exception > encountered during startup > java.lang.IllegalStateException: null > at > org.apache.cassandra.service.StorageService.validateEndpointSnitch(StorageService.java:573) > at > org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:530) > at > org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:800) > at > org.apache.cassandra.service.StorageService.initServer(StorageService.java:659) > at > org.apache.cassandra.service.StorageService.initServer(StorageService.java:610) > at > org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:373) > at > org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:650) > at > org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:767) > {code} > > The exception leads back to [this piece of > code|https://github.com/apache/cassandra/blob/cassandra-4.0-alpha4/src/java/org/apache/cassandra/locator/Ec2Snitch.java#L183-L185]. > After adding some logging, it turned out the DC name of the first upgraded > node was considered invalid as a legacy one: > {code:java} > INFO [main] 2020-06-16 09:14:42,216 Ec2Snitch.java:183 - Detected DC > us-west-2 > INFO [main] 2020-06-16 09:14:42,217 Ec2Snitch.java:185 - > dcUsesLegacyFormat=false / usingLegacyNaming=true > ERROR [main] 2020-06-16 09:14:42,217 Ec2Snitch.java:188 - Invalid DC name > us-west-2 > {code} > > The problem is that the regex that's used to identify legacy dc names will > match both old and new names : > {code:java} > boolean dcUsesLegacyFormat = !dc.matches("[a-z]+-[a-z].+-[\\d].*"); > {code} > Knowing that some dc names didn't change between the two modes (us-west-2 for > example), I don't see how we can use the dc names to detect if the legacy > mode is being used by other nodes in the cluster. 
> > The rack names on the other hand are totally different in the legacy and standard modes and can be used to detect mismatching settings. > > My go-to fix would be to drop the check on datacenters by removing the following lines: > [https://github.com/apache/cassandra/blob/cassandra-4.0-alpha4/src/java/org/apache/cassandra/locator/Ec2Snitch.java#L172-L186] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-15878) Ec2Snitch fails on upgrade in legacy mode
Alexander Dejanovski created CASSANDRA-15878: Summary: Ec2Snitch fails on upgrade in legacy mode Key: CASSANDRA-15878 URL: https://issues.apache.org/jira/browse/CASSANDRA-15878 Project: Cassandra Issue Type: Bug Reporter: Alexander Dejanovski CASSANDRA-7839 changed the way the EC2 DC/Rack naming was handled in the Ec2Snitch to match AWS conventions. The "legacy" mode was introduced to allow upgrades from Cassandra 3.0/3.x and keep the same naming as before (while the "standard" mode uses the new naming convention). When performing an upgrade in the us-west-2 region, the second node failed to start with the following exception: {code:java} ERROR [main] 2020-06-16 09:14:42,218 Ec2Snitch.java:210 - This ec2-enabled snitch appears to be using the legacy naming scheme for regions, but existing nodes in cluster are using the opposite: region(s) = [us-west-2], availability zone(s) = [2a]. Please check the ec2_naming_scheme property in the cassandra-rackdc.properties configuration file for more details. 
ERROR [main] 2020-06-16 09:14:42,219 CassandraDaemon.java:789 - Exception encountered during startup java.lang.IllegalStateException: null at org.apache.cassandra.service.StorageService.validateEndpointSnitch(StorageService.java:573) at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:530) at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:800) at org.apache.cassandra.service.StorageService.initServer(StorageService.java:659) at org.apache.cassandra.service.StorageService.initServer(StorageService.java:610) at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:373) at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:650) at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:767) {code} The exception leads back to [this piece of code|https://github.com/apache/cassandra/blob/cassandra-4.0-alpha4/src/java/org/apache/cassandra/locator/Ec2Snitch.java#L183-L185]. After adding some logging, it turned out the DC name of the first upgraded node was considered invalid as a legacy one: {code:java} INFO [main] 2020-06-16 09:14:42,216 Ec2Snitch.java:183 - Detected DC us-west-2 INFO [main] 2020-06-16 09:14:42,217 Ec2Snitch.java:185 - dcUsesLegacyFormat=false / usingLegacyNaming=true ERROR [main] 2020-06-16 09:14:42,217 Ec2Snitch.java:188 - Invalid DC name us-west-2 {code} The problem is that the regex that's used to identify legacy dc names will match both old and new names : {code:java} boolean dcUsesLegacyFormat = !dc.matches("[a-z]+-[a-z].+-[\\d].*"); {code} Knowing that some dc names didn't change between the two modes (us-west-2 for example), I don't see how we can use the dc names to detect if the legacy mode is being used by other nodes in the cluster. The rack names on the other hand are totally different in the legacy and standard modes and can be used to detect mismatching settings. 
> My go-to fix would be to drop the check on datacenters by removing the following lines: [https://github.com/apache/cassandra/blob/cassandra-4.0-alpha4/src/java/org/apache/cassandra/locator/Ec2Snitch.java#L172-L186]
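The ambiguity described above can be reproduced outside of Cassandra. The class below is a standalone sketch (not Cassandra code); only the regex string is taken from the Ec2Snitch snippet quoted in the report:

```java
public class Ec2NamingCheck {
    public static void main(String[] args) {
        // Regex used by Ec2Snitch to classify a DC name: a match means
        // "standard" naming, a non-match means "legacy" naming.
        String standardDcPattern = "[a-z]+-[a-z].+-[\\d].*";

        // In us-west-2 the legacy and standard DC names are identical, so
        // the DC-based check cannot tell the two schemes apart:
        System.out.println("us-west-2".matches(standardDcPattern));  // true

        // Rack names, however, do differ between the schemes
        // (legacy "2a" vs standard "us-west-2a"):
        System.out.println("2a".matches(standardDcPattern));         // false
        System.out.println("us-west-2a".matches(standardDcPattern)); // true
    }
}
```

This supports the proposed fix: DC names are ambiguous in regions such as us-west-2, while rack names differ between the two schemes and remain usable for detecting mismatched settings.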
[jira] [Commented] (CASSANDRA-15661) Improve logging by using more appropriate levels
[ https://issues.apache.org/jira/browse/CASSANDRA-15661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072443#comment-17072443 ] Alexander Dejanovski commented on CASSANDRA-15661: -- Thanks for adding more info on the native connection limit logging. I checked the test results and indeed I don't see how changing the logging levels could be responsible for the DTests failure here. The patch looks good to me now (y) > Improve logging by using more appropriate levels > - > > Key: CASSANDRA-15661 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15661 > Project: Cassandra > Issue Type: Improvement > Components: Observability/Logging >Reporter: Jon Haddad >Assignee: Jon Haddad >Priority: Normal > Labels: pull-request-available > Time Spent: 2h 20m > Remaining Estimate: 0h > > There are a number of log statements using logging levels that are a bit too > conservative. For example: > * Flushing memtables is currently at debug. This is a relatively rare event > that is important enough to be INFO > * When compaction finishes we log the progress at debug > * Different steps in incremental repair are logged as debug, should be INFO > * when reaching connection limits in ConnectionLimitHandler.java we log at > warn rather than error. Since this is a client disconnect it’s more than a > warning, we’re taking action and disconnecting.
[jira] [Commented] (CASSANDRA-15661) Improve logging by using more appropriate levels
[ https://issues.apache.org/jira/browse/CASSANDRA-15661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068580#comment-17068580 ] Alexander Dejanovski commented on CASSANDRA-15661: -- [~rustyrazorblade], I'm overall super happy to get more logging back in system.log. Having all of what's happening in flushes and compactions at debug level made my life as an operator much harder over the past years. I added a few comments on the PR regarding some that may not be a good fit for INFO level. They look more like ways to actually debug some implementation details around repair. Let me know what you think. > Improve logging by using more appropriate levels > - > > Key: CASSANDRA-15661 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15661 > Project: Cassandra > Issue Type: Improvement > Components: Observability/Logging >Reporter: Jon Haddad >Assignee: Jon Haddad >Priority: Normal > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > There are a number of log statements using logging levels that are a bit too > conservative. For example: > * Flushing memtables is currently at debug. This is a relatively rare event > that is important enough to be INFO > * When compaction finishes we log the progress at debug > * Different steps in incremental repair are logged as debug, should be INFO > * when reaching connection limits in ConnectionLimitHandler.java we log at > warn rather than error. Since this is a client disconnect it’s more than a > warning, we’re taking action and disconnecting.
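The level changes under review can be illustrated with a minimal, self-contained sketch. Cassandra itself logs through SLF4J/logback; java.util.logging is used here only to avoid external dependencies, and the messages are made up:

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class LogLevelDemo {
    public static void main(String[] args) {
        Logger log = Logger.getLogger("repair");
        log.setUseParentHandlers(false);
        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.INFO); // a typical operator-facing default
        log.addHandler(handler);
        log.setLevel(Level.ALL);

        // At debug level, rare-but-important events are invisible with a
        // default handler configuration:
        log.fine("Flushing memtable for tbl");  // filtered out by the handler
        // Promoting them to INFO makes them show up in the operator log:
        log.info("Flushing memtable for tbl");  // emitted
    }
}
```

This is the crux of the ticket: events like memtable flushes are rare enough that logging them at the operator-visible level costs little and helps a lot during incident analysis.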
[jira] [Commented] (CASSANDRA-15661) Improve logging by using more appropriate levels
[ https://issues.apache.org/jira/browse/CASSANDRA-15661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068385#comment-17068385 ] Alexander Dejanovski commented on CASSANDRA-15661: -- Starting the review on this ticket. > Improve logging by using more appropriate levels > - > > Key: CASSANDRA-15661 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15661 > Project: Cassandra > Issue Type: Improvement > Components: Observability/Logging >Reporter: Jon Haddad >Assignee: Jon Haddad >Priority: Normal > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > There are a number of log statements using logging levels that are a bit too > conservative. For example: > * Flushing memtables is currently at debug. This is a relatively rare event > that is important enough to be INFO > * When compaction finishes we log the progress at debug > * Different steps in incremental repair are logged as debug, should be INFO > * when reaching connection limits in ConnectionLimitHandler.java we log at > warn rather than error. Since this is a client disconnect it’s more than a > warning, we’re taking action and disconnecting.
[jira] [Commented] (CASSANDRA-11105) cassandra-stress tool - InvalidQueryException: Batch too large
[ https://issues.apache.org/jira/browse/CASSANDRA-11105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026822#comment-17026822 ] Alexander Dejanovski commented on CASSANDRA-11105: -- I agree with [~mck]. The code has evolved too much anyway since my patch was written, and internally we've moved our efforts to a cassandra-stress replacement tool. Happy to have the ticket closed as "won't do". > cassandra-stress tool - InvalidQueryException: Batch too large > -- > > Key: CASSANDRA-11105 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11105 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Tools > Environment: Cassandra 2.2.4, Java 8, CentOS 6.5 >Reporter: Ralf Steppacher >Priority: Normal > Fix For: 4.0 > > Attachments: 11105-trunk.txt, batch_too_large.yaml > > > I am using Cassandra 2.2.4 and I am struggling to get the cassandra-stress > tool to work for my test scenario. I have followed the example on > http://www.datastax.com/dev/blog/improved-cassandra-2-1-stress-tool-benchmark-any-schema > to create a yaml file describing my test (attached). > I am collecting events per user id (text, partition key). Events have a > session type (text), event type (text), and creation time (timestamp) > (clustering keys, in that order). Plus some more attributes required for > rendering the events in a UI. 
For testing purposes I ended up with the > following column spec and insert distribution: > {noformat} > columnspec: > - name: created_at > cluster: uniform(10..1) > - name: event_type > size: uniform(5..10) > population: uniform(1..30) > cluster: uniform(1..30) > - name: session_type > size: fixed(5) > population: uniform(1..4) > cluster: uniform(1..4) > - name: user_id > size: fixed(15) > population: uniform(1..100) > - name: message > size: uniform(10..100) > population: uniform(1..100B) > insert: > partitions: fixed(1) > batchtype: UNLOGGED > select: fixed(1)/120 > {noformat} > Running stress tool for just the insert prints > {noformat} > Generating batches with [1..1] partitions and [0..1] rows (of [10..120] > total rows in the partitions) > {noformat} > and then immediately starts flooding me with > {{com.datastax.driver.core.exceptions.InvalidQueryException: Batch too > large}}. > Why I should be exceeding the {{batch_size_fail_threshold_in_kb: 50}} in the > {{cassandra.yaml}} I do not understand. My understanding is that the stress > tool should generate one row per batch. The size of a single row should not > exceed {{8+10*3+5*3+15*3+100*3 = 398 bytes}}. Assuming a worst case of all > text characters being 3 byte unicode characters. > This is how I start the attached user scenario: > {noformat} > [rsteppac@centos bin]$ ./cassandra-stress user > profile=../batch_too_large.yaml ops\(insert=1\) -log level=verbose > file=~/centos_event_by_patient_session_event_timestamp_insert_only.log -node > 10.211.55.8 > INFO 08:00:07 Did not find Netty's native epoll transport in the classpath, > defaulting to NIO. > INFO 08:00:08 Using data-center name 'datacenter1' for > DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct > datacenter name with DCAwareRoundRobinPolicy constructor) > INFO 08:00:08 New Cassandra host /10.211.55.8:9042 added > Connected to cluster: Titan_DEV > Datatacenter: datacenter1; Host: /10.211.55.8; Rack: rack1 > Created schema. 
Sleeping 1s for propagation. > Generating batches with [1..1] partitions and [0..1] rows (of [10..120] > total rows in the partitions) > com.datastax.driver.core.exceptions.InvalidQueryException: Batch too large > at > com.datastax.driver.core.exceptions.InvalidQueryException.copy(InvalidQueryException.java:35) > at > com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:271) > at > com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:185) > at > com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:55) > at > org.apache.cassandra.stress.operations.userdefined.SchemaInsert$JavaDriverRun.run(SchemaInsert.java:87) > at > org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:159) > at > org.apache.cassandra.stress.operations.userdefined.SchemaInsert.run(SchemaInsert.java:119) > at > org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:309) > Caused by: com.datastax.driver.core.exceptions.InvalidQueryException: Batch > too large > at > com.datastax.driver.core.Responses$Error.asException(Responses.java:125) > at >
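The reporter's worst-case row-size estimate (398 bytes, quoted above) checks out. A sketch of the arithmetic, assuming the worst case of 3-byte UTF-8 characters for every text field:

```java
public class RowSizeEstimate {
    public static void main(String[] args) {
        int createdAt   = 8;        // timestamp column, fixed width
        int eventType   = 10 * 3;   // up to 10 chars, 3 bytes each worst case
        int sessionType = 5 * 3;    // fixed(5)
        int userId      = 15 * 3;   // fixed(15)
        int message     = 100 * 3;  // uniform(10..100)
        int total = createdAt + eventType + sessionType + userId + message;
        // total = 398 bytes, nowhere near batch_size_fail_threshold_in_kb: 50,
        // which is why a one-row-per-batch run should never trip the threshold.
        System.out.println(total);
    }
}
```

That the threshold is exceeded anyway suggests the tool is putting far more than one generated row into each batch, which is the behavior the ticket questions.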
[jira] [Commented] (CASSANDRA-14685) Incremental repair 4.0 : SSTables remain locked forever if the coordinator dies during streaming
[ https://issues.apache.org/jira/browse/CASSANDRA-14685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16619273#comment-16619273 ] Alexander Dejanovski commented on CASSANDRA-14685: -- Hi [~bdeggleston], Fair enough, I reckon I didn't wait that long for the SSTables to be released. If the SSTables get released eventually and you can't detect all types of failures to release them, I guess it would be worth failing a repair if some SSTables with overlapping token ranges are still part of another repair session. Otherwise, you're left with the impression that running a repair would work correctly although some SSTables were skipped (and will be rolled back later). Wdyt? Advising to use "nodetool repair_admin" in the error message would help discover this new command. Stopping the session using it did the trick and the SSTables were released as expected. One weird behavior of streaming is that when the coordinator goes down, "nodetool netstats" still shows progress on the replicas until it reaches 100% and it stays like this. It even starts streaming new files although the target node is still down. 
> Incremental repair 4.0 : SSTables remain locked forever if the coordinator > dies during streaming > - > > Key: CASSANDRA-14685 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14685 > Project: Cassandra > Issue Type: Bug > Components: Repair >Reporter: Alexander Dejanovski >Assignee: Jason Brown >Priority: Critical > > The changes in CASSANDRA-9143 modified the way incremental repair performs by > applying the following sequence of events : > * Anticompaction is executed on all replicas for all SSTables overlapping > the repaired ranges > * Anticompacted SSTables are then marked as "Pending repair" and cannot be > compacted anymore, nor part of another repair session > * Merkle trees are generated and compared > * Streaming takes place if needed > * Anticompaction is committed and "pending repair" table are marked as > repaired if it succeeded, or they are released if the repair session failed. > If the repair coordinator dies during the streaming phase, *the SSTables on > the replicas will remain in "pending repair" state and will never be eligible > for repair or compaction*, even after all the nodes in the cluster are > restarted. > Steps to reproduce (I've used Jason's 13938 branch that fixes streaming > errors) : > {noformat} > ccm create inc-repair-issue -v github:jasobrown/13938 -n 3 > # Allow jmx access and remove all rpc_ settings in yaml > for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra-env.sh; > do > sed -i'' -e > 's/com.sun.management.jmxremote.authenticate=true/com.sun.management.jmxremote.authenticate=false/g' > $f > done > for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra.yaml; > do > grep -v "rpc_" $f > ${f}.tmp > cat ${f}.tmp > $f > done > ccm start > {noformat} > I used [tlp-stress|https://github.com/thelastpickle/tlp-stress] to generate a > few 10s of MBs of data (killed it after some time). 
Obviously > cassandra-stress works as well : > {noformat} > bin/tlp-stress run BasicTimeSeries -i 1M -p 1M -t 2 --rate 5000 > --replication "{'class':'SimpleStrategy', 'replication_factor':2}" > --compaction "{'class': 'SizeTieredCompactionStrategy'}" --host > 127.0.0.1 > {noformat} > Flush and delete all SSTables in node1 : > {noformat} > ccm node1 nodetool flush > ccm node1 stop > rm -f ~/.ccm/inc-repair-issue/node1/data0/tlp_stress/sensor*/*.* > ccm node1 start{noformat} > Then throttle streaming throughput to 1MB/s so we have time to take node1 > down during the streaming phase and run repair: > {noformat} > ccm node1 nodetool setstreamthroughput 1 > ccm node2 nodetool setstreamthroughput 1 > ccm node3 nodetool setstreamthroughput 1 > ccm node1 nodetool repair tlp_stress > {noformat} > Once streaming starts, shut down node1 and start it again : > {noformat} > ccm node1 stop > ccm node1 start > {noformat} > Run repair again : > {noformat} > ccm node1 nodetool repair tlp_stress > {noformat} > The command will return very quickly, showing that it skipped all sstables : > {noformat} > [2018-08-31 19:05:16,292] Repair completed successfully > [2018-08-31 19:05:16,292] Repair command #1 finished in 2 seconds > $ ccm node1 nodetool status > Datacenter: datacenter1 > === > Status=Up/Down > |/ State=Normal/Leaving/Joining/Moving > -- AddressLoad Tokens OwnsHost ID >Rack > UN 127.0.0.1 228,64 KiB 256 ? > 437dc9cd-b1a1-41a5-961e-cfc99763e29f rack1 > UN 127.0.0.2 60,09 MiB 256 ? > fbcbbdbb-e32a-4716-8230-8ca59aa93e62 rack1 > UN 127.0.0.3 57,59 MiB 256
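The SSTable lifecycle quoted in the description can be sketched as a small state machine. All names below are illustrative (the real logic lives in Cassandra's org.apache.cassandra.repair code and is far more involved); the sketch only shows why a coordinator death during streaming strands SSTables in the pending state:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; not Cassandra's actual types.
public class PendingRepairSketch {
    enum State { UNREPAIRED, PENDING_REPAIR, REPAIRED }

    static final Map<String, State> sstables = new HashMap<>();

    // Anticompaction at session start marks the SSTable as pending repair,
    // excluding it from compaction and from other repair sessions.
    static void startSession(String sstable) {
        sstables.put(sstable, State.PENDING_REPAIR);
    }

    // Runs when the session outcome is observed: promote on success,
    // release on failure.
    static void finishSession(String sstable, boolean success) {
        sstables.put(sstable, success ? State.REPAIRED : State.UNREPAIRED);
    }

    public static void main(String[] args) {
        startSession("na-4-big-Data.db");
        // If the coordinator dies mid-streaming, finishSession() is never
        // invoked on the replicas, so the SSTable is stuck in PENDING_REPAIR:
        System.out.println(sstables.get("na-4-big-Data.db"));
    }
}
```

The bug report amounts to the observation that the replicas never take either exit transition when the coordinator disappears, and a process restart does not reset the state.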
[jira] [Commented] (CASSANDRA-14685) Incremental repair 4.0 : SSTables remain locked forever if the coordinator dies during streaming
[ https://issues.apache.org/jira/browse/CASSANDRA-14685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16610574#comment-16610574 ] Alexander Dejanovski commented on CASSANDRA-14685: -- [~jasobrown], sure thing, no worries. For the "pending repair" part, I must add that it doesn't happen when a replica node goes down during repair, even if it comes back way after repair is over on the coordinator. Shortly after restart, SSTables are correctly released from the pending repair. It's only when the coordinator goes down that replicas remain in pending repair state, even after a restart of the Cassandra process on these nodes. > Incremental repair 4.0 : SSTables remain locked forever if the coordinator > dies during streaming > - > > Key: CASSANDRA-14685 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14685 > Project: Cassandra > Issue Type: Bug > Components: Repair >Reporter: Alexander Dejanovski >Assignee: Jason Brown >Priority: Critical > > The changes in CASSANDRA-9143 modified the way incremental repair performs by > applying the following sequence of events : > * Anticompaction is executed on all replicas for all SSTables overlapping > the repaired ranges > * Anticompacted SSTables are then marked as "Pending repair" and cannot be > compacted anymore, nor part of another repair session > * Merkle trees are generated and compared > * Streaming takes place if needed > * Anticompaction is committed and "pending repair" table are marked as > repaired if it succeeded, or they are released if the repair session failed. > If the repair coordinator dies during the streaming phase, *the SSTables on > the replicas will remain in "pending repair" state and will never be eligible > for repair or compaction*, even after all the nodes in the cluster are > restarted. 
> Steps to reproduce (I've used Jason's 13938 branch that fixes streaming > errors) : > {noformat} > ccm create inc-repair-issue -v github:jasobrown/13938 -n 3 > # Allow jmx access and remove all rpc_ settings in yaml > for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra-env.sh; > do > sed -i'' -e > 's/com.sun.management.jmxremote.authenticate=true/com.sun.management.jmxremote.authenticate=false/g' > $f > done > for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra.yaml; > do > grep -v "rpc_" $f > ${f}.tmp > cat ${f}.tmp > $f > done > ccm start > {noformat} > I used [tlp-stress|https://github.com/thelastpickle/tlp-stress] to generate a > few 10s of MBs of data (killed it after some time). Obviously > cassandra-stress works as well : > {noformat} > bin/tlp-stress run BasicTimeSeries -i 1M -p 1M -t 2 --rate 5000 > --replication "{'class':'SimpleStrategy', 'replication_factor':2}" > --compaction "{'class': 'SizeTieredCompactionStrategy'}" --host > 127.0.0.1 > {noformat} > Flush and delete all SSTables in node1 : > {noformat} > ccm node1 nodetool flush > ccm node1 stop > rm -f ~/.ccm/inc-repair-issue/node1/data0/tlp_stress/sensor*/*.* > ccm node1 start{noformat} > Then throttle streaming throughput to 1MB/s so we have time to take node1 > down during the streaming phase and run repair: > {noformat} > ccm node1 nodetool setstreamthroughput 1 > ccm node2 nodetool setstreamthroughput 1 > ccm node3 nodetool setstreamthroughput 1 > ccm node1 nodetool repair tlp_stress > {noformat} > Once streaming starts, shut down node1 and start it again : > {noformat} > ccm node1 stop > ccm node1 start > {noformat} > Run repair again : > {noformat} > ccm node1 nodetool repair tlp_stress > {noformat} > The command will return very quickly, showing that it skipped all sstables : > {noformat} > [2018-08-31 19:05:16,292] Repair completed successfully > [2018-08-31 19:05:16,292] Repair command #1 finished in 2 seconds > $ ccm node1 nodetool status > Datacenter: datacenter1 > === > 
Status=Up/Down > |/ State=Normal/Leaving/Joining/Moving > -- AddressLoad Tokens OwnsHost ID >Rack > UN 127.0.0.1 228,64 KiB 256 ? > 437dc9cd-b1a1-41a5-961e-cfc99763e29f rack1 > UN 127.0.0.2 60,09 MiB 256 ? > fbcbbdbb-e32a-4716-8230-8ca59aa93e62 rack1 > UN 127.0.0.3 57,59 MiB 256 ? > a0b1bcc6-0fad-405a-b0bf-180a0ca31dd0 rack1 > {noformat} > sstablemetadata will then show that nodes 2 and 3 have SSTables still in > "pending repair" state : > {noformat} > ~/.ccm/repository/gitCOLONtrunk/tools/bin/sstablemetadata na-4-big-Data.db | > grep repair > SSTable: > /Users/adejanovski/.ccm/inc-repair-4.0/node2/data0/tlp_stress/sensor_data-b7375660ad3111e8a0e59357ff9c9bda/na-4-big > Pending repair: 3844a400-ad33-11e8-b5a7-6b8dd8f31b62 > {noformat} >
[jira] [Commented] (CASSANDRA-14685) Incremental repair 4.0 : SSTables remain locked forever if the coordinator dies during streaming
[ https://issues.apache.org/jira/browse/CASSANDRA-14685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599128#comment-16599128 ] Alexander Dejanovski commented on CASSANDRA-14685: -- [~jasobrown], indeed, nodes 2 and 3 are still showing ongoing streams although node1 is down : {noformat} $ ccm node2 nodetool netstats Mode: NORMAL Repair e28883b0-ad4b-11e8-82ca-5fbf27df5fb6 /127.0.0.1 Sending 2 files, 49304220 bytes total. Already sent 0 files, 5373952 bytes total /Users/adejanovski/.ccm/inc-repair-issue/node2/data0/tlp_stress/sensor_data-67193da0ad4b11e88663cb45de9ab9e9/na-9-big-Data.db 5373952/34243878 bytes(15%) sent to idx:0/127.0.0.1 Read Repair Statistics: Attempted: 0 Mismatch (Blocking): 0 Mismatch (Background): 0 Pool Name Active Pending Completed Dropped Large messages n/a 0 2 0 Small messages n/a 0 244612 0 Gossip messages n/a 23 531 0 $ ccm node3 nodetool netstats Mode: NORMAL Repair e269d820-ad4b-11e8-82ca-5fbf27df5fb6 /127.0.0.1 Sending 2 files, 49166315 bytes total. 
Already sent 1 files, 11748602 bytes total /Users/adejanovski/.ccm/inc-repair-issue/node3/data0/tlp_stress/sensor_data-67193da0ad4b11e88663cb45de9ab9e9/na-11-big-Data.db 8865018/8865018 bytes(100%) sent to idx:0/127.0.0.1 /Users/adejanovski/.ccm/inc-repair-issue/node3/data0/tlp_stress/sensor_data-67193da0ad4b11e88663cb45de9ab9e9/na-9-big-Data.db 2883584/34198115 bytes(8%) sent to idx:0/127.0.0.1 Read Repair Statistics: Attempted: 0 Mismatch (Blocking): 0 Mismatch (Background): 0 Pool Name Active Pending Completed Dropped Large messages n/a 0 2 0 Small messages n/a 0 244611 0 Gossip messages n/a 0 820 0 {noformat} > Incremental repair 4.0 : SSTables remain locked forever if the coordinator > dies during streaming > - > > Key: CASSANDRA-14685 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14685 > Project: Cassandra > Issue Type: Bug > Components: Repair >Reporter: Alexander Dejanovski >Assignee: Jason Brown >Priority: Critical > > The changes in CASSANDRA-9143 modified the way incremental repair performs by > applying the following sequence of events : > * Anticompaction is executed on all replicas for all SSTables overlapping > the repaired ranges > * Anticompacted SSTables are then marked as "Pending repair" and cannot be > compacted anymore, nor part of another repair session > * Merkle trees are generated and compared > * Streaming takes place if needed > * Anticompaction is committed and "pending repair" table are marked as > repaired if it succeeded, or they are released if the repair session failed. > If the repair coordinator dies during the streaming phase, *the SSTables on > the replicas will remain in "pending repair" state and will never be eligible > for repair or compaction*, even after all the nodes in the cluster are > restarted. 
> Steps to reproduce (I've used Jason's 13938 branch that fixes streaming > errors) : > {noformat} > ccm create inc-repair-issue -v github:jasobrown/13938 -n 3 > # Allow jmx access and remove all rpc_ settings in yaml > for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra-env.sh; > do > sed -i'' -e > 's/com.sun.management.jmxremote.authenticate=true/com.sun.management.jmxremote.authenticate=false/g' > $f > done > for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra.yaml; > do > grep -v "rpc_" $f > ${f}.tmp > cat ${f}.tmp > $f > done > ccm start > {noformat} > I used [tlp-stress|https://github.com/thelastpickle/tlp-stress] to generate a > few 10s of MBs of data (killed it after some time). Obviously > cassandra-stress works as well : > {noformat} > bin/tlp-stress run BasicTimeSeries -i 1M -p 1M -t 2 --rate 5000 > --replication "{'class':'SimpleStrategy', 'replication_factor':2}" > --compaction "{'class': 'SizeTieredCompactionStrategy'}" --host > 127.0.0.1 > {noformat} > Flush and delete all SSTables in node1 : > {noformat} > ccm node1 nodetool flush > ccm node1 stop > rm -f ~/.ccm/inc-repair-issue/node1/data0/tlp_stress/sensor*/*.* > ccm node1 start{noformat} > Then throttle streaming throughput to 1MB/s so we have time to take node1 > down during the streaming phase and run repair: > {noformat} > ccm node1 nodetool setstreamthroughput 1 > ccm node2 nodetool setstreamthroughput 1 > ccm node3 nodetool setstreamthroughput 1 > ccm node1 nodetool repair tlp_stress > {noformat} > Once streaming starts, shut down node1 and start it again : > {noformat} > ccm node1 stop > ccm node1 start > {noformat} > Run repair again : > {noformat} > ccm node1 nodetool repair tlp_stress > {noformat} > The command will return very quickly, showing that it skipped all sstables : > {noformat} > [2018-08-31 19:05:16,292] Repair completed successfully >
[jira] [Updated] (CASSANDRA-14685) Incremental repair 4.0 : SSTables remain locked forever if the coordinator dies during streaming
[ https://issues.apache.org/jira/browse/CASSANDRA-14685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski updated CASSANDRA-14685: - Description: The changes in CASSANDRA-9143 modified the way incremental repair performs by applying the following sequence of events : * Anticompaction is executed on all replicas for all SSTables overlapping the repaired ranges * Anticompacted SSTables are then marked as "Pending repair" and cannot be compacted anymore, nor part of another repair session * Merkle trees are generated and compared * Streaming takes place if needed * Anticompaction is committed and "pending repair" table are marked as repaired if it succeeded, or they are released if the repair session failed. If the repair coordinator dies during the streaming phase, *the SSTables on the replicas will remain in "pending repair" state and will never be eligible for repair or compaction*, even after all the nodes in the cluster are restarted. Steps to reproduce (I've used Jason's 13938 branch that fixes streaming errors) : {noformat} ccm create inc-repair-issue -v github:jasobrown/13938 -n 3 # Allow jmx access and remove all rpc_ settings in yaml for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra-env.sh; do sed -i'' -e 's/com.sun.management.jmxremote.authenticate=true/com.sun.management.jmxremote.authenticate=false/g' $f done for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra.yaml; do grep -v "rpc_" $f > ${f}.tmp cat ${f}.tmp > $f done ccm start {noformat} I used [tlp-stress|https://github.com/thelastpickle/tlp-stress] to generate a few 10s of MBs of data (killed it after some time). 
Obviously cassandra-stress works as well : {noformat} bin/tlp-stress run BasicTimeSeries -i 1M -p 1M -t 2 --rate 5000 --replication "{'class':'SimpleStrategy', 'replication_factor':2}" --compaction "{'class': 'SizeTieredCompactionStrategy'}" --host 127.0.0.1 {noformat} Flush and delete all SSTables in node1 : {noformat} ccm node1 nodetool flush ccm node1 stop rm -f ~/.ccm/inc-repair-issue/node1/data0/tlp_stress/sensor*/*.* ccm node1 start{noformat} Then throttle streaming throughput to 1MB/s so we have time to take node1 down during the streaming phase and run repair: {noformat} ccm node1 nodetool setstreamthroughput 1 ccm node2 nodetool setstreamthroughput 1 ccm node3 nodetool setstreamthroughput 1 ccm node1 nodetool repair tlp_stress {noformat} Once streaming starts, shut down node1 and start it again : {noformat} ccm node1 stop ccm node1 start {noformat} Run repair again : {noformat} ccm node1 nodetool repair tlp_stress {noformat} The command will return very quickly, showing that it skipped all sstables : {noformat} [2018-08-31 19:05:16,292] Repair completed successfully [2018-08-31 19:05:16,292] Repair command #1 finished in 2 seconds $ ccm node1 nodetool status Datacenter: datacenter1 === Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- AddressLoad Tokens OwnsHost ID Rack UN 127.0.0.1 228,64 KiB 256 ? 437dc9cd-b1a1-41a5-961e-cfc99763e29f rack1 UN 127.0.0.2 60,09 MiB 256 ? fbcbbdbb-e32a-4716-8230-8ca59aa93e62 rack1 UN 127.0.0.3 57,59 MiB 256 ? a0b1bcc6-0fad-405a-b0bf-180a0ca31dd0 rack1 {noformat} sstablemetadata will then show that nodes 2 and 3 have SSTables still in "pending repair" state : {noformat} ~/.ccm/repository/gitCOLONtrunk/tools/bin/sstablemetadata na-4-big-Data.db | grep repair SSTable: /Users/adejanovski/.ccm/inc-repair-4.0/node2/data0/tlp_stress/sensor_data-b7375660ad3111e8a0e59357ff9c9bda/na-4-big Pending repair: 3844a400-ad33-11e8-b5a7-6b8dd8f31b62 {noformat} Restarting these nodes wouldn't help either. 
was: The changes in CASSANDRA-9143 modified the way incremental repair performs by applying the following sequence of events : * Anticompaction is executed on all replicas for all SSTables overlapping the repaired ranges * Anticompacted SSTables are then marked as "Pending repair" and cannot be compacted anymore, nor part of another repair session * Merkle trees are generated and compared * Streaming takes place if needed * Anticompaction is committed and "pending repair" table are marked as repaired if it succeeded, or they are released if the repair session failed. If the repair coordinator dies during the streaming phase, *the SSTables on the replicas will remain in "pending repair" state and will never be eligible for repair or compaction*, even after all the nodes in the cluster are restarted. Steps to reproduce (I've used Jason's 13938 branch that fixes streaming errors) : {noformat} ccm create inc-repair-issue -v github:jasobrown/13938 -n 3 # Allow jmx access and remove all rpc_ settings in yaml for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra-env.sh; do sed -i'' -e
[jira] [Created] (CASSANDRA-14685) Incremental repair 4.0 : SSTables remain locked forever if the coordinator dies during streaming
Alexander Dejanovski created CASSANDRA-14685: Summary: Incremental repair 4.0 : SSTables remain locked forever if the coordinator dies during streaming Key: CASSANDRA-14685 URL: https://issues.apache.org/jira/browse/CASSANDRA-14685 Project: Cassandra Issue Type: Bug Components: Repair Reporter: Alexander Dejanovski The changes in CASSANDRA-9143 changed how incremental repair works by applying the following sequence of events : * Anticompaction is executed on all replicas for all SSTables overlapping the repaired ranges * Anticompacted SSTables are then marked as "Pending repair" and cannot be compacted anymore, nor be part of another repair session * Merkle trees are generated and compared * Streaming takes place if needed * Anticompaction is committed and "pending repair" tables are marked as repaired if it succeeded, or they are released if the repair session failed. If the repair coordinator dies during the streaming phase, *the SSTables on the replicas will remain in "pending repair" state and will never be eligible for repair or compaction*, even after all the nodes in the cluster are restarted. Steps to reproduce (I've used Jason's 13938 branch that fixes streaming errors) : {noformat} ccm create inc-repair-issue -v github:jasobrown/13938 -n 3 # Allow jmx access and remove all rpc_ settings in yaml for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra-env.sh; do sed -i'' -e 's/com.sun.management.jmxremote.authenticate=true/com.sun.management.jmxremote.authenticate=false/g' $f done for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra.yaml; do grep -v "rpc_" $f > ${f}.tmp cat ${f}.tmp > $f done ccm start {noformat} I used [tlp-stress|https://github.com/thelastpickle/tlp-stress] to generate a few 10s of MBs of data (killed it after some time). 
Obviously cassandra-stress works as well : {noformat} bin/tlp-stress run BasicTimeSeries -i 1M -p 1M -t 2 --rate 5000 --replication "{'class':'SimpleStrategy', 'replication_factor':2}" --compaction "{'class': 'SizeTieredCompactionStrategy'}" --host 127.0.0.1 {noformat} Flush and delete all SSTables in node1 : {noformat} ccm node1 nodetool flush rm -f ~/.ccm/inc-repair-issue/node1/data0/tlp_stress/sensor*/*.* {noformat} Then throttle streaming throughput to 1MB/s so we have time to take node1 down during the streaming phase and run repair: {noformat} ccm node1 nodetool setstreamthroughput 1 ccm node2 nodetool setstreamthroughput 1 ccm node3 nodetool setstreamthroughput 1 ccm node1 nodetool repair tlp_stress {noformat} Once streaming starts, shut down node1 and start it again : {noformat} ccm node1 stop ccm node1 start {noformat} Run repair again : {noformat} ccm node1 nodetool repair tlp_stress {noformat} The command will return very quickly, showing that it skipped all sstables : {noformat} [2018-08-31 19:05:16,292] Repair completed successfully [2018-08-31 19:05:16,292] Repair command #1 finished in 2 seconds $ ccm node1 nodetool status Datacenter: datacenter1 === Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns Host ID Rack UN 127.0.0.1 228,64 KiB 256 ? 437dc9cd-b1a1-41a5-961e-cfc99763e29f rack1 UN 127.0.0.2 60,09 MiB 256 ? fbcbbdbb-e32a-4716-8230-8ca59aa93e62 rack1 UN 127.0.0.3 57,59 MiB 256 ? a0b1bcc6-0fad-405a-b0bf-180a0ca31dd0 rack1 {noformat} sstablemetadata will then show that nodes 2 and 3 have SSTables still in "pending repair" state : {noformat} ~/.ccm/repository/gitCOLONtrunk/tools/bin/sstablemetadata na-4-big-Data.db | grep repair SSTable: /Users/adejanovski/.ccm/inc-repair-4.0/node2/data0/tlp_stress/sensor_data-b7375660ad3111e8a0e59357ff9c9bda/na-4-big Pending repair: 3844a400-ad33-11e8-b5a7-6b8dd8f31b62 {noformat} Restarting these nodes wouldn't help either. 
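A quick way to confirm that replicas are stuck is to scan SSTable metadata for the pending-repair marker. The sketch below parses sample sstablemetadata output (copied from this ticket) embedded in a heredoc; in a real ccm cluster you would pipe the actual `sstablemetadata <sstable>-Data.db` output instead, and the paths are assumptions to adapt to your setup:

```shell
# Count SSTables whose metadata still carries a "Pending repair" session id.
# The heredoc stands in for real `sstablemetadata` output on a live node.
metadata=$(cat <<'EOF'
SSTable: /Users/adejanovski/.ccm/inc-repair-4.0/node2/data0/tlp_stress/sensor_data-b7375660ad3111e8a0e59357ff9c9bda/na-4-big
Pending repair: 3844a400-ad33-11e8-b5a7-6b8dd8f31b62
EOF
)
pending=$(printf '%s\n' "$metadata" | grep -c '^Pending repair:')
if [ "$pending" -gt 0 ]; then
  echo "stuck: $pending SSTable(s) still in pending repair state"
fi
```

Running this in a loop over every `*-Data.db` file on nodes 2 and 3 would show the count staying above zero no matter how often the nodes are restarted, which is the bug being reported.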
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-10399) Create default Stress tables without compact storage
[ https://issues.apache.org/jira/browse/CASSANDRA-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423969#comment-16423969 ] Alexander Dejanovski commented on CASSANDRA-10399: -- This ticket should be closed as CASSANDRA-10857 already removed the use of COMPACT STORAGE throughout the whole codebase for 4.0. > Create default Stress tables without compact storage > - > > Key: CASSANDRA-10399 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10399 > Project: Cassandra > Issue Type: Improvement >Reporter: Sebastian Estevez >Assignee: mck >Priority: Minor > Labels: stress > Fix For: 4.x > > > ~$ cassandra-stress write > {code} > cqlsh> desc TABLE keyspace1.standard1 > CREATE TABLE keyspace1.standard1 ( > key blob PRIMARY KEY, > "C0" blob, > "C1" blob, > "C2" blob, > "C3" blob, > "C4" blob > ) WITH COMPACT STORAGE > AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}' > AND comment = '' > AND compaction = {'class': > 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'} > AND compression = {} > AND dclocal_read_repair_chance = 0.1 > AND default_time_to_live = 0 > AND gc_grace_seconds = 864000 > AND max_index_interval = 2048 > AND memtable_flush_period_in_ms = 0 > AND min_index_interval = 128 > AND read_repair_chance = 0.0 > AND speculative_retry = 'NONE'; > {code}
[jira] [Commented] (CASSANDRA-14318) Fix query pager DEBUG log leak causing hit in paged reads throughput
[ https://issues.apache.org/jira/browse/CASSANDRA-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423964#comment-16423964 ] Alexander Dejanovski commented on CASSANDRA-14318: -- Thanks for reviewing and merging [~pauloricardomg] ! CASSANDRA-10857 removed compact storage options in trunk and the standard1 tables are no longer using it : [https://github.com/apache/cassandra/commit/07fbd8ee6042797aaade90357d625ba9d79c31e0#diff-e5d5cb263c5c84c322cd09391af46d7dL141] > Fix query pager DEBUG log leak causing hit in paged reads throughput > > > Key: CASSANDRA-14318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14318 > Project: Cassandra > Issue Type: Bug >Reporter: Alexander Dejanovski >Assignee: Alexander Dejanovski >Priority: Major > Labels: lhf, performance > Fix For: 2.2.13 > > Attachments: cassandra-2.2-debug.yaml, debuglogging.png, flame22 > nodebug sjk svg.png, flame22-nodebug-sjk.svg, flame22-sjk.svg, > flame_graph_snapshot.png > > > Debug logging can involve in many cases (especially very low latency ones) a > very important overhead on the read path in 2.2 as we've seen when upgrading > clusters from 2.0 to 2.2. > The performance impact was especially noticeable on the client side metrics, > where p99 could go up to 10 times higher, while ClientRequest metrics > recorded by Cassandra didn't show any overhead. > Below shows latencies recorded on the client side with debug logging on > first, and then without it : > !debuglogging.png! > We generated a flame graph before turning off debug logging that shows the > read call stack is dominated by debug logging : > !flame_graph_snapshot.png! > I've attached the original flame graph for exploration. > Once disabled, the new flame graph shows that the read call stack gets > extremely thin, which is further confirmed by client recorded metrics : > !flame22 nodebug sjk svg.png! 
> The query pager code has been reworked since 3.0 and it looks like > log.debug() calls are gone there, but for 2.2 users and to prevent such > issues from appearing with default settings, I really think debug logging should > be disabled by default.
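Until that default changes, 2.2 users can disable debug logging themselves, either at runtime with nodetool setlogginglevel or persistently in conf/logback.xml. The fragment below is a sketch, not the full shipped file: the appender name is an assumption, so keep the appender-refs from your own logback.xml and only raise the level.

```xml
<!-- conf/logback.xml (fragment, sketch): raise the root level to INFO so
     the query pager's log.debug() calls are never evaluated.
     "SYSTEMLOG" is a placeholder; keep your file's real appender names. -->
<configuration scan="true">
  <root level="INFO">
    <appender-ref ref="SYSTEMLOG" />
  </root>
</configuration>
```

The runtime equivalent (no restart, lost on restart) would be along the lines of `nodetool setlogginglevel org.apache.cassandra INFO`.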
[jira] [Commented] (CASSANDRA-14346) Scheduled Repair in Cassandra
[ https://issues.apache.org/jira/browse/CASSANDRA-14346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420677#comment-16420677 ] Alexander Dejanovski commented on CASSANDRA-14346: -- I was told that my comments sounded like I'm strongly opposed to this ticket, which is absolutely not the case so I'll sum up my thoughts here : * Coordinated repair is a must have and should be the first thing that's implemented * Scheduling and (especially) auto scheduling will require more thoughts and discussion IMHO, at least as long as incremental repair has not proved to be bulletproof in 4.0 (we still have to see it running in production for a while). Once we can repair any table/keyspace in just a few minutes things will be very different. * Based on what the Apache Cassandra project went through with new features lately, I wouldn't rush into implementing all of this by default and take a more cautious approach for 4.0. On a side note, because one might think I'm biased in that conversation (hum monologue so far), removing boilerplate from Reaper to have some features like computing the splits or coordinating the repair jobs handled by Cassandra internally would actually make me VERY happy. > Scheduled Repair in Cassandra > - > > Key: CASSANDRA-14346 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14346 > Project: Cassandra > Issue Type: Improvement > Components: Repair >Reporter: Joseph Lynch >Priority: Major > Labels: CommunityFeedbackRequested > Fix For: 4.0 > > Attachments: ScheduledRepairV1_20180327.pdf > > > There have been many attempts to automate repair in Cassandra, which makes > sense given that it is necessary to give our users eventual consistency. Most > recently CASSANDRA-10070, CASSANDRA-8911 and CASSANDRA-13924 have all looked > for ways to solve this problem. > At Netflix we've built a scheduled repair service within Priam (our sidecar), > which we spoke about last year at NGCC. 
Given the positive feedback at NGCC > we focussed on getting it production ready and have now been using it in > production to repair hundreds of clusters, tens of thousands of nodes, and > petabytes of data for the past six months. Also based on feedback at NGCC we > have invested effort in figuring out how to integrate this natively into > Cassandra rather than open sourcing it as an external service (e.g. in Priam). > As such, [~vinaykumarcse] and I would like to re-work and merge our > implementation into Cassandra, and have created a [design > document|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit?usp=sharing] > showing how we plan to make it happen, including the user interface. > As we work on the code migration from Priam to Cassandra, any feedback would > be greatly appreciated about the interface or v1 implementation features. I > have tried to call out in the document features which we explicitly consider > future work (as well as a path forward to implement them in the future) > because I would very much like to get this done before the 4.0 merge window > closes, and to do that I think aggressively pruning scope is going to be a > necessity.
[jira] [Commented] (CASSANDRA-14346) Scheduled Repair in Cassandra
[ https://issues.apache.org/jira/browse/CASSANDRA-14346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420577#comment-16420577 ] Alexander Dejanovski commented on CASSANDRA-14346: -- Two other issues with automated scheduling of repairs would be : * Rolling upgrades : All repairs would have to be terminated and schedules stopped as soon as the cluster is running mixed versions * Expansion to new DCs : if repair triggers during the expansion to a new DC before rebuild has fully ended on all nodes, the cluster will be crushed by the entropy repair will find. Since many users will not be aware that the cluster is constantly repairing itself, this is likely to happen a lot. The latter could be mitigated if a rebuild is detected and appropriate measures were taken. I'm not sure how we can detect this flawlessly though and there would still be many cases where the cluster has been expanded but rebuild isn't started right after. It could be argued that any scheduled repair system is subject to the same caveats, but the difference is that those systems are setup by a user, not by the database itself, which should then be responsible for protecting itself against such scenarios. > Scheduled Repair in Cassandra > - > > Key: CASSANDRA-14346 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14346 > Project: Cassandra > Issue Type: Improvement > Components: Repair >Reporter: Joseph Lynch >Priority: Major > Labels: CommunityFeedbackRequested > Fix For: 4.0 > > Attachments: ScheduledRepairV1_20180327.pdf > > > There have been many attempts to automate repair in Cassandra, which makes > sense given that it is necessary to give our users eventual consistency. Most > recently CASSANDRA-10070, CASSANDRA-8911 and CASSANDRA-13924 have all looked > for ways to solve this problem. > At Netflix we've built a scheduled repair service within Priam (our sidecar), > which we spoke about last year at NGCC. 
Given the positive feedback at NGCC > we focussed on getting it production ready and have now been using it in > production to repair hundreds of clusters, tens of thousands of nodes, and > petabytes of data for the past six months. Also based on feedback at NGCC we > have invested effort in figuring out how to integrate this natively into > Cassandra rather than open sourcing it as an external service (e.g. in Priam). > As such, [~vinaykumarcse] and I would like to re-work and merge our > implementation into Cassandra, and have created a [design > document|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit?usp=sharing] > showing how we plan to make it happen, including the user interface. > As we work on the code migration from Priam to Cassandra, any feedback would > be greatly appreciated about the interface or v1 implementation features. I > have tried to call out in the document features which we explicitly consider > future work (as well as a path forward to implement them in the future) > because I would very much like to get this done before the 4.0 merge window > closes, and to do that I think aggressively pruning scope is going to be a > necessity.
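The rolling-upgrade caveat raised in the comment above implies that any repair scheduler needs a guard of roughly this shape: gather the release version from every node and pause schedules on any mismatch. A minimal sketch, where the canned `$versions` string stands in for per-node `nodetool version` output (collecting it over ssh or JMX is left out):

```shell
# Pause repair schedules when the cluster runs mixed Cassandra versions.
# $versions is sample data standing in for `nodetool version` per node.
versions='ReleaseVersion: 3.11.2
ReleaseVersion: 3.11.2
ReleaseVersion: 4.0.0'
distinct=$(printf '%s\n' "$versions" | sort -u | wc -l | tr -d ' ')
if [ "$distinct" -gt 1 ]; then
  echo "mixed versions detected ($distinct distinct): pausing repair schedules"
fi
```

The same check could gate the new-DC-expansion case by additionally refusing to run while any node reports an active rebuild, though detecting that reliably is exactly the open question raised above.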
[jira] [Commented] (CASSANDRA-14346) Scheduled Repair in Cassandra
[ https://issues.apache.org/jira/browse/CASSANDRA-14346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420504#comment-16420504 ] Alexander Dejanovski commented on CASSANDRA-14346: -- I really like the idea of making repair something that is coordinated by the cluster instead of being node centric like currently. This is how it should be implemented, and external tools should only add features over this. nodetool really should be doing this by default. I globally agree with the state machine that is detailed (haven't spent that much time on it though...) I disagree with the doc Resiliency's point 6 that adding nodes won't impact the repair : it will change the token ranges and some of the splits will now spread across different replicas which will make them unsuitable for repair (think of clusters with 256 vnodes per node). You either have to cancel the repair or recompute the remaining splits to move on with the job. I would add a feature to your nodetool repairstatus command that allows to list only the currently running repairs. Then I think the approach of implementing a fully automated, seamless, continuous repair "that just works" without user intervention is unsafe in the wild, there are too many caveats. There are many different types of cluster out there and some of them just cannot run repair without careful tuning or monitoring (if at all). The current design shows no backpressure mechanism to ensure that further running sequences won't harm the cluster because it's already running late on compactions (may it be due to overstreaming or entropy, or just the activity of the cluster). Repairing by table will add a lot of overhead over repairing a list of tables (or all) in a single session, unless multiple repairs at once on a node are allowed, which won't permit to safely terminate a single repair. It is also unclear in the current design if repair can be disabled for select tables for example (like "type: none"). 
The proposal doesn't seem to involve any change into how "nodetool repair" behaves. Will it be changed to use the state machine and coordinate throughout the cluster ? Trying to replace external tools with built in features has its limits I think, and currently the design gives only limited control by such external tools (may it be Reaper or Datastax repair service or Priam or ...). To make an analogy that was seen recently on the ML, it's as if you implemented automatic spreading of configuration changes from within Cassandra instead of relying on tools like Chef or Puppet. You'll still need global tools to manage repairs over several clusters anyway, which a Cassandra built-in feature cannot (and should not) provide. My point is that making repair smarter and coordinated within Cassandra is a great idea and I support it 100%, but the current design makes it too automated and the defaults could easily lead to severe performance problems without the user triggering anything. I don't know either how it could be made to work along user defined repairs as you'll need to force terminate some sessions. To summarize, I would put aside the scheduling features and implement the coordinated repairs by splits within Cassandra. The StorageServiceMBean should evolve to allow manually setting the number of splits by node, or rely on a number of split generated by Cassandra itself. Then it should also be possible to track progress externally by listing splits (sequences) through JMX, and pause/resume select repair runs. Also, the current design should evolve to allow a single sequence to include multiple token ranges. We have that feature waiting to be merged in Reaper to group token ranges that have the same replicas, in order to reduce the overhead of vnodes. Starting with 3.0, repair jobs can be triggered with multiple token ranges that will be executed as a single session if the replicas are the same for all. 
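The range-grouping optimization just described (one repair session for all subranges that share the same replicas) boils down to bucketing token ranges by replica set. The ranges and replica lists below are invented sample data; a real implementation would read them from `nodetool describering` or JMX:

```shell
# Bucket token ranges by replica set; each bucket can then be submitted
# as a single multi-range repair session (3.0+). Sample data, not real tokens.
groups=$(printf '%s\n' \
  "0:100 n1,n2,n3" \
  "100:200 n2,n3,n4" \
  "200:300 n1,n2,n3" |
awk '{g[$2] = g[$2] ? g[$2] "," $1 : $1}
     END {for (r in g) print r, "->", g[r]}')
echo "$groups"
```

With vnodes this collapses hundreds of per-range sessions into one session per distinct replica set, which is the overhead reduction described above.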
So, to prevent having to change the data model in the future, I'd suggest storing a list of token ranges instead of just one. Repair events should be tracked in a separate table also to avoid overwriting the last event each time (one thing Reaper currently sucks at as well). I'll go back to the document soon and add my comments there. Cheers > Scheduled Repair in Cassandra > - > > Key: CASSANDRA-14346 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14346 > Project: Cassandra > Issue Type: Improvement > Components: Repair >Reporter: Joseph Lynch >Priority: Major > Labels: CommunityFeedbackRequested > Fix For: 4.0 > > Attachments: ScheduledRepairV1_20180327.pdf > > > There have been many attempts to automate repair in Cassandra, which makes > sense given that it is necessary to give our users eventual
[jira] [Commented] (CASSANDRA-14318) Debug logging can create massive performance issues
[ https://issues.apache.org/jira/browse/CASSANDRA-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415923#comment-16415923 ] Alexander Dejanovski commented on CASSANDRA-14318: -- For the record, the same tests on 3.11.2 didn't show any notable performance difference between debug on and off : Cassandra 3.11.2 debug on : {noformat} Results: Op rate : 18 777 op/s [read_event_1: 3 165 op/s, read_event_2: 3 109 op/s, read_event_3: 12 562 op/s] Partition rate : 6 215 pk/s [read_event_1: 3 165 pk/s, read_event_2: 3 109 pk/s, read_event_3: 0 pk/s] Row rate : 6 215 row/s [read_event_1: 3 165 row/s, read_event_2: 3 109 row/s, read_event_3: 0 row/s] Latency mean : 6,7 ms [read_event_1: 6,7 ms, read_event_2: 6,7 ms, read_event_3: 6,6 ms] Latency median : 5,0 ms [read_event_1: 5,0 ms, read_event_2: 5,0 ms, read_event_3: 4,9 ms] Latency 95th percentile : 15,6 ms [read_event_1: 15,5 ms, read_event_2: 15,9 ms, read_event_3: 15,5 ms] Latency 99th percentile : 43,3 ms [read_event_1: 42,7 ms, read_event_2: 44,2 ms, read_event_3: 43,2 ms] Latency 99.9th percentile : 82,0 ms [read_event_1: 80,3 ms, read_event_2: 82,4 ms, read_event_3: 82,1 ms] Latency max : 272,4 ms [read_event_1: 272,4 ms, read_event_2: 268,7 ms, read_event_3: 245,1 ms] Total partitions : 330 970 [read_event_1: 165 386, read_event_2: 165 584, read_event_3: 0] Total errors : 0 [read_event_1: 0, read_event_2: 0, read_event_3: 0] Total GC count : 42 Total GC memory : 13,102 GiB Total GC time : 1,8 seconds Avg GC time : 42,4 ms StdDev GC time : 1,3 ms Total operation time : 00:00:53{noformat} Cassandra 3.11.2 debug off : {noformat} Results: Op rate : 18 853 op/s [read_event_1: 3 138 op/s, read_event_2: 3 137 op/s, read_event_3: 12 578 op/s] Partition rate : 6 275 pk/s [read_event_1: 3 138 pk/s, read_event_2: 3 137 pk/s, read_event_3: 0 pk/s] Row rate : 6 275 row/s [read_event_1: 3 138 row/s, read_event_2: 3 137 row/s, read_event_3: 0 row/s] Latency mean : 6,7 ms [read_event_1: 6,7 ms, 
read_event_2: 6,7 ms, read_event_3: 6,7 ms] Latency median : 5,0 ms [read_event_1: 5,1 ms, read_event_2: 5,1 ms, read_event_3: 5,0 ms] Latency 95th percentile : 15,5 ms [read_event_1: 15,5 ms, read_event_2: 15,6 ms, read_event_3: 15,4 ms] Latency 99th percentile : 39,9 ms [read_event_1: 41,0 ms, read_event_2: 39,6 ms, read_event_3: 39,6 ms] Latency 99.9th percentile : 73,3 ms [read_event_1: 73,4 ms, read_event_2: 71,6 ms, read_event_3: 73,6 ms] Latency max : 367,0 ms [read_event_1: 240,5 ms, read_event_2: 250,3 ms, read_event_3: 367,0 ms] Total partitions : 332 852 [read_event_1: 166 447, read_event_2: 166 405, read_event_3: 0] Total errors : 0 [read_event_1: 0, read_event_2: 0, read_event_3: 0] Total GC count : 46 Total GC memory : 14,024 GiB Total GC time : 2,0 seconds Avg GC time : 42,7 ms StdDev GC time : 3,9 ms Total operation time : 00:00:53{noformat} The improvement over 2.2 is nice though :) > Debug logging can create massive performance issues > --- > > Key: CASSANDRA-14318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14318 > Project: Cassandra > Issue Type: Bug >Reporter: Alexander Dejanovski >Assignee: Alexander Dejanovski >Priority: Major > Labels: lhf, performance > Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x > > Attachments: cassandra-2.2-debug.yaml, debuglogging.png, flame22 > nodebug sjk svg.png, flame22-nodebug-sjk.svg, flame22-sjk.svg, > flame_graph_snapshot.png > > > Debug logging can involve in many cases (especially very low latency ones) a > very important overhead on the read path in 2.2 as we've seen when upgrading > clusters from 2.0 to 2.2. > The performance impact was especially noticeable on the client side metrics, > where p99 could go up to 10 times higher, while ClientRequest metrics > recorded by Cassandra didn't show any overhead. > Below shows latencies recorded on the client side with debug logging on > first, and then without it : > !debuglogging.png! 
> We generated a flame graph before turning off debug logging that shows the > read call stack is dominated by debug logging : > !flame_graph_snapshot.png! > I've attached the original flame graph for exploration. > Once disabled, the new flame graph shows that the read call stack gets > extremely thin, which is further confirmed by client recorded metrics : > !flame22 nodebug sjk svg.png! > The query pager code has been reworked since 3.0 and it looks like > log.debug() calls are gone there, but for 2.2 users and to prevent such > issues from appearing with default settings, I really think debug logging should > be disabled by default.
[jira] [Comment Edited] (CASSANDRA-14318) Debug logging can create massive performance issues
[ https://issues.apache.org/jira/browse/CASSANDRA-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415846#comment-16415846 ] Alexander Dejanovski edited comment on CASSANDRA-14318 at 3/27/18 4:16 PM: --- [~jjirsa]: apparently the ReadCallback class already logs at TRACE and not DEBUG on the latest 2.2. I've created the fix that downgrades debug logging to trace logging in the query pager classes, and here are the results : debug on - no fix : {noformat} Results: op rate : 6681 [read_event_1:1109, read_event_2:1119, read_event_3:4452] partition rate : 6681 [read_event_1:1109, read_event_2:1119, read_event_3:4452] row rate : 6681 [read_event_1:1109, read_event_2:1119, read_event_3:4452] latency mean : 19,1 [read_event_1:15,4, read_event_2:15,4, read_event_3:21,0] latency median : 15,6 [read_event_1:14,2, read_event_2:14,0, read_event_3:16,3] latency 95th percentile : 39,1 [read_event_1:28,4, read_event_2:28,6, read_event_3:44,2] latency 99th percentile : 75,6 [read_event_1:52,9, read_event_2:53,6, read_event_3:87,7] latency 99.9th percentile : 315,7 [read_event_1:101,0, read_event_2:110,1, read_event_3:361,1] latency max : 609,1 [read_event_1:319,6, read_event_2:315,9, read_event_3:609,1] Total partitions : 993050 [read_event_1:164882, read_event_2:166381, read_event_3:661787] Total errors : 0 [read_event_1:0, read_event_2:0, read_event_3:0] total gc count : 189 total gc mb : 56464 total gc time (s) : 7 avg gc time(ms) : 37 stdev gc time(ms) : 8 Total operation time : 00:02:28{noformat} debug off - no fix : {noformat} Results: op rate : 12655 [read_event_1:2141, read_event_2:2093, read_event_3:8422] partition rate : 12655 [read_event_1:2141, read_event_2:2093, read_event_3:8422] row rate : 12655 [read_event_1:2141, read_event_2:2093, read_event_3:8422] latency mean : 10,1 [read_event_1:10,1, read_event_2:10,1, read_event_3:10,1] latency median : 9,2 [read_event_1:9,2, read_event_2:9,2, read_event_3:9,3] latency 95th percentile : 15,2 
[read_event_1:15,8, read_event_2:15,9, read_event_3:15,7] latency 99th percentile : 29,3 [read_event_1:44,5, read_event_2:45,1, read_event_3:41,3] latency 99.9th percentile : 52,7 [read_event_1:67,9, read_event_2:66,9, read_event_3:67,1] latency max : 268,0 [read_event_1:257,1, read_event_2:263,3, read_event_3:268,0] Total partitions : 983056 [read_event_1:166311, read_event_2:162570, read_event_3:654175] Total errors : 0 [read_event_1:0, read_event_2:0, read_event_3:0] total gc count : 100 total gc mb : 31529 total gc time (s) : 4 avg gc time(ms) : 37 stdev gc time(ms) : 5 Total operation time : 00:01:17{noformat} debug on - with fix : {noformat} Results: op rate : 12289 [read_event_1:2058, read_event_2:2051, read_event_3:8181] partition rate : 12289 [read_event_1:2058, read_event_2:2051, read_event_3:8181] row rate : 12289 [read_event_1:2058, read_event_2:2051, read_event_3:8181] latency mean : 10,4 [read_event_1:10,4, read_event_2:10,4, read_event_3:10,4] latency median : 9,4 [read_event_1:9,4, read_event_2:9,4, read_event_3:9,4] latency 95th percentile : 16,3 [read_event_1:16,8, read_event_2:17,3, read_event_3:16,2] latency 99th percentile : 36,6 [read_event_1:44,3, read_event_2:46,6, read_event_3:37,2] latency 99.9th percentile : 62,2 [read_event_1:78,0, read_event_2:77,1, read_event_3:80,8] latency max : 251,2 [read_event_1:246,9, read_event_2:249,9, read_event_3:251,2] Total partitions : 1000000 [read_event_1:167422, read_event_2:166861, read_event_3:665717] Total errors : 0 [read_event_1:0, read_event_2:0, read_event_3:0] total gc count : 102 total gc mb : 31843 total gc time (s) : 4 avg gc time(ms) : 38 stdev gc time(ms) : 6 Total operation time : 00:01:21{noformat} So performance with the fix and debug logging on matches performance with debug logging off. The difference in throughput is pretty massive as we roughly get *twice the read throughput* with the fix. 
Latencies without the fix and with the fix : p95 : 35ms -> 16ms p99 : 75ms -> 36ms I've run all tests several times, alternating with and without the fix to make sure caches were not making a difference, and results were consistent with what's pasted above. It's been running on a single node using an i3.xlarge instance for Cassandra and another i3.large for running cassandra-stress. *One pretty interesting thing to note* is that when I tested with the predefined mode of cassandra-stress, no paging occurred and the performance difference was not noticeable. This is due to the fact that the predefined mode generates COMPACT STORAGE tables, which involve a different read path (apparently). I think anyone performing benchmarks for Cassandra changes should be aware that the predefined mode isn't relevant and that a user-defined test should be used (maybe we should create one that would be used as a standard benchmark). Here's the one I used :
[jira] [Commented] (CASSANDRA-14318) Debug logging can create massive performance issues
[ https://issues.apache.org/jira/browse/CASSANDRA-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415846#comment-16415846 ] Alexander Dejanovski commented on CASSANDRA-14318: -- [~jjirsa]: apparently the ReadCallback class already logs at TRACE and not DEBUG on the latest 2.2. I've created the fix that downgrades debug logging to trace logging in the query pager classes, and here are the results : debug on - no fix : {noformat} Results: op rate : 6681 [read_event_1:1109, read_event_2:1119, read_event_3:4452] partition rate : 6681 [read_event_1:1109, read_event_2:1119, read_event_3:4452] row rate : 6681 [read_event_1:1109, read_event_2:1119, read_event_3:4452] latency mean : 19,1 [read_event_1:15,4, read_event_2:15,4, read_event_3:21,0] latency median : 15,6 [read_event_1:14,2, read_event_2:14,0, read_event_3:16,3] latency 95th percentile : 39,1 [read_event_1:28,4, read_event_2:28,6, read_event_3:44,2] latency 99th percentile : 75,6 [read_event_1:52,9, read_event_2:53,6, read_event_3:87,7] latency 99.9th percentile : 315,7 [read_event_1:101,0, read_event_2:110,1, read_event_3:361,1] latency max : 609,1 [read_event_1:319,6, read_event_2:315,9, read_event_3:609,1] Total partitions : 993050 [read_event_1:164882, read_event_2:166381, read_event_3:661787] Total errors : 0 [read_event_1:0, read_event_2:0, read_event_3:0] total gc count : 189 total gc mb : 56464 total gc time (s) : 7 avg gc time(ms) : 37 stdev gc time(ms) : 8 Total operation time : 00:02:28{noformat} debug off - no fix : {noformat} Results: op rate : 12655 [read_event_1:2141, read_event_2:2093, read_event_3:8422] partition rate : 12655 [read_event_1:2141, read_event_2:2093, read_event_3:8422] row rate : 12655 [read_event_1:2141, read_event_2:2093, read_event_3:8422] latency mean : 10,1 [read_event_1:10,1, read_event_2:10,1, read_event_3:10,1] latency median : 9,2 [read_event_1:9,2, read_event_2:9,2, read_event_3:9,3] latency 95th percentile : 15,2 [read_event_1:15,8, 
read_event_2:15,9, read_event_3:15,7] latency 99th percentile : 29,3 [read_event_1:44,5, read_event_2:45,1, read_event_3:41,3] latency 99.9th percentile : 52,7 [read_event_1:67,9, read_event_2:66,9, read_event_3:67,1] latency max : 268,0 [read_event_1:257,1, read_event_2:263,3, read_event_3:268,0] Total partitions : 983056 [read_event_1:166311, read_event_2:162570, read_event_3:654175] Total errors : 0 [read_event_1:0, read_event_2:0, read_event_3:0] total gc count : 100 total gc mb : 31529 total gc time (s) : 4 avg gc time(ms) : 37 stdev gc time(ms) : 5 Total operation time : 00:01:17{noformat} debug on - with fix : {noformat} Results: op rate : 12289 [read_event_1:2058, read_event_2:2051, read_event_3:8181] partition rate : 12289 [read_event_1:2058, read_event_2:2051, read_event_3:8181] row rate : 12289 [read_event_1:2058, read_event_2:2051, read_event_3:8181] latency mean : 10,4 [read_event_1:10,4, read_event_2:10,4, read_event_3:10,4] latency median : 9,4 [read_event_1:9,4, read_event_2:9,4, read_event_3:9,4] latency 95th percentile : 16,3 [read_event_1:16,8, read_event_2:17,3, read_event_3:16,2] latency 99th percentile : 36,6 [read_event_1:44,3, read_event_2:46,6, read_event_3:37,2] latency 99.9th percentile : 62,2 [read_event_1:78,0, read_event_2:77,1, read_event_3:80,8] latency max : 251,2 [read_event_1:246,9, read_event_2:249,9, read_event_3:251,2] Total partitions : 1000000 [read_event_1:167422, read_event_2:166861, read_event_3:665717] Total errors : 0 [read_event_1:0, read_event_2:0, read_event_3:0] total gc count : 102 total gc mb : 31843 total gc time (s) : 4 avg gc time(ms) : 38 stdev gc time(ms) : 6 Total operation time : 00:01:21{noformat} So performance with the fix and debug logging on matches performance with debug logging off. The difference in throughput is pretty massive as we roughly get *twice the read throughput* with the fix. 
Latencies without the fix vs. with the fix: p95 : 35ms -> 16ms, p99 : 75ms -> 36ms. I've run all tests several times, alternating with and without the fix to make sure caches were not skewing the results, and they were consistent with what's pasted above. Tests ran on a single node, using an i3.xlarge instance for Cassandra and an i3.large instance for cassandra-stress.
*One pretty interesting thing to note* is that when I tested with the predefined mode of cassandra-stress, no paging occurred and the performance difference was not noticeable. This is apparently because the predefined mode generates COMPACT STORAGE tables, which use a different read path. Anyone performing benchmarks for Cassandra changes should be aware that the predefined mode isn't representative and that a user-defined test should be used (maybe we should create one to serve as a standard benchmark). Here's the one I used: [^cassandra-2.2-debug.yaml] With the following commands for
[jira] [Updated] (CASSANDRA-14318) Debug logging can create massive performance issues
[ https://issues.apache.org/jira/browse/CASSANDRA-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski updated CASSANDRA-14318:
-
Attachment: cassandra-2.2-debug.yaml

> Debug logging can create massive performance issues
> ---
>
> Key: CASSANDRA-14318
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14318
> Project: Cassandra
> Issue Type: Bug
> Reporter: Alexander Dejanovski
> Assignee: Alexander Dejanovski
> Priority: Major
> Labels: lhf, performance
> Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x
>
> Attachments: cassandra-2.2-debug.yaml, debuglogging.png, flame22 nodebug sjk svg.png, flame22-nodebug-sjk.svg, flame22-sjk.svg, flame_graph_snapshot.png
>
>
> In many cases (especially very low latency ones), debug logging can add significant overhead on the read path in 2.2, as we've seen when upgrading clusters from 2.0 to 2.2.
> The performance impact was especially noticeable in client-side metrics, where p99 could go up to 10 times higher, while the ClientRequest metrics recorded by Cassandra didn't show any overhead.
> Below are latencies recorded on the client side, first with debug logging on and then without it:
> !debuglogging.png!
> We generated a flame graph before turning off debug logging that shows the read call stack is dominated by debug logging:
> !flame_graph_snapshot.png!
> I've attached the original flame graph for exploration.
> Once debug logging is disabled, the new flame graph shows that the read call stack gets extremely thin, which is further confirmed by client-recorded metrics:
> !flame22 nodebug sjk svg.png!
> The query pager code has been reworked since 3.0 and it looks like the log.debug() calls are gone there, but for 2.2 users, and to prevent such issues from appearing with default settings, I really think debug logging should be disabled by default.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
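Part of the overhead described above comes from building the log message (string concatenation, toString() calls, boxing) before the logging framework ever gets a chance to filter the entry. A hedged sketch of why deferring message construction matters (the MiniLogger below is a stand-in, not the real SLF4J API):

```java
// Sketch: with eager concatenation the message is built even when DEBUG
// is off, while a parameterized call can skip formatting entirely after
// the level check. Names are illustrative, not real framework code.
public class MessageCostSketch {
    static int eagerBuilds = 0;

    // Stands in for any expensive message construction on the read path.
    static String describe(int rows) {
        eagerBuilds++;
        return "Read " + rows + " rows";
    }

    static class MiniLogger {
        final boolean debugEnabled;
        MiniLogger(boolean debugEnabled) { this.debugEnabled = debugEnabled; }

        // Eager: the caller already built the message; cost paid regardless.
        void debug(String msg) { /* appender would filter/write here */ }

        // Parameterized: format only if the level is enabled.
        void debug(String template, Object arg) {
            if (debugEnabled) {
                /* format the template and write, only when enabled */
            }
        }
    }

    public static void main(String[] args) {
        MiniLogger log = new MiniLogger(false); // DEBUG disabled
        int rows = 4452;

        log.debug(describe(rows));        // message built anyway
        log.debug("Read {} rows", rows);  // nothing built: level is off

        System.out.println("eagerBuilds=" + eagerBuilds);
    }
}
```

Only the eager call pays the construction cost here, which matches the flame-graph observation that the read stack was dominated by logging work rather than by actual writes.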
[jira] [Commented] (CASSANDRA-14326) Handle verbose logging at a different level than DEBUG
[ https://issues.apache.org/jira/browse/CASSANDRA-14326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16406721#comment-16406721 ] Alexander Dejanovski commented on CASSANDRA-14326:
--
I agree it would be nice to keep logging levels incremental, so that verbose contains info + verbose, and debug contains info + verbose + debug. But then we would have to make two changes to enable debug logging at will:
* change the o.a.c logger level to DEBUG
* uncomment the ASYNCDEBUGLOG appender

Otherwise:
* if the appender is there, we always have something written to debug.log (all the INFO level entries)
* and if o.a.c is at DEBUG all the time, any call to logger.debug() will have to be in a conditional block to avoid the performance penalty of evaluating the calls only to have the appender filter the debug entries out.

Unless there's a better way of achieving this?

> Handle verbose logging at a different level than DEBUG
> --
>
> Key: CASSANDRA-14326
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14326
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Alexander Dejanovski
> Priority: Major
> Fix For: 4.x
>
> CASSANDRA-10241 introduced debug logging turned on by default to act as a verbose system.log and help troubleshoot production issues.
> One of the consequences was to severely affect read performance in 2.2, as contributors weren't all up to speed on how to use logging levels (CASSANDRA-14318).
> As the DEBUG level has a very specific meaning in development, it is confusing to use it for always-on verbose logging, and it should probably not be used this way in Cassandra.
> Options so far are:
> # Bring back common loggings to INFO level (compactions, flushes, etc...) and disable debug logging by default
> # Use files named verbose-system.log instead of debug.log, and use a custom logging level instead of DEBUG for verbose tracing, that would be enabled by default.
> Debug logging would still exist and be disabled by default in the root logger (not just filtered at the appender level).
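The "conditional block" concern raised above can be sketched as follows: when the logger level alone cannot drop an entry cheaply (because filtering happens downstream at the appender), callers must guard debug calls explicitly so that argument construction is skipped. This is a generic illustration of the guard pattern, not Cassandra code:

```java
// Sketch of guarding logger.debug() calls so that argument construction
// is skipped when debug output is effectively disabled. The MiniLogger
// is a stand-in; names and messages are illustrative.
public class GuardedDebugSketch {
    static int argBuilds = 0;

    // Stands in for expensive argument construction (row state dumps etc.).
    static String rowState() {
        argBuilds++;
        return "partition=..., rows=...";
    }

    static class MiniLogger {
        final boolean debugEnabled;
        MiniLogger(boolean debugEnabled) { this.debugEnabled = debugEnabled; }
        boolean isDebugEnabled() { return debugEnabled; }
        void debug(String msg) { /* appender-side filtering would go here */ }
    }

    public static void main(String[] args) {
        MiniLogger log = new MiniLogger(false);

        // Unguarded: rowState() runs even though the entry is dropped.
        log.debug(rowState());

        // Guarded: the argument is never built when debug is off.
        if (log.isDebugEnabled())
            log.debug(rowState());

        System.out.println("argBuilds=" + argBuilds);
    }
}
```

Requiring every such call site to carry this guard is exactly the maintenance burden the comment argues against when proposing a separate verbose level instead of always-on DEBUG.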