Re: Soliciting volunteers for flaky dtests on trunk
Hey Ariel,

It looks like you've closed the only JIRA I've found on CqlshSmokeTest (https://issues.apache.org/jira/browse/CASSANDRA-13140) and, as you mentioned in the ticket, it hasn't been failing recently in either CassCI or Apache Jenkins. I think we're gold for that one.

Would anyone like a hand with anything?

Lerh

On 18 May 2017 at 03:36, Ariel Weisberg wrote:
> Hi,
>
> Thank you Blake, Lerh Chuan Low, Jason, and Kurt, and anyone else who
> volunteered.
>
> I'm going to look at repair_test.TestRepair which is not quite the same
> as repair_test.incremental_repair test which Blake is looking at.
>
> The one remaining somewhat high pole in the tent is
> cqlsh_tests.CqlshSmokeTest.
>
> Thanks,
> Ariel
>
> On Thu, May 11, 2017, at 01:12 PM, Jason Brown wrote:
> > I've taken
> > CASSANDRA-13507
> > CASSANDRA-13517
> >
> > -Jason
> >
> > On Wed, May 10, 2017 at 9:45 PM, Lerh Chuan Low wrote:
> >
> > > I'll try my hand on https://issues.apache.org/jira/browse/CASSANDRA-13182.
> > >
> > > On 11 May 2017 at 05:59, Blake Eggleston wrote:
> > >
> > > > I've taken CASSANDRA-13194, CASSANDRA-13506, CASSANDRA-13515,
> > > > and CASSANDRA-13372 to start
> > > >
> > > > On May 10, 2017 at 12:44:47 PM, Ariel Weisberg (ar...@weisberg.ws) wrote:
> > > >
> > > > Hi,
> > > >
> > > > The dev list murdered my rich text formatted email. Here it is
> > > > reformatted as plain text.
> > > >
> > > > The unit tests are looking pretty reliable right now. There is a long
> > > > tail of infrequently failing tests but it's not bad and almost all
> > > > builds succeed in the current build environment. In CircleCI it seems
> > > > like unit tests might be a little less reliable, but still usable.
> > > >
> > > > The dtests on the other hand aren't producing clean builds yet. There
> > > > is also a pretty diverse set of failing tests.
> > > >
> > > > I did a bit of triaging of the flakey dtests. I started by cataloging
> > > > everything, but what I found is that the long tail of flakey dtests is
> > > > very long indeed so I narrowed focus to just the top frequently failing
> > > > tests for now. See https://goo.gl/b96CdO
> > > >
> > > > I created a spreadsheet with some of the failing tests. Links to JIRA,
> > > > last time the test was seen failing, and how many failures I found in
> > > > Apache Jenkins across the 3 dtest builds. There are a lot of failures
> > > > not listed. There would be 50+ entries if I cataloged each one.
> > > >
> > > > There are two hard failing tests, but both are already moving along:
> > > > CASSANDRA-13229 (Ready to commit, assigned Alex Petrov, Paulo Motta
> > > > reviewing, last updated April 2017) dtest failure in
> > > > topology_test.TestTopology.size_estimates_multidc_test
> > > > CASSANDRA-13113 (Ready to commit, assigned Alex Petrov, Sam T Reviewing,
> > > > last updated March 2017) test failure in
> > > > auth_test.TestAuth.system_auth_ks_is_alterable_test
> > > >
> > > > I think the tests we should tackle first are on this sheet in priority
> > > > order https://goo.gl/S3khv1
> > > >
> > > > Suite: bootstrap_test
> > > > Test: TestBootstrap.simultaneous_bootstrap_test
> > > > JIRA: https://issues.apache.org/jira/browse/CASSANDRA-13506
> > > > Last failure: 5/5/2017
> > > > Counted failures: 45
> > > >
> > > > Suite: repair_test
> > > > Test: incremental_repair_test.TestIncRepair.compaction_test
> > > > JIRA: https://issues.apache.org/jira/browse/CASSANDRA-13194
> > > > Last failure: 5/4/2017
> > > > Counted failures: 44
> > > >
> > > > Suite: sstableutil_test
> > > > Test: SSTableUtilTest.compaction_test
> > > > JIRA: https://issues.apache.org/jira/browse/CASSANDRA-13182
> > > > Last failure: 5/4/2017
> > > > Counted failures: 35
> > > >
> > > > Suite: paging_test
> > > > Test: TestPagingWithDeletions.test_ttl_deletions
> > > > JIRA: https://issues.apache.org/jira/browse/CASSANDRA-13507
> > > > Last failure: 4/25/2017
> > > > Counted failures: 31
> > > >
> > > > Suite: repair_test
> > > > Test: incremental_repair_test.TestIncRepair.multiple_repair_test
> > > > JIRA: https://issues.apache.org/jira/browse/CASSANDRA-13515
> > > > Last failed: 5/4/2017
> > > > Counted failures: 18
> > > >
> > > > Suite: cqlsh_tests
> > > > Test: cqlsh_copy_tests.CqlshCopyTest.test_bulk_round_trip_*
> > > > JIRA:
> > > > https://issues.apache.org/jira/issues/?jql=project%20%3D%20CASSANDRA%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened%2C%20%22Patch%20Available%22%2C%20%22Ready%20to%20Commit%22%2C%20%22Awaiting%20Feedback%22)%20AND%20text%20~%20%22CqlshCopyTest%22
> > > > Last failed: 5/8/2017
> > > > Counted failures: 23
> > > >
> > > > Suite: paxos_tests
> > > > Test: TestPaxos.contention_test_many_threads
> > > > JIRA: https://issues.apache.org/jira/browse/CASSANDRA-13517
> > > > Last failed: 5/8/2017
> > > > Counted failures: 15
> > > >
> > > > Suite:
Repair Management
I am looking to improve monitoring and management of repairs (so far I have a patch for adding ActiveRepairs to table/keyspace metrics) and came across ActiveRepairServiceMBean, but this appears to be limited to incremental repairs. Is there a reason for this?

I was looking to add something very similar to the existing nodetool repair_admin, but it would work on coordinator repair commands. For example:

    $ nodetool repair_admin --list
    Repair#1 mykeyspace columnFamilies=colfamilya,colfamilyb; incremental=True; parallelism=parallel progress=5%

    $ nodetool repair_admin --terminate 1
    Terminating repair command #1 (19f00c30-1390-11e7-bb50-ffb920a6d70f)

    $ nodetool repair_admin --terminate-all  # calls ssProxy.forceTerminateAllRepairSessions()
    Terminating all repair sessions
    Terminated repair command #2 (64c44230-21aa-11e7-9ede-cd6eb64e3786)

What is the purpose of the current repair_admin? If I wish to add the above, should I rename the MBean to, say, org.apache.cassandra.db:type=IncrementalRepairService and the nodetool command to inc_repair_admin?
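
To make the proposed semantics easier to discuss, here is a rough Python model of a coordinator-side command registry. It is only a sketch under stated assumptions: `RepairCommand`, `RepairRegistry`, and every method name below are hypothetical and do not correspond to existing Cassandra classes or MBean operations. It simply mimics the `--list`, `--terminate`, and `--terminate-all` output shown in the example above.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RepairCommand:
    """One coordinator repair command (hypothetical model, not Cassandra code)."""
    command_id: int
    keyspace: str
    tables: List[str]
    incremental: bool
    parallelism: str
    progress_pct: int = 0
    session_id: uuid.UUID = field(default_factory=uuid.uuid4)

    def describe(self) -> str:
        # Mirrors the one-line layout of the proposed `--list` output.
        return (
            f"Repair#{self.command_id} {self.keyspace} "
            f"columnFamilies={','.join(self.tables)}; "
            f"incremental={self.incremental}; "
            f"parallelism={self.parallelism} "
            f"progress={self.progress_pct}%"
        )


class RepairRegistry:
    """Tracks active repair commands by their numeric command id."""

    def __init__(self) -> None:
        self._commands: Dict[int, RepairCommand] = {}
        self._next_id = 1

    def submit(self, keyspace: str, tables: List[str],
               incremental: bool, parallelism: str) -> RepairCommand:
        cmd = RepairCommand(self._next_id, keyspace, list(tables),
                            incremental, parallelism)
        self._commands[cmd.command_id] = cmd
        self._next_id += 1
        return cmd

    def list_commands(self) -> List[str]:
        # Stand-in for `repair_admin --list`.
        return [c.describe() for c in self._commands.values()]

    def terminate(self, command_id: int) -> str:
        # Stand-in for `repair_admin --terminate <id>`; raises KeyError if unknown.
        cmd = self._commands.pop(command_id)
        return f"Terminating repair command #{cmd.command_id} ({cmd.session_id})"

    def terminate_all(self) -> List[str]:
        # Stand-in for `repair_admin --terminate-all`.
        lines = ["Terminating all repair sessions"]
        for command_id in sorted(self._commands):
            cmd = self._commands.pop(command_id)
            lines.append(
                f"Terminated repair command #{cmd.command_id} ({cmd.session_id})"
            )
        return lines
```

In this model the numeric command id is the handle an operator passes to `--terminate`, while the UUID is only echoed back for cross-referencing with logs. Whether a real implementation should key on the command id or the session id is an open design question.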
Re: Soliciting volunteers for flaky dtests on trunk
Hi,

Thank you Blake, Lerh Chuan Low, Jason, and Kurt, and anyone else who volunteered.

I'm going to look at repair_test.TestRepair which is not quite the same as repair_test.incremental_repair test which Blake is looking at.

The one remaining somewhat high pole in the tent is cqlsh_tests.CqlshSmokeTest.

Thanks,
Ariel

On Thu, May 11, 2017, at 01:12 PM, Jason Brown wrote:
> I've taken
> CASSANDRA-13507
> CASSANDRA-13517
>
> -Jason
>
> On Wed, May 10, 2017 at 9:45 PM, Lerh Chuan Low wrote:
>
> > I'll try my hand on https://issues.apache.org/jira/browse/CASSANDRA-13182.
> >
> > On 11 May 2017 at 05:59, Blake Eggleston wrote:
> >
> > > I've taken CASSANDRA-13194, CASSANDRA-13506, CASSANDRA-13515,
> > > and CASSANDRA-13372 to start
> > >
> > > On May 10, 2017 at 12:44:47 PM, Ariel Weisberg (ar...@weisberg.ws) wrote:
> > >
> > > Hi,
> > >
> > > The dev list murdered my rich text formatted email. Here it is
> > > reformatted as plain text.
> > >
> > > The unit tests are looking pretty reliable right now. There is a long
> > > tail of infrequently failing tests but it's not bad and almost all
> > > builds succeed in the current build environment. In CircleCI it seems
> > > like unit tests might be a little less reliable, but still usable.
> > >
> > > The dtests on the other hand aren't producing clean builds yet. There
> > > is also a pretty diverse set of failing tests.
> > >
> > > I did a bit of triaging of the flakey dtests. I started by cataloging
> > > everything, but what I found is that the long tail of flakey dtests is
> > > very long indeed so I narrowed focus to just the top frequently failing
> > > tests for now. See https://goo.gl/b96CdO
> > >
> > > I created a spreadsheet with some of the failing tests. Links to JIRA,
> > > last time the test was seen failing, and how many failures I found in
> > > Apache Jenkins across the 3 dtest builds. There are a lot of failures
> > > not listed. There would be 50+ entries if I cataloged each one.
> > >
> > > There are two hard failing tests, but both are already moving along:
> > > CASSANDRA-13229 (Ready to commit, assigned Alex Petrov, Paulo Motta
> > > reviewing, last updated April 2017) dtest failure in
> > > topology_test.TestTopology.size_estimates_multidc_test
> > > CASSANDRA-13113 (Ready to commit, assigned Alex Petrov, Sam T Reviewing,
> > > last updated March 2017) test failure in
> > > auth_test.TestAuth.system_auth_ks_is_alterable_test
> > >
> > > I think the tests we should tackle first are on this sheet in priority
> > > order https://goo.gl/S3khv1
> > >
> > > Suite: bootstrap_test
> > > Test: TestBootstrap.simultaneous_bootstrap_test
> > > JIRA: https://issues.apache.org/jira/browse/CASSANDRA-13506
> > > Last failure: 5/5/2017
> > > Counted failures: 45
> > >
> > > Suite: repair_test
> > > Test: incremental_repair_test.TestIncRepair.compaction_test
> > > JIRA: https://issues.apache.org/jira/browse/CASSANDRA-13194
> > > Last failure: 5/4/2017
> > > Counted failures: 44
> > >
> > > Suite: sstableutil_test
> > > Test: SSTableUtilTest.compaction_test
> > > JIRA: https://issues.apache.org/jira/browse/CASSANDRA-13182
> > > Last failure: 5/4/2017
> > > Counted failures: 35
> > >
> > > Suite: paging_test
> > > Test: TestPagingWithDeletions.test_ttl_deletions
> > > JIRA: https://issues.apache.org/jira/browse/CASSANDRA-13507
> > > Last failure: 4/25/2017
> > > Counted failures: 31
> > >
> > > Suite: repair_test
> > > Test: incremental_repair_test.TestIncRepair.multiple_repair_test
> > > JIRA: https://issues.apache.org/jira/browse/CASSANDRA-13515
> > > Last failed: 5/4/2017
> > > Counted failures: 18
> > >
> > > Suite: cqlsh_tests
> > > Test: cqlsh_copy_tests.CqlshCopyTest.test_bulk_round_trip_*
> > > JIRA:
> > > https://issues.apache.org/jira/issues/?jql=project%20%3D%20CASSANDRA%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened%2C%20%22Patch%20Available%22%2C%20%22Ready%20to%20Commit%22%2C%20%22Awaiting%20Feedback%22)%20AND%20text%20~%20%22CqlshCopyTest%22
> > > Last failed: 5/8/2017
> > > Counted failures: 23
> > >
> > > Suite: paxos_tests
> > > Test: TestPaxos.contention_test_many_threads
> > > JIRA: https://issues.apache.org/jira/browse/CASSANDRA-13517
> > > Last failed: 5/8/2017
> > > Counted failures: 15
> > >
> > > Suite: repair_test
> > > Test: TestRepair
> > > JIRA:
> > > https://issues.apache.org/jira/issues/?jql=status%20%3D%20Open%20AND%20text%20~%20%22dtest%20failure%20repair_test%22
> > > Last failure: 5/4/2017
> > > Comment: No one test fails a lot but the number of failing tests is
> > > substantial
> > >
> > > Suite: cqlsh_tests
> > > Test: cqlsh_tests.CqlshSmokeTest.[test_insert | test_truncate |
> > > test_use_keyspace | test_create_keyspace]
> > > JIRA: No JIRA yet
> > > Last failed: 4/22/2017
> > > count: 6
> > >
> > > If you have spare cycles you can make a huge difference in test
> > > stability by picking off one of