Re: Soliciting volunteers for flaky dtests on trunk

2017-05-17 Thread Lerh Chuan Low
Hey Ariel,

It looks like you've closed the only JIRA I've found on CqlshSmokeTest (
https://issues.apache.org/jira/browse/CASSANDRA-13140) and as you mentioned
in the ticket, it hasn't been failing recently in either CassCI or Apache
Jenkins. I think we're gold for that one.

Would anyone like a hand with anything?

Lerh

On 18 May 2017 at 03:36, Ariel Weisberg  wrote:

> Hi,
>
> Thank you Blake, Lerh Chuan Low, Jason, and Kurt, and anyone else who
> volunteered.
>
> I'm going to look at repair_test.TestRepair which is not quite the same
> as repair_test.incremental_repair test which Blake is looking at.
>
> The one remaining somewhat high pole in the tent is
> cqlsh_tests.CqlshSmokeTest.
>
> Thanks,
> Ariel
>
> On Thu, May 11, 2017, at 01:12 PM, Jason Brown wrote:
> > I've taken
> > CASSANDRA-13507
> > CASSANDRA-13517
> >
> > -Jason
> >
> >
> > On Wed, May 10, 2017 at 9:45 PM, Lerh Chuan Low 
> > wrote:
> >
> > > I'll try my hand on https://issues.apache.org/jira/browse/CASSANDRA-13182.
> > >
> > > On 11 May 2017 at 05:59, Blake Eggleston  wrote:
> > >
> > > > I've taken CASSANDRA-13194, CASSANDRA-13506, CASSANDRA-13515,
> > > > and CASSANDRA-13372 to start
> > > >
> > > > On May 10, 2017 at 12:44:47 PM, Ariel Weisberg (ar...@weisberg.ws)
> > > wrote:
> > > >
> > > > Hi,
> > > >
> > > > The dev list murdered my rich text formatted email. Here it is
> > > > reformatted as plain text.
> > > >
> > > > The unit tests are looking pretty reliable right now. There is a long
> > > > tail of infrequently failing tests but it's not bad and almost all
> > > > builds succeed in the current build environment. In CircleCI it seems
> > > > like unit tests might be a little less reliable, but still usable.
> > > >
> > > > The dtests on the other hand aren't producing clean builds yet. There
> > > > is also a pretty diverse set of failing tests.
> > > >
> > > > I did a bit of triaging of the flakey dtests. I started by cataloging
> > > > everything, but what I found is that the long tail of flakey dtests is
> > > > very long indeed so I narrowed focus to just the top frequently failing
> > > > tests for now. See https://goo.gl/b96CdO
> > > >
> > > > I created a spreadsheet with some of the failing tests. It has links to JIRA,
> > > > the last time the test was seen failing, and how many failures I found in
> > > > Apache Jenkins across the 3 dtest builds. There are a lot of failures
> > > > not listed. There would be 50+ entries if I cataloged each one.
> > > >
> > > > There are two hard failing tests, but both are already moving along:
> > > > CASSANDRA-13229 (Ready to commit, assigned Alex Petrov, Paulo Motta
> > > > reviewing, last updated April 2017) dtest failure in
> > > > topology_test.TestTopology.size_estimates_multidc_test
> > > > CASSANDRA-13113 (Ready to commit, assigned Alex Petrov, Sam T Reviewing,
> > > > last updated March 2017) test failure in
> > > > auth_test.TestAuth.system_auth_ks_is_alterable_test
> > > >
> > > > I think the tests we should tackle first are on this sheet in priority
> > > > order https://goo.gl/S3khv1
> > > >
> > > > Suite: bootstrap_test
> > > > Test: TestBootstrap.simultaneous_bootstrap_test
> > > > JIRA: https://issues.apache.org/jira/browse/CASSANDRA-13506
> > > > Last failure: 5/5/2017
> > > > Counted failures: 45
> > > >
> > > > Suite: repair_test
> > > > Test: incremental_repair_test.TestIncRepair.compaction_test
> > > > JIRA: https://issues.apache.org/jira/browse/CASSANDRA-13194
> > > > Last failure: 5/4/2017
> > > > Counted failures: 44
> > > >
> > > > Suite: sstableutil_test
> > > > Test: SSTableUtilTest.compaction_test
> > > > JIRA: https://issues.apache.org/jira/browse/CASSANDRA-13182
> > > > Last failure: 5/4/2017
> > > > Counted failures: 35
> > > >
> > > > Suite: paging_test
> > > > Test: TestPagingWithDeletions.test_ttl_deletions
> > > > JIRA: https://issues.apache.org/jira/browse/CASSANDRA-13507
> > > > Last failure: 4/25/2017
> > > > Counted failures: 31
> > > >
> > > > Suite: repair_test
> > > > Test: incremental_repair_test.TestIncRepair.multiple_repair_test
> > > > JIRA: https://issues.apache.org/jira/browse/CASSANDRA-13515
> > > > Last failed: 5/4/2017
> > > > Counted failures: 18
> > > >
> > > > Suite: cqlsh_tests
> > > > Test: cqlsh_copy_tests.CqlshCopyTest.test_bulk_round_trip_*
> > > > JIRA:
> > > > https://issues.apache.org/jira/issues/?jql=project%20%
> > > > 3D%20CASSANDRA%20AND%20status%20in%20(Open%2C%20%22In%
> > > > 20Progress%22%2C%20Reopened%2C%20%22Patch%20Available%22%
> > > > 2C%20%22Ready%20to%20Commit%22%2C%20%22Awaiting%
> > > > 20Feedback%22)%20AND%20text%20~%20%22CqlshCopyTest%22
> > > > Last failed: 5/8/2017
> > > > Counted failures: 23
> > > >
> > > > Suite: paxos_tests
> > > > Test: TestPaxos.contention_test_many_threads
> > > > JIRA: https://issues.apache.org/jira/browse/CASSANDRA-13517
> > > > Last failed: 5/8/2017
> > > > Counted failures: 15
> > > >
> > > > Suite: 

Repair Management

2017-05-17 Thread Cameron Zemek
I am looking to improve monitoring and management of repairs (so far I have
a patch for adding ActiveRepairs to table/keyspace metrics) and came across
ActiveRepairServiceMBean, but this appears to be limited to incremental
repairs. Is there a reason for this?

I was looking to add something very similar to the existing nodetool
repair_admin, but one that would work on coordinator repair commands.

For example:
$ nodetool repair_admin --list
Repair#1 mykeyspace columnFamilies=colfamilya,colfamilyb; incremental=True; parallelism=parallel progress=5%

$ nodetool repair_admin --terminate 1
Terminating repair command #1 (19f00c30-1390-11e7-bb50-ffb920a6d70f)

$ nodetool repair_admin --terminate-all  # calls ssProxy.forceTerminateAllRepairSessions()
Terminating all repair sessions
Terminated repair command #2 (64c44230-21aa-11e7-9ede-cd6eb64e3786)

What is the purpose of the current repair_admin? If I wish to add the above,
should I rename the MBean to, say,
org.apache.cassandra.db:type=IncrementalRepairService and the nodetool
command to inc_repair_admin?
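
To make the proposal a bit more concrete, here is a minimal Python sketch of
the bookkeeping that the list/terminate commands above would need. It is not
Cassandra code: RepairCommand, RepairRegistry and all of their methods are
hypothetical names invented for illustration, and a real implementation would
presumably sit behind a Java MBean. It only models tracking active
coordinator repair commands by number and producing the output lines shown
in the example.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RepairCommand:
    """Hypothetical record of one coordinator-side repair command."""
    cmd_id: int
    keyspace: str
    column_families: List[str]
    incremental: bool
    parallelism: str
    progress_pct: int = 0
    session_id: uuid.UUID = field(default_factory=uuid.uuid1)

    def describe(self) -> str:
        # Mirrors the line format proposed for `repair_admin --list`.
        return (
            f"Repair#{self.cmd_id} {self.keyspace} "
            f"columnFamilies={','.join(self.column_families)}; "
            f"incremental={self.incremental}; "
            f"parallelism={self.parallelism} progress={self.progress_pct}%"
        )


class RepairRegistry:
    """Hypothetical in-memory registry of active repair commands."""

    def __init__(self) -> None:
        self._next_id = 1
        self._active: Dict[int, RepairCommand] = {}

    def submit(self, keyspace: str, column_families: List[str],
               incremental: bool, parallelism: str) -> RepairCommand:
        # Assign the next command number and remember the command.
        cmd = RepairCommand(self._next_id, keyspace, list(column_families),
                            incremental, parallelism)
        self._active[cmd.cmd_id] = cmd
        self._next_id += 1
        return cmd

    def list_commands(self) -> List[str]:
        return [cmd.describe() for cmd in self._active.values()]

    def terminate(self, cmd_id: int) -> str:
        # Raises KeyError if no active command has this number.
        cmd = self._active.pop(cmd_id)
        return f"Terminating repair command #{cmd.cmd_id} ({cmd.session_id})"

    def terminate_all(self) -> List[str]:
        lines = ["Terminating all repair sessions"]
        for cmd_id in sorted(self._active):
            cmd = self._active.pop(cmd_id)
            lines.append(
                f"Terminated repair command #{cmd.cmd_id} ({cmd.session_id})")
        return lines
```

In this sketch a command is simply dropped from the registry when it is
terminated; how the underlying repair sessions would actually be cancelled
is left open.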


Re: Soliciting volunteers for flaky dtests on trunk

2017-05-17 Thread Ariel Weisberg
Hi,

Thank you Blake, Lerh Chuan Low, Jason, and Kurt, and anyone else who
volunteered.

I'm going to look at repair_test.TestRepair which is not quite the same
as repair_test.incremental_repair test which Blake is looking at. 

The one remaining somewhat high pole in the tent is
cqlsh_tests.CqlshSmokeTest.

Thanks,
Ariel

On Thu, May 11, 2017, at 01:12 PM, Jason Brown wrote:
> I've taken
> CASSANDRA-13507
> CASSANDRA-13517
> 
> -Jason
> 
> 
> On Wed, May 10, 2017 at 9:45 PM, Lerh Chuan Low 
> wrote:
> 
> > I'll try my hand on https://issues.apache.org/jira/browse/CASSANDRA-13182.
> >
> > On 11 May 2017 at 05:59, Blake Eggleston  wrote:
> >
> > > I've taken CASSANDRA-13194, CASSANDRA-13506, CASSANDRA-13515,
> > > and CASSANDRA-13372 to start
> > >
> > > On May 10, 2017 at 12:44:47 PM, Ariel Weisberg (ar...@weisberg.ws)
> > wrote:
> > >
> > > Hi,
> > >
> > > The dev list murdered my rich text formatted email. Here it is
> > > reformatted as plain text.
> > >
> > > The unit tests are looking pretty reliable right now. There is a long
> > > tail of infrequently failing tests but it's not bad and almost all
> > > builds succeed in the current build environment. In CircleCI it seems
> > > like unit tests might be a little less reliable, but still usable.
> > >
> > > > The dtests on the other hand aren't producing clean builds yet. There
> > > is also a pretty diverse set of failing tests.
> > >
> > > I did a bit of triaging of the flakey dtests. I started by cataloging
> > > everything, but what I found is that the long tail of flakey dtests is
> > > very long indeed so I narrowed focus to just the top frequently failing
> > > tests for now. See https://goo.gl/b96CdO
> > >
> > > > I created a spreadsheet with some of the failing tests. It has links to JIRA,
> > > > the last time the test was seen failing, and how many failures I found in
> > > > Apache Jenkins across the 3 dtest builds. There are a lot of failures
> > > not listed. There would be 50+ entries if I cataloged each one.
> > >
> > > There are two hard failing tests, but both are already moving along:
> > > CASSANDRA-13229 (Ready to commit, assigned Alex Petrov, Paulo Motta
> > > reviewing, last updated April 2017) dtest failure in
> > > topology_test.TestTopology.size_estimates_multidc_test
> > > CASSANDRA-13113 (Ready to commit, assigned Alex Petrov, Sam T Reviewing,
> > > last updated March 2017) test failure in
> > > auth_test.TestAuth.system_auth_ks_is_alterable_test
> > >
> > > I think the tests we should tackle first are on this sheet in priority
> > > order https://goo.gl/S3khv1
> > >
> > > Suite: bootstrap_test
> > > Test: TestBootstrap.simultaneous_bootstrap_test
> > > JIRA: https://issues.apache.org/jira/browse/CASSANDRA-13506
> > > Last failure: 5/5/2017
> > > Counted failures: 45
> > >
> > > Suite: repair_test
> > > Test: incremental_repair_test.TestIncRepair.compaction_test
> > > JIRA: https://issues.apache.org/jira/browse/CASSANDRA-13194
> > > Last failure: 5/4/2017
> > > Counted failures: 44
> > >
> > > Suite: sstableutil_test
> > > Test: SSTableUtilTest.compaction_test
> > > JIRA: https://issues.apache.org/jira/browse/CASSANDRA-13182
> > > Last failure: 5/4/2017
> > > Counted failures: 35
> > >
> > > Suite: paging_test
> > > Test: TestPagingWithDeletions.test_ttl_deletions
> > > JIRA: https://issues.apache.org/jira/browse/CASSANDRA-13507
> > > Last failure: 4/25/2017
> > > Counted failures: 31
> > >
> > > Suite: repair_test
> > > Test: incremental_repair_test.TestIncRepair.multiple_repair_test
> > > JIRA: https://issues.apache.org/jira/browse/CASSANDRA-13515
> > > Last failed: 5/4/2017
> > > Counted failures: 18
> > >
> > > Suite: cqlsh_tests
> > > Test: cqlsh_copy_tests.CqlshCopyTest.test_bulk_round_trip_*
> > > JIRA:
> > > https://issues.apache.org/jira/issues/?jql=project%20%
> > > 3D%20CASSANDRA%20AND%20status%20in%20(Open%2C%20%22In%
> > > 20Progress%22%2C%20Reopened%2C%20%22Patch%20Available%22%
> > > 2C%20%22Ready%20to%20Commit%22%2C%20%22Awaiting%
> > > 20Feedback%22)%20AND%20text%20~%20%22CqlshCopyTest%22
> > > Last failed: 5/8/2017
> > > Counted failures: 23
> > >
> > > Suite: paxos_tests
> > > Test: TestPaxos.contention_test_many_threads
> > > JIRA: https://issues.apache.org/jira/browse/CASSANDRA-13517
> > > Last failed: 5/8/2017
> > > Counted failures: 15
> > >
> > > Suite: repair_test
> > > Test: TestRepair
> > > JIRA:
> > > https://issues.apache.org/jira/issues/?jql=status%20%3D%
> > > 20Open%20AND%20text%20~%20%22dtest%20failure%20repair_test%22
> > > Last failure: 5/4/2017
> > > Comment: No one test fails a lot but the number of failing tests is
> > > substantial
> > >
> > > Suite: cqlsh_tests
> > > Test: cqlsh_tests.CqlshSmokeTest.[test_insert | test_truncate |
> > > test_use_keyspace | test_create_keyspace]
> > > JIRA: No JIRA yet
> > > Last failed: 4/22/2017
> > > count: 6
> > >
> > > If you have spare cycles you can make a huge difference in test
> > > stability by picking off one of