[ https://issues.apache.org/jira/browse/CASSANDRA-10874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Hust updated CASSANDRA-10874:
------------------------------------
Description:

Running a read stress after a write stress, with a compaction strategy set and a replication factor matching the node count, fails with an exception:
{code}
Operation x0 on key(s) [38343433384b34364c30]: Data returned was not validated
{code}
Example run:
{code}
ccm create stress -v git:cassandra-3.0 -n 3 -s
ccm node1 stress write n=10M -rate threads=300 -schema replication\(factor=3\) compaction\(strategy=org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy\)
ccm node1 nodetool flush
ccm node1 nodetool compactionstats  # check until quiet
ccm node1 stress read n=10M -rate threads=300
{code}
- This fails both with and without vnodes, but will occasionally pass without vnodes.
- Changing the read phase to CL=QUORUM makes it pass.
- Removing the replication factor on write makes it pass.
- It happens with all compaction strategies.

With that in mind, I attempted to add a repair after the write phase. This leads to one of two outcomes.

1: A repair that reports greater than 100% completion. It usually stalls after a bit, but I have seen it reach >400% progress:
{code}
id                                   compaction type keyspace  table     completed   total       unit  progress
2d5344c0-9dc8-11e5-9d5f-4fdec8d76c27 Validation      keyspace1 standard1 94722609949 44035292145 bytes 215.11%
{code}
2: A repair that reports a greatly inflated completed/total value. It crunches for a bit and then locks up:
{code}
id                                   compaction type keyspace  table     completed total        unit  progress
8c4cf7f0-a34a-11e5-a321-777be88c58ae Validation      keyspace1 standard1 0         874811100900 bytes 0.00%

❯ du -sh ~/.ccm/stress/node1/
2.4G    ~/.ccm/stress/node1/
❯ du -sh ~/.ccm/stress
7.1G    ~/.ccm/stress
{code}
This has been reproduced on cassandra-3.0 and cassandra-2.1, both locally and using cstar_perf (links below). A big twist is that cassandra-2.2 passes the majority of the time: it completes successfully, without the repair, in 8 out of 10 runs.
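As a sanity check on outcome 1, the anomalous progress figure is simply completed/total from the {{compactionstats}} output, so the validation compaction really has processed more than twice the bytes it estimated:
{code}
# Progress as reported by nodetool compactionstats is completed/total.
completed = 94722609949  # bytes validated so far
total = 44035292145      # estimated total bytes
print(f"{completed / total * 100:.2f}%")  # prints 215.11%, matching the output above
{code}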
This can be seen in the cstar_perf links below.

cstar_perf runs:
http://cstar.datastax.com/tests/id/c8fa27a4-a205-11e5-8fbc-0256e416528f
http://cstar.datastax.com/tests/id/a254c572-a2ce-11e5-a8b9-0256e416528f
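For reference, a sketch of the two variations described above. The keyspace/table names ({{keyspace1}}/{{standard1}}) are the cassandra-stress defaults; the exact repair invocation is my assumption of the step being described, not a command taken from the runs above:
{code}
# Workaround 1: read back at CL=QUORUM instead of the default CL=ONE -- passes
ccm node1 stress read n=10M cl=QUORUM -rate threads=300

# Variation 2: run an explicit repair after the write phase -- this is the
# step that exhibits the stalled/inflated validation progress shown above
ccm node1 nodetool repair keyspace1 standard1
{code}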
> running stress with compaction strategy and replication factor fails on read after write
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-10874
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10874
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: Andrew Hust
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)