Andrew Hust created CASSANDRA-10874:
---------------------------------------

             Summary: running stress with compaction strategy and replication factor fails on read after write
                 Key: CASSANDRA-10874
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10874
             Project: Cassandra
          Issue Type: Bug
          Components: Tools
            Reporter: Andrew Hust


Running a read stress after a write stress, with a compaction strategy 
specified and a replication factor matching the node count, fails with an 
exception:
{code}
Operation x0 on key(s) [38343433384b34364c30]: Data returned was not validated
{code}

Example run:
{code}
ccm create stress -v git:cassandra-3.0 -n 3 -s
ccm node1 stress write n=10M -rate threads=300 -schema replication\(factor=3\) compaction\(strategy=org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy\)
ccm node1 nodetool flush
ccm node1 nodetool compactionstats # check until quiet
ccm node1 stress read n=10M -rate threads=300
{code}
- This fails both with and without vnodes, but occasionally passes without vnodes.
- Changing the read phase to CL=QUORUM makes it pass.
- Removing the replication factor on write makes it pass.
- It happens with all compaction strategies.
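
The "check until quiet" step in the example run can be scripted rather than checked by hand. The following is a minimal sketch, not part of the original reproduction: the function name `wait_for_compactions` is hypothetical, and it assumes that `nodetool compactionstats` prints a `pending tasks: 0` line once the node is idle, which may vary between versions.

```shell
# Hypothetical helper: rerun a stats command until it reports no pending compactions.
# $1 is the polling interval in seconds; the remaining arguments are the command to run.
# Assumption: the command prints a line ending in "pending tasks: 0" when the node is idle.
wait_for_compactions() {
    interval=$1
    shift
    until "$@" | grep -Eq 'pending tasks: 0[[:space:]]*$'; do
        sleep "$interval"
    done
}
```

Under those assumptions it would be called between the flush and the read phase as `wait_for_compactions 10 ccm node1 nodetool compactionstats`.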

With that in mind, I tried adding a repair after the write phase.  This 
leads to one of two outcomes.

1: A repair that reports greater than 100% completion. It usually stalls after 
a while, but I have seen it reach >400% progress:
{code}
                                     id   compaction type    keyspace       table     completed         total    unit   progress
   2d5344c0-9dc8-11e5-9d5f-4fdec8d76c27        Validation   keyspace1   standard1   94722609949   44035292145   bytes    215.11%
{code}
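
The reported progress matches completed divided by total, so the figure above 100% means the validation has processed more bytes than the total it estimated up front. A small sketch to check the arithmetic (the helper name `progress_pct` is hypothetical, and the two values are copied from the output above):

```shell
# Hypothetical helper: progress as a percentage of completed over total bytes.
progress_pct() {
    awk -v completed="$1" -v total="$2" \
        'BEGIN { printf "%.2f%%\n", 100 * completed / total }'
}

progress_pct 94722609949 44035292145   # prints 215.11%
```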

2: A repair that reports a greatly inflated completed/total value. It crunches 
for a while and then locks up:
{code}
                                     id   compaction type    keyspace       table   completed          total    unit   progress
   8c4cf7f0-a34a-11e5-a321-777be88c58ae        Validation   keyspace1   standard1           0   874811100900   bytes      0.00%

❯ du -sh ~/.ccm/stress/node1/
2.4G  ~/.ccm/stress/node1/
❯ du -sh ~/.ccm/stress
7.1G  ~/.ccm/stress
{code}
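
To put the inflated total in perspective, it can be converted to the same units that `du -sh` prints. This is a rough sketch only: the helper name `bytes_to_gib` is hypothetical, and it assumes `du -sh` reports binary units (1G = 1024^3 bytes).

```shell
# Hypothetical helper: convert a byte count to GiB with one decimal place.
bytes_to_gib() {
    awk -v bytes="$1" 'BEGIN { printf "%.1f\n", bytes / (1024 * 1024 * 1024) }'
}

bytes_to_gib 874811100900   # prints 814.7
```

Under that assumption, the validation total is roughly 815G for a node holding 2.4G on disk, which is also far more than the 7.1G used by the whole cluster.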

This has been reproduced on cassandra-3.0 and cassandra-2.2, both locally and 
using cstar_perf (links below).  
Notably, cassandra-2.2 passes the majority of the time: without the repair, it 
completes successfully in 8 out of 10 runs.  This can be seen in the 
cstar_perf links below.

cstar_perf runs:
http://cstar.datastax.com/tests/id/a8b6af02-a2ce-11e5-bb72-0256e416528f
http://cstar.datastax.com/tests/id/a254c572-a2ce-11e5-a8b9-0256e416528f



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
