[ https://issues.apache.org/jira/browse/CASSANDRA-10874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Hust updated CASSANDRA-10874:
------------------------------------
Description:

Running a read stress after a write stress, with a compaction strategy set and a replication factor matching the node count, fails with an exception:
{code}
Operation x0 on key(s) [38343433384b34364c30]: Data returned was not validated
{code}
Example run:
{code}
ccm create stress -v git:cassandra-3.0 -n 3 -s
ccm node1 stress write n=10M -rate threads=300 -schema replication\(factor=3\) compaction\(strategy=org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy\)
ccm node1 nodetool flush
ccm node1 nodetool compactionstats  # check until quiet
ccm node1 stress read n=10M -rate threads=300
{code}
- This fails both with and without vnodes, but will occasionally pass without vnodes.
- Changing the read phase to CL=QUORUM makes it pass.
- Removing the replication factor on write makes it pass.
- It happens with all compaction strategies.

With that in mind, I attempted to add a repair after the write phase. This leads to one of two outcomes.

1: A repair that reports greater than 100% completion. It usually stalls after a bit, but I have seen it reach >400% progress:
{code}
id                                   compaction type keyspace  table     completed   total       unit  progress
2d5344c0-9dc8-11e5-9d5f-4fdec8d76c27 Validation      keyspace1 standard1 94722609949 44035292145 bytes 215.11%
{code}
2: A repair that reports a greatly inflated completed/total value. It crunches for a bit and then locks up:
{code}
id                                   compaction type keyspace  table     completed total        unit  progress
8c4cf7f0-a34a-11e5-a321-777be88c58ae Validation      keyspace1 standard1 0         874811100900 bytes 0.00%

❯ du -sh ~/.ccm/stress/node1/
2.4G    ~/.ccm/stress/node1/
❯ du -sh ~/.ccm/stress
7.1G    ~/.ccm/stress
{code}
This has been reproduced on cassandra-3.0 and cassandra-2.1, both locally and using cstar_perf (links below). A big twist is that cassandra-2.2 passes the majority of the time: it completes successfully, without the repair, in 8 out of 10 runs.
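As a sanity check on outcome 1, the anomalous progress figure is simply completed/total from the {{compactionstats}} output, so the validation compaction really has processed more than twice the bytes it estimated:
{code}
# Progress as reported by nodetool compactionstats is completed/total.
completed = 94722609949  # bytes validated so far
total = 44035292145      # estimated total bytes
print(f"{completed / total * 100:.2f}%")  # prints 215.11%, matching the output above
{code}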
This can be seen in the cstar_perf links below.

cstar_perf runs:
http://cstar.datastax.com/tests/id/c8fa27a4-a205-11e5-8fbc-0256e416528f
http://cstar.datastax.com/tests/id/a254c572-a2ce-11e5-a8b9-0256e416528f
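For reference, a sketch of the two variations described above. The keyspace/table names ({{keyspace1}}/{{standard1}}) are the cassandra-stress defaults; the exact repair invocation is my assumption of the step being described, not a command taken from the runs above:
{code}
# Workaround 1: read back at CL=QUORUM instead of the default CL=ONE -- passes
ccm node1 stress read n=10M cl=QUORUM -rate threads=300

# Variation 2: run an explicit repair after the write phase -- this is the
# step that exhibits the stalled/inflated validation progress shown above
ccm node1 nodetool repair keyspace1 standard1
{code}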
> running stress with compaction strategy and replication factor fails on read after write
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-10874
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10874
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: Andrew Hust
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)