[jira] [Updated] (CASSANDRA-14685) Incremental repair 4.0 : SSTables remain locked forever if the coordinator dies during streaming

Alexander Dejanovski (JIRA) Fri, 31 Aug 2018 11:26:23 -0700


     [ https://issues.apache.org/jira/browse/CASSANDRA-14685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Alexander Dejanovski updated CASSANDRA-14685:
---------------------------------------------
    Description: 
The changes in CASSANDRA-9143 modified the way incremental repair works. A 
repair session now goes through the following sequence of events: 
 * Anticompaction is executed on all replicas for all SSTables overlapping the 
repaired ranges
 * Anticompacted SSTables are then marked as "pending repair"; they can no 
longer be compacted or be part of another repair session
 * Merkle trees are generated and compared
 * Streaming takes place if needed
 * Anticompaction is committed: the "pending repair" SSTables are marked as 
repaired if the session succeeded, or released if it failed.

If the repair coordinator dies during the streaming phase, *the SSTables on the 
replicas will remain in "pending repair" state and will never be eligible for 
repair or compaction*, even after all the nodes in the cluster are restarted. 

Steps to reproduce (I've used Jason's 13938 branch that fixes streaming errors):
{noformat}
ccm create inc-repair-issue -v github:jasobrown/13938 -n 3

# Allow jmx access and remove all rpc_ settings in yaml
for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra-env.sh;
do
  sed -i'' -e 's/com.sun.management.jmxremote.authenticate=true/com.sun.management.jmxremote.authenticate=false/g' "$f"
done

for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra.yaml;
do
  grep -v "rpc_" $f > ${f}.tmp
  cat ${f}.tmp > $f
done

ccm start
{noformat}
I used [tlp-stress|https://github.com/thelastpickle/tlp-stress] to generate a 
few tens of MB of data (I killed it after some time). cassandra-stress would 
work as well:
{noformat}
bin/tlp-stress run BasicTimeSeries -i 1M -p 1M -t 2 --rate 5000 \
  --replication "{'class':'SimpleStrategy', 'replication_factor':2}" \
  --compaction "{'class': 'SizeTieredCompactionStrategy'}" \
  --host 127.0.0.1
{noformat}
Flush node1, stop it, delete all of its tlp_stress SSTables, and start it again:
{noformat}
ccm node1 nodetool flush
ccm node1 stop
rm -f ~/.ccm/inc-repair-issue/node1/data0/tlp_stress/sensor*/*.*
ccm node1 start
{noformat}
Then throttle streaming throughput to 1 Mb/s (setstreamthroughput takes 
megabits per second) so that there is time to take node1 down during the 
streaming phase, and run repair:
{noformat}
ccm node1 nodetool setstreamthroughput 1
ccm node2 nodetool setstreamthroughput 1
ccm node3 nodetool setstreamthroughput 1
ccm node1 nodetool repair tlp_stress
{noformat}
Once streaming starts, shut down node1 (the repair coordinator) and start it 
again:
{noformat}
ccm node1 stop
ccm node1 start
{noformat}
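Timing the shutdown by hand is the fragile part of this reproduction. The sketch below is one possible way to automate it, not part of the original steps. The helper names is_streaming and wait_then_restart are hypothetical, and the pattern matched in the nodetool netstats output ("Sending N files" / "Receiving N files") is an assumption about the output format that should be checked against the build under test.

```shell
# Hypothetical helper: reads `nodetool netstats` output on stdin and succeeds
# if at least one stream session is listed.
# Assumption: active sessions print lines such as
# "Receiving 5 files, 1024 bytes total. ..." or "Sending 5 files, ...".
is_streaming() {
  grep -Eq '(Sending|Receiving) [0-9]+ files'
}

# Hypothetical helper: poll a ccm node once per second until streaming is
# visible, then stop and start it. Gives up after $2 attempts (default 120)
# so that it cannot loop forever when the repair streams nothing.
wait_then_restart() {
  node="$1"
  tries="${2:-120}"
  while [ "$tries" -gt 0 ]; do
    tries=$((tries - 1))
    if ccm "$node" nodetool netstats | is_streaming; then
      ccm "$node" stop
      ccm "$node" start
      return 0
    fi
    sleep 1
  done
  return 1
}
```

With these definitions loaded, the repair would be started in the background (ccm node1 nodetool repair tlp_stress &) and followed by wait_then_restart node1.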
Run repair again :
{noformat}
ccm node1 nodetool repair tlp_stress
{noformat}
The command will return very quickly, showing that it skipped all SSTables:
{noformat}
[2018-08-31 19:05:16,292] Repair completed successfully
[2018-08-31 19:05:16,292] Repair command #1 finished in 2 seconds

$ ccm node1 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load        Tokens  Owns  Host ID                               Rack
UN  127.0.0.1  228,64 KiB  256     ?     437dc9cd-b1a1-41a5-961e-cfc99763e29f  rack1
UN  127.0.0.2  60,09 MiB   256     ?     fbcbbdbb-e32a-4716-8230-8ca59aa93e62  rack1
UN  127.0.0.3  57,59 MiB   256     ?     a0b1bcc6-0fad-405a-b0bf-180a0ca31dd0  rack1
{noformat}
sstablemetadata will then show that nodes 2 and 3 have SSTables still in 
"pending repair" state :
{noformat}
~/.ccm/repository/gitCOLONtrunk/tools/bin/sstablemetadata na-4-big-Data.db | grep repair
SSTable: /Users/adejanovski/.ccm/inc-repair-4.0/node2/data0/tlp_stress/sensor_data-b7375660ad3111e8a0e59357ff9c9bda/na-4-big
Pending repair: 3844a400-ad33-11e8-b5a7-6b8dd8f31b62
{noformat}
Restarting these nodes does not help either.
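
Checking SSTables one by one gets tedious on a larger data set. The following is a minimal sketch of how the check above could be run over every node of the ccm cluster. The function names pending_repair_id and list_pending_sstables are hypothetical, the directory layout is taken from the paths shown above, and the sketch assumes that a stuck SSTable reports a UUID on its "Pending repair:" line while any other value means it is not pending.

```shell
# Hypothetical helper: reads sstablemetadata output on stdin and prints the
# pending repair session id, or nothing when the line holds no UUID.
pending_repair_id() {
  sed -n -E 's/^Pending repair: *([0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}).*$/\1/p'
}

# Hypothetical helper: print "<session id> <file>" for every Data.db file of
# a keyspace that is still marked as pending repair.
#   $1: ccm cluster directory, e.g. ~/.ccm/inc-repair-issue
#   $2: keyspace name, e.g. tlp_stress
#   $3: path to the sstablemetadata tool
list_pending_sstables() {
  cluster_dir="$1"
  keyspace="$2"
  tool="$3"
  for f in "$cluster_dir"/node*/data0/"$keyspace"/*/*-Data.db; do
    [ -e "$f" ] || continue        # the glob matched nothing
    id=$("$tool" "$f" | pending_repair_id)
    if [ -n "$id" ]; then
      printf '%s %s\n' "$id" "$f"
    fi
  done
  return 0
}
```

For the cluster above, the call would look like list_pending_sstables ~/.ccm/inc-repair-issue tlp_stress ~/.ccm/repository/gitCOLONtrunk/tools/bin/sstablemetadata. Grouping the output by session id shows which repair session left the SSTables locked.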

  was:
The changes in CASSANDRA-9143 modified the way incremental repair works. A 
repair session now goes through the following sequence of events: 
 * Anticompaction is executed on all replicas for all SSTables overlapping the 
repaired ranges
 * Anticompacted SSTables are then marked as "pending repair"; they can no 
longer be compacted or be part of another repair session
 * Merkle trees are generated and compared
 * Streaming takes place if needed
 * Anticompaction is committed: the "pending repair" SSTables are marked as 
repaired if the session succeeded, or released if it failed.

If the repair coordinator dies during the streaming phase, *the SSTables on the 
replicas will remain in "pending repair" state and will never be eligible for 
repair or compaction*, even after all the nodes in the cluster are restarted. 

Steps to reproduce (I've used Jason's 13938 branch that fixes streaming errors):
{noformat}
ccm create inc-repair-issue -v github:jasobrown/13938 -n 3

# Allow jmx access and remove all rpc_ settings in yaml
for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra-env.sh;
do
  sed -i'' -e 's/com.sun.management.jmxremote.authenticate=true/com.sun.management.jmxremote.authenticate=false/g' "$f"
done

for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra.yaml;
do
  grep -v "rpc_" $f > ${f}.tmp
  cat ${f}.tmp > $f
done

ccm start
{noformat}
I used [tlp-stress|https://github.com/thelastpickle/tlp-stress] to generate a 
few tens of MB of data (I killed it after some time). cassandra-stress would 
work as well:
{noformat}
bin/tlp-stress run BasicTimeSeries -i 1M -p 1M -t 2 --rate 5000 \
  --replication "{'class':'SimpleStrategy', 'replication_factor':2}" \
  --compaction "{'class': 'SizeTieredCompactionStrategy'}" \
  --host 127.0.0.1
{noformat}
Flush node1 and delete all of its tlp_stress SSTables:
{noformat}
ccm node1 nodetool flush
rm -f ~/.ccm/inc-repair-issue/node1/data0/tlp_stress/sensor*/*.*
{noformat}
Then throttle streaming throughput to 1 Mb/s (setstreamthroughput takes 
megabits per second) so that there is time to take node1 down during the 
streaming phase, and run repair:
{noformat}
ccm node1 nodetool setstreamthroughput 1
ccm node2 nodetool setstreamthroughput 1
ccm node3 nodetool setstreamthroughput 1
ccm node1 nodetool repair tlp_stress
{noformat}
Once streaming starts, shut down node1 (the repair coordinator) and start it 
again:
{noformat}
ccm node1 stop
ccm node1 start
{noformat}
Run repair again :
{noformat}
ccm node1 nodetool repair tlp_stress
{noformat}
The command will return very quickly, showing that it skipped all SSTables:
{noformat}
[2018-08-31 19:05:16,292] Repair completed successfully
[2018-08-31 19:05:16,292] Repair command #1 finished in 2 seconds

$ ccm node1 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load        Tokens  Owns  Host ID                               Rack
UN  127.0.0.1  228,64 KiB  256     ?     437dc9cd-b1a1-41a5-961e-cfc99763e29f  rack1
UN  127.0.0.2  60,09 MiB   256     ?     fbcbbdbb-e32a-4716-8230-8ca59aa93e62  rack1
UN  127.0.0.3  57,59 MiB   256     ?     a0b1bcc6-0fad-405a-b0bf-180a0ca31dd0  rack1
{noformat}
sstablemetadata will then show that nodes 2 and 3 have SSTables still in 
"pending repair" state :
{noformat}
~/.ccm/repository/gitCOLONtrunk/tools/bin/sstablemetadata na-4-big-Data.db | grep repair
SSTable: /Users/adejanovski/.ccm/inc-repair-4.0/node2/data0/tlp_stress/sensor_data-b7375660ad3111e8a0e59357ff9c9bda/na-4-big
Pending repair: 3844a400-ad33-11e8-b5a7-6b8dd8f31b62
{noformat}
Restarting these nodes does not help either.


> Incremental repair 4.0 : SSTables remain locked forever if the coordinator 
> dies during streaming 
> -------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-14685
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14685
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Repair
>            Reporter: Alexander Dejanovski
>            Assignee: Jason Brown
>            Priority: Critical



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
