[ https://issues.apache.org/jira/browse/CASSANDRA-14685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexander Dejanovski updated CASSANDRA-14685:
---------------------------------------------
Description:
The changes in CASSANDRA-9143 modified how incremental repair works. It now
applies the following sequence of events:
* Anticompaction is executed on all replicas for all SSTables overlapping the
repaired ranges
* Anticompacted SSTables are then marked as "pending repair"; they can no
longer be compacted or take part in another repair session
* Merkle trees are generated and compared
* Streaming takes place if needed
* Anticompaction is committed: the "pending repair" SSTables are marked as
repaired if the session succeeded, or released if the repair session failed.
If the repair coordinator dies during the streaming phase, *the SSTables on the
replicas will remain in "pending repair" state and will never be eligible for
repair or compaction*, even after all the nodes in the cluster are restarted.
Steps to reproduce (I used Jason's 13938 branch, which fixes streaming errors):
{noformat}
ccm create inc-repair-issue -v github:jasobrown/13938 -n 3
# Allow JMX access and remove all rpc_ settings from the yaml
for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra-env.sh;
do
  sed -i'' -e 's/com.sun.management.jmxremote.authenticate=true/com.sun.management.jmxremote.authenticate=false/g' "$f"
done
for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra.yaml;
do
  grep -v "rpc_" "$f" > "${f}.tmp"
  cat "${f}.tmp" > "$f"
done
ccm start
{noformat}
I used [tlp-stress|https://github.com/thelastpickle/tlp-stress] to generate a
few tens of MB of data (I killed it after some time). cassandra-stress would
work as well:
{noformat}
bin/tlp-stress run BasicTimeSeries -i 1M -p 1M -t 2 --rate 5000 \
  --replication "{'class':'SimpleStrategy', 'replication_factor':2}" \
  --compaction "{'class': 'SizeTieredCompactionStrategy'}" --host 127.0.0.1
{noformat}
Flush node1, stop it, delete all its SSTables and start it again:
{noformat}
ccm node1 nodetool flush
ccm node1 stop
rm -f ~/.ccm/inc-repair-issue/node1/data0/tlp_stress/sensor*/*.*
ccm node1 start
{noformat}
Then throttle streaming throughput to 1 Mb/s on every node, so there is time to
take node1 down during the streaming phase, and run repair:
{noformat}
ccm node1 nodetool setstreamthroughput 1
ccm node2 nodetool setstreamthroughput 1
ccm node3 nodetool setstreamthroughput 1
ccm node1 nodetool repair tlp_stress
{noformat}
Once streaming starts, shut down node1 and start it again:
{noformat}
ccm node1 stop
ccm node1 start
{noformat}
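Timing the shutdown by hand can be fiddly. A minimal sketch of a helper that
detects active streaming from {{nodetool netstats}} output follows. The function
name {{streaming_active}} is hypothetical, and the sketch assumes that active
sessions print lines such as "Receiving 3 files, ..." or "Sending 3 files, ...";
adjust the pattern if your build prints something else.

```shell
# Return success (0) if the nodetool netstats output read from stdin
# shows at least one active file transfer.
# Assumption: active sessions print "Receiving N files" or "Sending N files".
streaming_active() {
  grep -Eq '(Receiving|Sending) [0-9]+ files'
}

# Hypothetical usage against the ccm cluster created above:
#   until ccm node1 nodetool netstats | streaming_active; do sleep 1; done
#   ccm node1 stop
```

The helper only reads stdin, so it can be tried on saved netstats output before
being pointed at a live node.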
Run repair again:
{noformat}
ccm node1 nodetool repair tlp_stress
{noformat}
The command will return very quickly, showing that it skipped all SSTables:
{noformat}
[2018-08-31 19:05:16,292] Repair completed successfully
[2018-08-31 19:05:16,292] Repair command #1 finished in 2 seconds
$ ccm node1 nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load        Tokens  Owns  Host ID                               Rack
UN  127.0.0.1  228,64 KiB  256     ?     437dc9cd-b1a1-41a5-961e-cfc99763e29f  rack1
UN  127.0.0.2  60,09 MiB   256     ?     fbcbbdbb-e32a-4716-8230-8ca59aa93e62  rack1
UN  127.0.0.3  57,59 MiB   256     ?     a0b1bcc6-0fad-405a-b0bf-180a0ca31dd0  rack1
{noformat}
sstablemetadata then shows that nodes 2 and 3 still have SSTables in
"pending repair" state:
{noformat}
~/.ccm/repository/gitCOLONtrunk/tools/bin/sstablemetadata na-4-big-Data.db | grep repair
SSTable: /Users/adejanovski/.ccm/inc-repair-4.0/node2/data0/tlp_stress/sensor_data-b7375660ad3111e8a0e59357ff9c9bda/na-4-big
Pending repair: 3844a400-ad33-11e8-b5a7-6b8dd8f31b62
{noformat}
Restarting these nodes wouldn't help either.
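To check more than one SSTable at a time, the sstablemetadata output can be
filtered with a small helper. This is only a sketch: the function name
{{pending_repair_id}} and the {{DATA_DIR}} variable are hypothetical, and it
assumes the "Pending repair:" line carries a 36-character UUID when the SSTable
is locked and some other placeholder otherwise.

```shell
# Print the pending-repair session ID found in sstablemetadata output on stdin.
# Prints nothing when the "Pending repair" line does not carry a UUID.
pending_repair_id() {
  awk -F': *' '/^Pending repair:/ && $2 ~ /^[0-9a-f-]+$/ && length($2) == 36 { print $2 }'
}

# Hypothetical usage: list locked SSTables under a node's data directory.
#   for f in "$DATA_DIR"/*/*-Data.db; do
#     id=$(sstablemetadata "$f" | pending_repair_id)
#     [ -n "$id" ] && echo "$f $id"
#   done
```

Running the loop on nodes 2 and 3 before and after a restart gives a quick way
to confirm whether the session ID is still attached to the files.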
was:
The changes in CASSANDRA-9143 modified how incremental repair works. It now
applies the following sequence of events:
* Anticompaction is executed on all replicas for all SSTables overlapping the
repaired ranges
* Anticompacted SSTables are then marked as "pending repair"; they can no
longer be compacted or take part in another repair session
* Merkle trees are generated and compared
* Streaming takes place if needed
* Anticompaction is committed: the "pending repair" SSTables are marked as
repaired if the session succeeded, or released if the repair session failed.
If the repair coordinator dies during the streaming phase, *the SSTables on the
replicas will remain in "pending repair" state and will never be eligible for
repair or compaction*, even after all the nodes in the cluster are restarted.
Steps to reproduce (I used Jason's 13938 branch, which fixes streaming errors):
{noformat}
ccm create inc-repair-issue -v github:jasobrown/13938 -n 3
# Allow JMX access and remove all rpc_ settings from the yaml
for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra-env.sh;
do
  sed -i'' -e 's/com.sun.management.jmxremote.authenticate=true/com.sun.management.jmxremote.authenticate=false/g' "$f"
done
for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra.yaml;
do
  grep -v "rpc_" "$f" > "${f}.tmp"
  cat "${f}.tmp" > "$f"
done
ccm start
{noformat}
I used [tlp-stress|https://github.com/thelastpickle/tlp-stress] to generate a
few tens of MB of data (I killed it after some time). cassandra-stress would
work as well:
{noformat}
bin/tlp-stress run BasicTimeSeries -i 1M -p 1M -t 2 --rate 5000 \
  --replication "{'class':'SimpleStrategy', 'replication_factor':2}" \
  --compaction "{'class': 'SizeTieredCompactionStrategy'}" --host 127.0.0.1
{noformat}
Flush and delete all SSTables in node1:
{noformat}
ccm node1 nodetool flush
rm -f ~/.ccm/inc-repair-issue/node1/data0/tlp_stress/sensor*/*.*
{noformat}
Then throttle streaming throughput to 1 Mb/s on every node, so there is time to
take node1 down during the streaming phase, and run repair:
{noformat}
ccm node1 nodetool setstreamthroughput 1
ccm node2 nodetool setstreamthroughput 1
ccm node3 nodetool setstreamthroughput 1
ccm node1 nodetool repair tlp_stress
{noformat}
Once streaming starts, shut down node1 and start it again:
{noformat}
ccm node1 stop
ccm node1 start
{noformat}
Run repair again:
{noformat}
ccm node1 nodetool repair tlp_stress
{noformat}
The command will return very quickly, showing that it skipped all SSTables:
{noformat}
[2018-08-31 19:05:16,292] Repair completed successfully
[2018-08-31 19:05:16,292] Repair command #1 finished in 2 seconds
$ ccm node1 nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load        Tokens  Owns  Host ID                               Rack
UN  127.0.0.1  228,64 KiB  256     ?     437dc9cd-b1a1-41a5-961e-cfc99763e29f  rack1
UN  127.0.0.2  60,09 MiB   256     ?     fbcbbdbb-e32a-4716-8230-8ca59aa93e62  rack1
UN  127.0.0.3  57,59 MiB   256     ?     a0b1bcc6-0fad-405a-b0bf-180a0ca31dd0  rack1
{noformat}
sstablemetadata then shows that nodes 2 and 3 still have SSTables in
"pending repair" state:
{noformat}
~/.ccm/repository/gitCOLONtrunk/tools/bin/sstablemetadata na-4-big-Data.db | grep repair
SSTable: /Users/adejanovski/.ccm/inc-repair-4.0/node2/data0/tlp_stress/sensor_data-b7375660ad3111e8a0e59357ff9c9bda/na-4-big
Pending repair: 3844a400-ad33-11e8-b5a7-6b8dd8f31b62
{noformat}
Restarting these nodes wouldn't help either.
> Incremental repair 4.0 : SSTables remain locked forever if the coordinator
> dies during streaming
> -------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-14685
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14685
> Project: Cassandra
> Issue Type: Bug
> Components: Repair
> Reporter: Alexander Dejanovski
> Assignee: Jason Brown
> Priority: Critical