[ https://issues.apache.org/jira/browse/CASSANDRA-8366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229328#comment-14229328 ]
Alan Boudreault edited comment on CASSANDRA-8366 at 12/1/14 6:53 PM:
---------------------------------------------------------------------

I have been able to reproduce the issue with 2.1.2 and the cassandra-2.1 branch. From my tests, the issue seems to be related to parallel incremental repairs; I don't see it with full repairs. With full repairs, the storage size increases, but everything is fine again after a compaction. With incremental repairs, I've seen nodes go from 1.5G to 15G of storage. It looks like something is broken with incremental repairs. Most of the time, I get one of the following errors during the repairs:

* Repair session 6f6c4ae0-78d6-11e4-9b48-b56034537865 for range (3074457345618258602,-9223372036854775808] failed with error org.apache.cassandra.exceptions.RepairException: [repair #6f6c4ae0-78d6-11e4-9b48-b56034537865 on r1/Standard1, (3074457345618258602,-9223372036854775808]] Sync failed between /127.0.0.1 and /127.0.0.3
* Repair failed with error Did not get positive replies from all endpoints. List of failed endpoint(s): [127.0.0.1]

So this issue might be related to CASSANDRA-8316. I've attached the script I used to reproduce the issue, along with 3 result files.
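For reference, a minimal sketch of the reproduction approach is below. The attached test.sh is the authoritative script; the cluster name, Cassandra version, operation count and exact ccm/cassandra-stress invocations here are assumptions rather than a copy of it:

    # Sketch only -- see the attached test.sh for the real reproduction script.
    # Cluster name, op count and exact ccm argument quoting are assumptions.
    ccm create repair-8366 -v 2.1.2 -n 3 -s        # 3-node local cluster, started

    # Write some data. The original report used a client doing 250 writes and
    # 250 reads per second for 2 hours through the CQL driver; cassandra-stress
    # is used here for brevity.
    cassandra-stress write n=5000000 -node 127.0.0.1

    # Run parallel incremental repairs on each node in turn and record the
    # reported load before and after each run.
    ccm node1 nodetool status
    for node in node1 node2 node3; do
        ccm $node nodetool "repair -par -inc"
        ccm $node nodetool status
    done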
> Repair grows data on nodes, causes load to become unbalanced
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-8366
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8366
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: 4 node cluster
>                      2.1.2 Cassandra
>                      Inserts and reads are done with CQL driver
>            Reporter: Jan Karlsson
>            Assignee: Alan Boudreault
>         Attachments: results-17500000_inc_repair.txt, results-5000000_1_inc_repairs.txt, results-5000000_2_inc_repairs.txt, results-5000000_full_repair_then_inc_repairs.txt, results-5000000_inc_repairs_not_parallel.txt, test.sh
>
> There seems to be something weird going on when repairing data.
> I have a program that runs for 2 hours, inserting 250 random numbers and reading 250 times per second. It creates 2 keyspaces with SimpleStrategy and an RF of 3. I use size-tiered compaction for my cluster.
> After those 2 hours I run a repair and the load of all nodes goes up. If I run incremental repair, the load goes up a lot more. I saw the load shoot up to 8 times the original size multiple times with incremental repair (from 2G to 16G).
> With nodes 9, 8, 7 and 6, the repro procedure looked like this (note that running the full repair first is not a requirement to reproduce):
>
> After 2 hours of 250 reads + 250 writes per second:
> UN  9  583.39 MB  256  ?  28220962-26ae-4eeb-8027-99f96e377406  rack1
> UN  8  584.01 MB  256  ?  f2de6ea1-de88-4056-8fde-42f9c476a090  rack1
> UN  7  583.72 MB  256  ?  2b6b5d66-13c8-43d8-855c-290c0f3c3a0b  rack1
> UN  6  583.84 MB  256  ?  b8bd67f1-a816-46ff-b4a4-136ad5af6d4b  rack1
>
> Repair -pr -par on all nodes sequentially:
> UN  9  746.29 MB  256  ?  28220962-26ae-4eeb-8027-99f96e377406  rack1
> UN  8  751.02 MB  256  ?  f2de6ea1-de88-4056-8fde-42f9c476a090  rack1
> UN  7  748.89 MB  256  ?  2b6b5d66-13c8-43d8-855c-290c0f3c3a0b  rack1
> UN  6  758.34 MB  256  ?  b8bd67f1-a816-46ff-b4a4-136ad5af6d4b  rack1
>
> Repair -inc -par on all nodes sequentially:
> UN  9  2.41 GB    256  ?  28220962-26ae-4eeb-8027-99f96e377406  rack1
> UN  8  2.53 GB    256  ?  f2de6ea1-de88-4056-8fde-42f9c476a090  rack1
> UN  7  2.6 GB     256  ?  2b6b5d66-13c8-43d8-855c-290c0f3c3a0b  rack1
> UN  6  2.17 GB    256  ?  b8bd67f1-a816-46ff-b4a4-136ad5af6d4b  rack1
>
> After a rolling restart:
> UN  9  1.47 GB    256  ?  28220962-26ae-4eeb-8027-99f96e377406  rack1
> UN  8  1.5 GB     256  ?  f2de6ea1-de88-4056-8fde-42f9c476a090  rack1
> UN  7  2.46 GB    256  ?  2b6b5d66-13c8-43d8-855c-290c0f3c3a0b  rack1
> UN  6  1.19 GB    256  ?  b8bd67f1-a816-46ff-b4a4-136ad5af6d4b  rack1
>
> Compact all nodes sequentially:
> UN  9  989.99 MB  256  ?  28220962-26ae-4eeb-8027-99f96e377406  rack1
> UN  8  994.75 MB  256  ?  f2de6ea1-de88-4056-8fde-42f9c476a090  rack1
> UN  7  1.46 GB    256  ?  2b6b5d66-13c8-43d8-855c-290c0f3c3a0b  rack1
> UN  6  758.82 MB  256  ?  b8bd67f1-a816-46ff-b4a4-136ad5af6d4b  rack1
>
> Repair -inc -par on all nodes sequentially:
> UN  9  1.98 GB    256  ?  28220962-26ae-4eeb-8027-99f96e377406  rack1
> UN  8  2.3 GB     256  ?  f2de6ea1-de88-4056-8fde-42f9c476a090  rack1
> UN  7  3.71 GB    256  ?  2b6b5d66-13c8-43d8-855c-290c0f3c3a0b  rack1
> UN  6  1.68 GB    256  ?  b8bd67f1-a816-46ff-b4a4-136ad5af6d4b  rack1
>
> Restart once more:
> UN  9  2 GB       256  ?  28220962-26ae-4eeb-8027-99f96e377406  rack1
> UN  8  2.05 GB    256  ?  f2de6ea1-de88-4056-8fde-42f9c476a090  rack1
> UN  7  4.1 GB     256  ?  2b6b5d66-13c8-43d8-855c-290c0f3c3a0b  rack1
> UN  6  1.68 GB    256  ?  b8bd67f1-a816-46ff-b4a4-136ad5af6d4b  rack1
>
> Is there something I'm missing, or is this strange behavior?
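For reference, the repro sequence quoted above corresponds roughly to the following commands (a sketch only; the 127.0.0.x addresses are placeholders for nodes 9, 8, 7 and 6, and the rolling restart is done outside nodetool):

    # The procedure from the description, expressed as nodetool invocations.
    # Host addresses are placeholders for the four nodes in the report.
    HOSTS="127.0.0.1 127.0.0.2 127.0.0.3 127.0.0.4"

    for h in $HOSTS; do nodetool -h $h repair -pr -par; done    # full parallel repair of primary ranges
    for h in $HOSTS; do nodetool -h $h repair -inc -par; done   # incremental parallel repair
    # ... rolling restart of the cluster here ...
    for h in $HOSTS; do nodetool -h $h compact; done            # major compaction on every node
    nodetool -h 127.0.0.1 status                                # check the reported load per node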