Re: Repair slow, "Percent repaired" never updated
P.S.: Here's a corresponding log from the second node:

INFO [AntiEntropyStage:1] 2018-06-04 13:37:16,409 Validator.java:281 - [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Sending completed merkle tree to /14.0.53.234 for asm_log.event
INFO [StreamReceiveTask:30] 2018-06-04 14:14:28,989 StreamResultFuture.java:187 - [Stream #6244fd50-67ff-11e8-b07c-c365701888e8] Session with /14.0.53.234 is complete
INFO [StreamReceiveTask:30] 2018-06-04 14:14:28,990 StreamResultFuture.java:219 - [Stream #6244fd50-67ff-11e8-b07c-c365701888e8] All sessions completed
INFO [AntiEntropyStage:1] 2018-06-04 14:14:29,000 ActiveRepairService.java:452 - [repair #af1aefc0-67c0-11e8-b07c-c365701888e8] Not a global repair, will not do anticompaction

Why is there no anticompaction if it's an incremental repair?

We have two datacenters currently; this concerns the second one, which we recently brought up (with nodetool rebuild). We cannot do a repair across datacenters, because nodes in the old DC would run out of disk space.

Regards,
Martin

On Tue, Jun 5, 2018 at 6:06 PM, Martin Mačura wrote:
> Hi,
> we're on cassandra 3.11.2, and we're having some issues with repairs.
> They take ages to complete, and some time ago the incremental repair
> stopped working - that is, SSTables are not being marked as repaired,
> even though the repair reports success.
>
> Running a full or incremental repair does not make any difference.
>
> Here's a log of a typical repair (omitted a lot of 'Maximum memory
> usage' messages):
>
> INFO [Repair-Task-12] 2018-06-04 06:29:50,396 RepairRunnable.java:139 - Starting repair command #11 (af1aefc0-67c0-11e8-b07c-c365701888e8), repairing keyspace prod with repair options (parallelism: parallel, primary range: false, incremental: true, job threads: 1, ColumnFamilies: [event], dataCenters: [DC1], hosts: [], # of ranges: 1280, pull repair: false)
> INFO [Repair-Task-12] 2018-06-04 06:29:51,497 RepairSession.java:228 - [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] new session: will sync /14.0.53.234, /14.0.52.115 on range [...] for asm_log.[event]
> INFO [Repair#11:1] 2018-06-04 06:29:51,776 RepairJob.java:169 - [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Requesting merkle trees for event (to [/14.0.52.115, /14.0.53.234])
> INFO [ValidationExecutor:10] 2018-06-04 06:31:13,859 NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB
> WARN [PERIODIC-COMMIT-LOG-SYNCER] 2018-06-04 06:32:01,385 NoSpamLogger.java:94 - Out of 14 commit log syncs over the past 134.02s with average duration of 34.90ms, 2 have exceeded the configured commit interval by an average of 60.66ms
> ...
> INFO [ValidationExecutor:10] 2018-06-04 13:31:19,011 NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB
> INFO [AntiEntropyStage:1] 2018-06-04 13:37:17,357 RepairSession.java:180 - [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Received merkle tree for event from /14.0.52.115
> INFO [ValidationExecutor:10] 2018-06-04 13:46:19,281 NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB
> INFO [IndexSummaryManager:1] 2018-06-04 13:57:18,772 IndexSummaryRedistribution.java:76 - Redistributing index summaries
> INFO [AntiEntropyStage:1] 2018-06-04 13:58:21,971 RepairSession.java:180 - [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Received merkle tree for event from /14.0.53.234
> INFO [RepairJobTask:4] 2018-06-04 13:58:39,780 SyncTask.java:73 - [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Endpoints /14.0.52.115 and /14.0.53.234 have 15406 range(s) out of sync for event
> INFO [RepairJobTask:4] 2018-06-04 13:58:39,781 LocalSyncTask.java:71 - [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Performing streaming repair of 15406 ranges with /14.0.52.115
> INFO [RepairJobTask:4] 2018-06-04 13:59:49,075 StreamResultFuture.java:90 - [Stream #6244fd50-67ff-11e8-b07c-c365701888e8] Executing streaming plan for Repair
> INFO [StreamConnectionEstablisher:3] 2018-06-04 13:59:49,076 StreamSession.java:266 - [Stream #6244fd50-67ff-11e8-b07c-c365701888e8] Starting streaming to /14.0.52.115
> INFO [StreamConnectionEstablisher:3] 2018-06-04 13:59:49,089 StreamCoordinator.java:264 - [Stream #6244fd50-67ff-11e8-b07c-c365701888e8, ID#0] Beginning stream session with /14.0.52.115
> INFO [STREAM-IN-/14.0.52.115:7000] 2018-06-04 14:01:14,423 StreamResultFuture.java:173 - [Stream #6244fd50-67ff-11e8-b07c-c365701888e8 ID#0] Prepare completed. Receiving 321 files(6.238GiB), sending 318 files(6.209GiB)
> WARN [Service Thread] 2018-06-04 14:12:15,578 GCInspector.java:282 - ConcurrentMarkSweep GC in 4095ms. CMS Old Gen: 4086661264 -> 1107272664; Par Eden Space: 503316480 -> 0; Par Survivor Space: 21541464 -> 0
> ...
> WARN [GossipTasks:1] 2018-06-04 14:12:15,677 FailureDetector.java:288 - Not marking nodes down due to local pause of 5123793157 > 50
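The repair options in the log above (incremental: true, dataCenters: [DC1]) correspond to an invocation roughly like the following. This is a sketch reconstructed from the logged options, not the actual command the thread used:

```shell
# DC-local incremental repair matching the logged options
# (keyspace prod, table event, dataCenters: [DC1]):
nodetool repair -dc DC1 prod event
```

As far as I can tell, this answers the anticompaction question: in Cassandra 3.11, a repair restricted to a subset of datacenters with -dc (or to specific hosts, or to a subrange) is not considered a "global" repair, which is exactly what the "Not a global repair, will not do anticompaction" message indicates. Without anticompaction, SSTables never get their repairedAt marker set, so "Percent repaired" stays at zero even though each repair session reports success. Only an incremental repair spanning all replicas in all datacenters would trigger anticompaction.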
Repair slow, "Percent repaired" never updated
Hi,
we're on cassandra 3.11.2, and we're having some issues with repairs.
They take ages to complete, and some time ago the incremental repair
stopped working - that is, SSTables are not being marked as repaired,
even though the repair reports success.

Running a full or incremental repair does not make any difference.

Here's a log of a typical repair (omitted a lot of 'Maximum memory usage' messages):

INFO [Repair-Task-12] 2018-06-04 06:29:50,396 RepairRunnable.java:139 - Starting repair command #11 (af1aefc0-67c0-11e8-b07c-c365701888e8), repairing keyspace prod with repair options (parallelism: parallel, primary range: false, incremental: true, job threads: 1, ColumnFamilies: [event], dataCenters: [DC1], hosts: [], # of ranges: 1280, pull repair: false)
INFO [Repair-Task-12] 2018-06-04 06:29:51,497 RepairSession.java:228 - [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] new session: will sync /14.0.53.234, /14.0.52.115 on range [...] for asm_log.[event]
INFO [Repair#11:1] 2018-06-04 06:29:51,776 RepairJob.java:169 - [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Requesting merkle trees for event (to [/14.0.52.115, /14.0.53.234])
INFO [ValidationExecutor:10] 2018-06-04 06:31:13,859 NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB
WARN [PERIODIC-COMMIT-LOG-SYNCER] 2018-06-04 06:32:01,385 NoSpamLogger.java:94 - Out of 14 commit log syncs over the past 134.02s with average duration of 34.90ms, 2 have exceeded the configured commit interval by an average of 60.66ms
...
INFO [ValidationExecutor:10] 2018-06-04 13:31:19,011 NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB
INFO [AntiEntropyStage:1] 2018-06-04 13:37:17,357 RepairSession.java:180 - [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Received merkle tree for event from /14.0.52.115
INFO [ValidationExecutor:10] 2018-06-04 13:46:19,281 NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB
INFO [IndexSummaryManager:1] 2018-06-04 13:57:18,772 IndexSummaryRedistribution.java:76 - Redistributing index summaries
INFO [AntiEntropyStage:1] 2018-06-04 13:58:21,971 RepairSession.java:180 - [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Received merkle tree for event from /14.0.53.234
INFO [RepairJobTask:4] 2018-06-04 13:58:39,780 SyncTask.java:73 - [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Endpoints /14.0.52.115 and /14.0.53.234 have 15406 range(s) out of sync for event
INFO [RepairJobTask:4] 2018-06-04 13:58:39,781 LocalSyncTask.java:71 - [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Performing streaming repair of 15406 ranges with /14.0.52.115
INFO [RepairJobTask:4] 2018-06-04 13:59:49,075 StreamResultFuture.java:90 - [Stream #6244fd50-67ff-11e8-b07c-c365701888e8] Executing streaming plan for Repair
INFO [StreamConnectionEstablisher:3] 2018-06-04 13:59:49,076 StreamSession.java:266 - [Stream #6244fd50-67ff-11e8-b07c-c365701888e8] Starting streaming to /14.0.52.115
INFO [StreamConnectionEstablisher:3] 2018-06-04 13:59:49,089 StreamCoordinator.java:264 - [Stream #6244fd50-67ff-11e8-b07c-c365701888e8, ID#0] Beginning stream session with /14.0.52.115
INFO [STREAM-IN-/14.0.52.115:7000] 2018-06-04 14:01:14,423 StreamResultFuture.java:173 - [Stream #6244fd50-67ff-11e8-b07c-c365701888e8 ID#0] Prepare completed. Receiving 321 files(6.238GiB), sending 318 files(6.209GiB)
WARN [Service Thread] 2018-06-04 14:12:15,578 GCInspector.java:282 - ConcurrentMarkSweep GC in 4095ms. CMS Old Gen: 4086661264 -> 1107272664; Par Eden Space: 503316480 -> 0; Par Survivor Space: 21541464 -> 0
...
WARN [GossipTasks:1] 2018-06-04 14:12:15,677 FailureDetector.java:288 - Not marking nodes down due to local pause of 5123793157 > 50
INFO [ScheduledTasks:1] 2018-06-04 14:12:20,611 NoSpamLogger.java:91 - Some operations were slow, details available at debug level (debug.log)
INFO [STREAM-IN-/14.0.52.115:7000] 2018-06-04 14:14:29,188 StreamResultFuture.java:187 - [Stream #6244fd50-67ff-11e8-b07c-c365701888e8] Session with /14.0.52.115 is complete
INFO [STREAM-IN-/14.0.52.115:7000] 2018-06-04 14:14:29,190 StreamResultFuture.java:219 - [Stream #6244fd50-67ff-11e8-b07c-c365701888e8] All sessions completed
INFO [STREAM-IN-/14.0.52.115:7000] 2018-06-04 14:14:29,190 LocalSyncTask.java:121 - [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Sync complete using session afc2ef90-67c0-11e8-b07c-c365701888e8 between /14.0.52.115 and /14.0.53.234 on event
INFO [RepairJobTask:5] 2018-06-04 14:14:29,191 RepairJob.java:143 - [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] event is fully synced
INFO [RepairJobTask:5] 2018-06-04 14:14:29,193 RepairSession.java:270 - [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Session completed successfully

Tablestats:
    SSTable count: 714
    Space used (live): 489416489322
    Space used (total): 489416489322
    Space used by snapshots (total): 0
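The "not marked as repaired" symptom can also be confirmed at the SSTable level with the sstablemetadata tool, which prints the repairedAt marker directly. A sketch — the path below is illustrative and should be adjusted to the node's actual data directory and SSTable generation:

```shell
# "Repaired at: 0" means the SSTable has never been marked repaired;
# a non-zero value is the timestamp of the repair that anticompacted it.
sstablemetadata /var/lib/cassandra/data/asm_log/event-*/mc-*-big-Data.db | grep "Repaired at"
```

If every SSTable shows 0 despite incremental repairs reporting success, that is consistent with the anticompaction step never running.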