Re: Repair slow, "Percent repaired" never updated

2018-06-06 Thread Martin Mačura
P.S.: Here's a corresponding log from the second node:

INFO  [AntiEntropyStage:1] 2018-06-04 13:37:16,409 Validator.java:281
- [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Sending completed
merkle tree to /14.0.53.234 for asm_log.event
INFO  [StreamReceiveTask:30] 2018-06-04 14:14:28,989
StreamResultFuture.java:187 - [Stream
#6244fd50-67ff-11e8-b07c-c365701888e8] Session with /14.0.53.234 is
complete
INFO  [StreamReceiveTask:30] 2018-06-04 14:14:28,990
StreamResultFuture.java:219 - [Stream
#6244fd50-67ff-11e8-b07c-c365701888e8] All sessions completed
INFO  [AntiEntropyStage:1] 2018-06-04 14:14:29,000
ActiveRepairService.java:452 - [repair
#af1aefc0-67c0-11e8-b07c-c365701888e8] Not a global repair, will not
do anticompaction


Why is there no anticompaction if it's an incremental repair?

We have two datacenters currently, this concerns the second one that
we recently brought up (with nodetool rebuild). We cannot do a repair
across datacenters, because nodes in the old DC would run out of disk
space.

Regards,

Martin



On Tue, Jun 5, 2018 at 6:06 PM, Martin Mačura  wrote:
> Hi,
> we're on cassandra 3.11.2, and we're having some issues with repairs.
> They take ages to complete, and some time ago the incremental repair
> stopped working - that is, SSTables are not being marked as repaired,
> even though the repair reports success.
>
> Running a full or incremental repair does not make any difference.
>
> Here's a log of a typical repair (omitted a lot of 'Maximum memory
> usage' messages):
>
> INFO  [Repair-Task-12] 2018-06-04 06:29:50,396 RepairRunnable.java:139
> - Starting repair command #11 (af1aefc0-67c0-11e8-b07c-c365701888e8),
> repairing keyspace prod with repair options (parallelism: parallel,
> primary range: false, incremental: true, job threads: 1,
> ColumnFamilies: [event], dataCenters: [DC1], hosts: [], # of ranges:
> 1280, pull repair: false)
> INFO  [Repair-Task-12] 2018-06-04 06:29:51,497 RepairSession.java:228
> - [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] new session: will
> sync /14.0.53.234, /14.0.52.115 on range [...] for asm_log.[event]
> INFO  [Repair#11:1] 2018-06-04 06:29:51,776 RepairJob.java:169 -
> [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Requesting merkle trees
> for event (to [/14.0.52.115, /14.0.53.234])
> INFO  [ValidationExecutor:10] 2018-06-04 06:31:13,859
> NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB),
> cannot allocate chunk of 1.000MiB
> WARN  [PERIODIC-COMMIT-LOG-SYNCER] 2018-06-04 06:32:01,385
> NoSpamLogger.java:94 - Out of 14 commit log syncs over the past
> 134.02s with average duration of 34.90ms, 2 have exceeded the
> configured commit interval by an average of 60.66ms
> ...
> INFO  [ValidationExecutor:10] 2018-06-04 13:31:19,011
> NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB),
> cannot allocate chunk of 1.000MiB
> INFO  [AntiEntropyStage:1] 2018-06-04 13:37:17,357
> RepairSession.java:180 - [repair
> #afc2ef90-67c0-11e8-b07c-c365701888e8] Received merkle tree for event
> from /14.0.52.115
> INFO  [ValidationExecutor:10] 2018-06-04 13:46:19,281
> NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB),
> cannot allocate chunk of 1.000MiB
> INFO  [IndexSummaryManager:1] 2018-06-04 13:57:18,772
> IndexSummaryRedistribution.java:76 - Redistributing index summaries
> INFO  [AntiEntropyStage:1] 2018-06-04 13:58:21,971
> RepairSession.java:180 - [repair
> #afc2ef90-67c0-11e8-b07c-c365701888e8] Received merkle tree for event
> from /14.0.53.234
> INFO  [RepairJobTask:4] 2018-06-04 13:58:39,780 SyncTask.java:73 -
> [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Endpoints /14.0.52.115
> and /14.0.53.234 have 15406 range(s) out of sync for event
> INFO  [RepairJobTask:4] 2018-06-04 13:58:39,781 LocalSyncTask.java:71
> - [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Performing streaming
> repair of 15406 ranges with /14.0.52.115
> INFO  [RepairJobTask:4] 2018-06-04 13:59:49,075
> StreamResultFuture.java:90 - [Stream
> #6244fd50-67ff-11e8-b07c-c365701888e8] Executing streaming plan for
> Repair
> INFO  [StreamConnectionEstablisher:3] 2018-06-04 13:59:49,076
> StreamSession.java:266 - [Stream
> #6244fd50-67ff-11e8-b07c-c365701888e8] Starting streaming to
> /14.0.52.115
> INFO  [StreamConnectionEstablisher:3] 2018-06-04 13:59:49,089
> StreamCoordinator.java:264 - [Stream
> #6244fd50-67ff-11e8-b07c-c365701888e8, ID#0] Beginning stream session
> with /14.0.52.115
> INFO  [STREAM-IN-/14.0.52.115:7000] 2018-06-04 14:01:14,423
> StreamResultFuture.java:173 - [Stream
> #6244fd50-67ff-11e8-b07c-c365701888e8 ID#0] Prepare completed.
> Receiving 321 files(6.238GiB), sending 318 files(6.209GiB)
> WARN  [Service Thread] 2018-06-04 14:12:15,578 GCInspector.java:282 -
> ConcurrentMarkSweep GC in 4095ms.  CMS Old Gen: 4086661264 ->
> 1107272664; Par Eden Space: 503316480 -> 0; Par Survivor Space:
> 21541464 -> 0
> ...
> WARN  [GossipTasks:1] 2018-06-04 14:12:15,677 FailureDetector.java:288
> - Not marking nodes down due to 

Repair slow, "Percent repaired" never updated

2018-06-05 Thread Martin Mačura
Hi,
we're on cassandra 3.11.2, and we're having some issues with repairs.
They take ages to complete, and some time ago the incremental repair
stopped working - that is, SSTables are not being marked as repaired,
even though the repair reports success.

Running a full or incremental repair does not make any difference.

Here's a log of a typical repair (omitted a lot of 'Maximum memory
usage' messages):

INFO  [Repair-Task-12] 2018-06-04 06:29:50,396 RepairRunnable.java:139
- Starting repair command #11 (af1aefc0-67c0-11e8-b07c-c365701888e8),
repairing keyspace prod with repair options (parallelism: parallel,
primary range: false, incremental: true, job threads: 1,
ColumnFamilies: [event], dataCenters: [DC1], hosts: [], # of ranges:
1280, pull repair: false)
INFO  [Repair-Task-12] 2018-06-04 06:29:51,497 RepairSession.java:228
- [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] new session: will
sync /14.0.53.234, /14.0.52.115 on range [...] for asm_log.[event]
INFO  [Repair#11:1] 2018-06-04 06:29:51,776 RepairJob.java:169 -
[repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Requesting merkle trees
for event (to [/14.0.52.115, /14.0.53.234])
INFO  [ValidationExecutor:10] 2018-06-04 06:31:13,859
NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB),
cannot allocate chunk of 1.000MiB
WARN  [PERIODIC-COMMIT-LOG-SYNCER] 2018-06-04 06:32:01,385
NoSpamLogger.java:94 - Out of 14 commit log syncs over the past
134.02s with average duration of 34.90ms, 2 have exceeded the
configured commit interval by an average of 60.66ms
...
INFO  [ValidationExecutor:10] 2018-06-04 13:31:19,011
NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB),
cannot allocate chunk of 1.000MiB
INFO  [AntiEntropyStage:1] 2018-06-04 13:37:17,357
RepairSession.java:180 - [repair
#afc2ef90-67c0-11e8-b07c-c365701888e8] Received merkle tree for event
from /14.0.52.115
INFO  [ValidationExecutor:10] 2018-06-04 13:46:19,281
NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB),
cannot allocate chunk of 1.000MiB
INFO  [IndexSummaryManager:1] 2018-06-04 13:57:18,772
IndexSummaryRedistribution.java:76 - Redistributing index summaries
INFO  [AntiEntropyStage:1] 2018-06-04 13:58:21,971
RepairSession.java:180 - [repair
#afc2ef90-67c0-11e8-b07c-c365701888e8] Received merkle tree for event
from /14.0.53.234
INFO  [RepairJobTask:4] 2018-06-04 13:58:39,780 SyncTask.java:73 -
[repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Endpoints /14.0.52.115
and /14.0.53.234 have 15406 range(s) out of sync for event
INFO  [RepairJobTask:4] 2018-06-04 13:58:39,781 LocalSyncTask.java:71
- [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Performing streaming
repair of 15406 ranges with /14.0.52.115
INFO  [RepairJobTask:4] 2018-06-04 13:59:49,075
StreamResultFuture.java:90 - [Stream
#6244fd50-67ff-11e8-b07c-c365701888e8] Executing streaming plan for
Repair
INFO  [StreamConnectionEstablisher:3] 2018-06-04 13:59:49,076
StreamSession.java:266 - [Stream
#6244fd50-67ff-11e8-b07c-c365701888e8] Starting streaming to
/14.0.52.115
INFO  [StreamConnectionEstablisher:3] 2018-06-04 13:59:49,089
StreamCoordinator.java:264 - [Stream
#6244fd50-67ff-11e8-b07c-c365701888e8, ID#0] Beginning stream session
with /14.0.52.115
INFO  [STREAM-IN-/14.0.52.115:7000] 2018-06-04 14:01:14,423
StreamResultFuture.java:173 - [Stream
#6244fd50-67ff-11e8-b07c-c365701888e8 ID#0] Prepare completed.
Receiving 321 files(6.238GiB), sending 318 files(6.209GiB)
WARN  [Service Thread] 2018-06-04 14:12:15,578 GCInspector.java:282 -
ConcurrentMarkSweep GC in 4095ms.  CMS Old Gen: 4086661264 ->
1107272664; Par Eden Space: 503316480 -> 0; Par Survivor Space:
21541464 -> 0
...
WARN  [GossipTasks:1] 2018-06-04 14:12:15,677 FailureDetector.java:288
- Not marking nodes down due to local pause of 5123793157 > 50
INFO  [ScheduledTasks:1] 2018-06-04 14:12:20,611 NoSpamLogger.java:91
- Some operations were slow, details available at debug level
(debug.log)
INFO  [STREAM-IN-/14.0.52.115:7000] 2018-06-04 14:14:29,188
StreamResultFuture.java:187 - [Stream
#6244fd50-67ff-11e8-b07c-c365701888e8] Session with /14.0.52.115 is
complete
INFO  [STREAM-IN-/14.0.52.115:7000] 2018-06-04 14:14:29,190
StreamResultFuture.java:219 - [Stream
#6244fd50-67ff-11e8-b07c-c365701888e8] All sessions completed
INFO  [STREAM-IN-/14.0.52.115:7000] 2018-06-04 14:14:29,190
LocalSyncTask.java:121 - [repair
#afc2ef90-67c0-11e8-b07c-c365701888e8] Sync complete using session
afc2ef90-67c0-11e8-b07c-c365701888e8 between /14.0.52.115 and
/14.0.53.234 on event
INFO  [RepairJobTask:5] 2018-06-04 14:14:29,191 RepairJob.java:143 -
[repair #afc2ef90-67c0-11e8-b07c-c365701888e8] event is fully synced
INFO  [RepairJobTask:5] 2018-06-04 14:14:29,193 RepairSession.java:270
- [repair #afc2ef90-67c0-11e8-b07c-c365701888e8] Session completed
successfully



Tablestats:
   SSTable count: 714
   Space used (live): 489416489322
   Space used (total): 489416489322
   Space used by snapshots (total): 0