Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/22079 )
Change subject: [compaction] Add memory tracking for better OOM issues triaging ...................................................................... Patch Set 9: (6 comments) http://gerrit.cloudera.org:8080/#/c/22079/9/src/kudu/tablet/compaction-highmem-test.cc File src/kudu/tablet/compaction-highmem-test.cc: http://gerrit.cloudera.org:8080/#/c/22079/9/src/kudu/tablet/compaction-highmem-test.cc@261 PS9, Line 261: } // namespace tablet Do you think it makes sense adding a test scenario to make sure the usage of the 'major_compaction' global memtracker behaves as expected? At least, making sure the following holds true: * once a major compaction operation starts and running, the usage of 'major_compaction' increases * after a single major compaction is completed and corresponding objects are destroyed, the usage of the 'major_compaction' global memtracker is the same as it was before starting the op http://gerrit.cloudera.org:8080/#/c/22079/9/src/kudu/tablet/delta_compaction.cc File src/kudu/tablet/delta_compaction.cc: http://gerrit.cloudera.org:8080/#/c/22079/9/src/kudu/tablet/delta_compaction.cc@132 PS9, Line 132: op may not complete Is it really possible to have a major delta compaction operation left incomplete due to insufficient memory while the process (kudu-master or kudu-tserver) continues chugging along? May a major compaction operation be aborted when it runs under memory pressure, or the idea was to warn about looming OOM condition where the whole process might crash or be killed by the OS? I'm trying to understand how actionable this message is. I don't know much about compaction internals, and seems a little vague to me. What cluster operators are supposed to do when they see this in the logs? http://gerrit.cloudera.org:8080/#/c/22079/9/src/kudu/tablet/delta_compaction.cc@144 PS9, Line 144: mem_consumed == 0 nit: do we expect this method called with 0 usage at all? If not, maybe DCHECK_NE(0, mem_consumed) is a better way to handle this? If yes, then I'd expect that's quite a rare occurrence, so maybe it make sense to wrap this into PREDICT_FALSE then? >From the other side, looking at MemTracker::Consume() and >MemTracker::Release(), I can see the case of zero usage is legit and already >handled appropriately -- if so, then why to have this extra check at a higher >level? http://gerrit.cloudera.org:8080/#/c/22079/9/src/kudu/tablet/delta_compaction.cc@193 PS9, Line 193: if (delta_iter_->memory_footprint() - delta_blocks_mem_before_prepare_batch > 0) Here and below: why to have this comparison with 0 here if UpdateMemTracker is already checking that usage isn't 0? Or the idea was to protect against negative values? http://gerrit.cloudera.org:8080/#/c/22079/9/src/kudu/tablet/delta_compaction.cc@271 PS9, Line 271: // We only create a new redo delta file if we need to. What about tracking the usage of memory arena after the call to FilterColumnIdsAndCollectDeltas()? http://gerrit.cloudera.org:8080/#/c/22079/9/src/kudu/tablet/delta_store.h File src/kudu/tablet/delta_store.h: http://gerrit.cloudera.org:8080/#/c/22079/9/src/kudu/tablet/delta_store.h@479 PS9, Line 479: sizeof(PreparedDelta); AFAIK, PreparedDelta is a structure that contains Slice: struct PreparedDelta { DeltaKey key; Slice val; }; sizeof(Slice) might be minuscule compared with the size of the data that it actually points to, and DeltaKey is also small. Just do double check: so, here the idea is to track the memory footprint of the vector containing such structures, while the memory footprint of the actual data that Slices point at is accounted for somewhere else, right? -- To view, visit http://gerrit.cloudera.org:8080/22079 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic2037582d433730884212e83359bb75bad0d5394 Gerrit-Change-Number: 22079 Gerrit-PatchSet: 9 Gerrit-Owner: Ashwani Raina <[email protected]> Gerrit-Reviewer: Abhishek Chennaka <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Ashwani Raina <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Mahesh Reddy <[email protected]> Gerrit-Reviewer: Yifan Zhang <[email protected]> Gerrit-Comment-Date: Fri, 17 Jan 2025 08:04:05 +0000 Gerrit-HasComments: Yes
