Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/22079 )
Change subject: [compaction] Add memory tracking for better OOM issues triaging ...................................................................... Patch Set 11: (9 comments) http://gerrit.cloudera.org:8080/#/c/22079/10/src/kudu/tablet/delta_compaction.h File src/kudu/tablet/delta_compaction.h: http://gerrit.cloudera.org:8080/#/c/22079/10/src/kudu/tablet/delta_compaction.h@102 PS10, Line 102: void MemoryExceededWarnMsgs(); nit: add a comment for the new method, so it's documented similar to the rest of the methods in this class http://gerrit.cloudera.org:8080/#/c/22079/10/src/kudu/tablet/delta_compaction.h@104 PS10, Line 104: enum MemTrackerAction { : CONSUME, : RELEASE : }; Is this still needed? http://gerrit.cloudera.org:8080/#/c/22079/10/src/kudu/tablet/delta_compaction.h@109 PS10, Line 109: // Update memory tracker for compaction operation for the tablet under process. nit: does it make sense to mention about logging warning messages as well? http://gerrit.cloudera.org:8080/#/c/22079/10/src/kudu/tablet/delta_compaction.h@111 PS10, Line 111: FsManager* const fs_manager_; nit: add an empty line to separate the declaration of the UpdateMemTracker() member function from the declaration of the 'fs_manager_' member field. http://gerrit.cloudera.org:8080/#/c/22079/9/src/kudu/tablet/delta_compaction.cc File src/kudu/tablet/delta_compaction.cc: http://gerrit.cloudera.org:8080/#/c/22079/9/src/kudu/tablet/delta_compaction.cc@193 PS9, Line 193: > OK, them maybe then get rid of those 'if(...)' in the code and just call 'U s/them// http://gerrit.cloudera.org:8080/#/c/22079/10/src/kudu/tablet/delta_compaction.cc File src/kudu/tablet/delta_compaction.cc: http://gerrit.cloudera.org:8080/#/c/22079/10/src/kudu/tablet/delta_compaction.cc@130 PS10, Line 130: if (process_memory::UnderMemoryPressure(&capacity_pct)) { > The main reason behind comparing the current consumption against > memory_limit_warn_threshold_percentage is to ensure we start emitting warning > messages before breaching hard limit. OK, but what can be done even if those messages are emitted before breaching the hard limit? As you mentioned in your message at https://gerrit.cloudera.org/#/c/22079/9/src/kudu/tablet/delta_compaction.cc@132, there isn't anything actionable here and the code doesn't behave any specific way if hitting such a condition: > This patch does not take any action to avoid OOM condition or suggest > operators to take any action. > Isn't it better to receive such warnings early on rather than late when hard > limit has already been crossed? Since there isn't any possible action to take (as per information above), and the compaction op continues running regardless of breaching hard memory limit, we are talking about post-mortem analysis only, right? With that, there isn't any difference how preemptive those warnings are. IMO, the main point to focus would rather be adding as less noise and false positives, as possible. My point: if this patch is trying to help pin-pointing conditions of going over the hard memory due to memory over-allocation by compaction activity with the risk of the OOM killer's involvement, there should be less noise and false positives around such warnings. If detecting memory usage being close to the hard memory within the threshold defined by --memory_limit_warn_threshold_percentage, how to tell that the condition has been reached due to compaction logic allocating too much memory and not something else (e.g., fuzziness in memory accounting by MemoryTrackers, slow IO while flushing the data to disk, etc.)? By issuing a warning only when the hard memory is overrun by significant margin would at least allow for limiting the scope of these warnings to the deficiencies in memory allocation in major delta and rowset merge compactions: those are the places we know it can run amok by a large margin, while all the other places honor the hard memory limit, IIUC. However, I might be missing something or simply misunderstanding the purpose of this patch. What do you think? http://gerrit.cloudera.org:8080/#/c/22079/11/src/kudu/tablet/delta_compaction.cc File src/kudu/tablet/delta_compaction.cc: http://gerrit.cloudera.org:8080/#/c/22079/11/src/kudu/tablet/delta_compaction.cc@133 PS11, Line 133: "MajorDeltaCompaction op may not complete due to lack of sufficient memory " : "(at %.2f%% of capacity), current compaction consumption: %ld, total consumption " : "for all running major delta compaction ops: %ld, tablet: %s" Maybe, be more direct of the expected OOM condition in this message? The phrasing of 'MajorDeltaCompaction op may not complete' assumes only such operations are affected, and also hints those ops are somehow aborted/cancelled, but that's not a case at all. http://gerrit.cloudera.org:8080/#/c/22079/11/src/kudu/tablet/delta_compaction.cc@176 PS11, Line 176: // 1) Get the next batch of base data for the columns we're compacting. Does it make sense to UpdateMemTracker() with the arena's memory usage when it's being reset at line 177? Otherwise, the left-overs from prior iteration would be accounted twice at line 192 of PS11, right? http://gerrit.cloudera.org:8080/#/c/22079/10/src/kudu/tablet/delta_store.h File src/kudu/tablet/delta_store.h: http://gerrit.cloudera.org:8080/#/c/22079/10/src/kudu/tablet/delta_store.h@479 PS10, Line 479: int64_t slice_data_size_total = 0; : int64_t delta_obj_size_total = prepared_deltas_.size() * sizeof(PreparedDelta); nit: could we compute the result of this function/method using just one temporary variable instead of two? Having just one variable and adding a comment could explain what's going on here much better than two variables having names that might be cryptic to a reader without much of the context. -- To view, visit http://gerrit.cloudera.org:8080/22079 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic2037582d433730884212e83359bb75bad0d5394 Gerrit-Change-Number: 22079 Gerrit-PatchSet: 11 Gerrit-Owner: Ashwani Raina <[email protected]> Gerrit-Reviewer: Abhishek Chennaka <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Ashwani Raina <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Mahesh Reddy <[email protected]> Gerrit-Reviewer: Yifan Zhang <[email protected]> Gerrit-Comment-Date: Fri, 24 Jan 2025 02:48:22 +0000 Gerrit-HasComments: Yes
