[
https://issues.apache.org/jira/browse/KUDU-3406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17650003#comment-17650003
]
ASF subversion and git services commented on KUDU-3406:
-------------------------------------------------------
Commit 1556a353e60c5d555996347cbd46d5e5a6661266 in kudu's branch
refs/heads/master from Alexey Serbin
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=1556a353e ]
KUDU-3406 memory budgeting for CompactRowSetsOp
This patch implements memory budgeting for performing rowset merge
compactions (i.e. CompactRowSetsOp maintenance operations).
The idea is to check whether there is enough memory left before reaching
the hard memory limit when starting a CompactRowSetsOp. An estimate for the
amount of memory necessary to perform the operation is based on the
total on-disk size of all deltas in rowsets selected for the merge
compaction and the ratio of memory-to-disk size when loading those
deltas in memory to perform the merge rowset compaction. If there is
enough memory, then a rowset is considered as an input for merge
compaction; otherwise it's skipped. Meanwhile, REDO deltas become
UNDO deltas after major delta compactions run on the rowset, and UNDO
deltas eventually become ancient, so UndoDeltaBlockGCOp drops those.
With that, the amount of memory required to load a rowset's delta data
into memory shrinks over the long run, and eventually the rowset becomes
an input again for one of the future runs of the CompactRowSetsOp
maintenance operation.
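To make the budgeting idea concrete, here is a simplified, self-contained
sketch of the check; FitsMemoryBudget and its parameters are hypothetical
stand-ins, not the actual Kudu code:
{noformat}
#include <cstdint>
#include <iostream>

// Returns true if compacting rowsets whose deltas occupy
// 'deltas_on_disk_bytes' on disk is expected to fit into the memory
// budget remaining before the hard memory limit.
bool FitsMemoryBudget(int64_t deltas_on_disk_bytes,
                      // Average of the compact_rs_mem_usage_to_deltas_size_ratio
                      // metric, or the --rowset_compaction_delta_memory_factor
                      // value (default 5.0) when no stats are available yet.
                      double mem_to_disk_ratio,
                      int64_t hard_limit_bytes,
                      int64_t current_usage_bytes) {
  const auto estimated_bytes = static_cast<int64_t>(
      deltas_on_disk_bytes * mem_to_disk_ratio);
  return current_usage_bytes + estimated_bytes < hard_limit_bytes;
}

int main() {
  // Example: 2 GiB of deltas with factor 5.0 is estimated at ~10 GiB;
  // with 4 GiB already in use under a 16 GiB hard limit it fits (prints 1).
  std::cout << FitsMemoryBudget(2LL << 30, 5.0, 16LL << 30, 4LL << 30)
            << std::endl;
  return 0;
}
{noformat}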
Prior to this patch, the root cause of running out of memory when
performing CompactRowSetsOp was that the operation tried to allocate
too much memory at once, at least due to the following factors:
* many UNDO deltas might accumulate in rowsets selected for the
compaction operation because of the relatively high setting for the
--tablet_history_max_age_sec flag (7 days) and a particular workload
that issues many updates for rows in the same rowset
* even though it's a merge-like operation by its nature, the current
implementation of CompactRowSetsOp allocates all the memory
necessary to load the UNDO deltas at once, and it keeps all the
preliminary results in memory as well before persisting
the result data to disk
* the current implementation of CompactRowSetsOp loads all the UNDO
deltas from the rowsets selected for compaction regardless of whether
they are ancient or not; it discards the data sourced from the
ancient deltas only at the very end, before persisting the result data
Ideally, the current implementation of CompactRowSetsOp should be
refactored to merge the deltas in participating rowsets sequentially,
chunk by chunk, persisting the results and allocating memory just for
a small batch of processed deltas instead of loading all the deltas at once.
A future patch should take care of that, while this patch provides an
interim approach using memory budgeting on top of the current
CompactRowSetsOp implementation as-is.
The newly introduced behavior is gated by the following two flags:
* rowset_compaction_memory_estimate_enabled: whether to enable memory
budgeting for CompactRowSetsOp (default is 'false').
* rowset_compaction_ancient_delta_threshold_enabled: whether to
check against the ratio of ancient UNDO deltas across rowsets
selected for compaction (default is 'true').
In addition, the following flags allow for tweaking the new
behavior gated by the corresponding flags above (an example invocation
follows the list):
* rowset_compaction_delta_memory_factor: the multiplication factor for
the total size of a rowset's deltas to estimate how much memory
CompactRowSetsOp would consume if operating on those deltas when
no runtime stats for the compact_rs_mem_usage_to_deltas_size_ratio
metric are yet available (default is 5.0)
* rowset_compaction_ancient_delta_max_ratio: the threshold for the
ratio of the data size in ancient UNDO deltas to the total data size
of UNDO deltas in the rowsets selected for merge compaction
* rowset_compaction_estimate_min_deltas_size_mb: the threshold on the
total size of a rowset's deltas to apply the memory budgeting
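As a usage illustration, the memory budgeting could be enabled like below
(values other than the documented defaults are purely illustrative, and
--memory_limit_hard_bytes is shown only for context; the same flags apply
to kudu-master as well):
{noformat}
kudu-tserver \
  --rowset_compaction_memory_estimate_enabled=true \
  --rowset_compaction_ancient_delta_threshold_enabled=true \
  --rowset_compaction_delta_memory_factor=5.0 \
  --memory_limit_hard_bytes=17179869184 \
  <other flags ...>
{noformat}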
To complement the --rowset_compaction_delta_memory_factor flag with more
tablet-specific stats, two new per-tablet metrics have been introduced:
* compact_rs_mem_usage is a histogram to gather statistics on how much
memory rowset merge compaction consumed
* compact_rs_mem_usage_to_deltas_size_ratio is a histogram to track
the memory-to-disk size ratio for a tablet's rowsets participating in
merge compaction -- this metric provides the average that is used
as a more precise factor to estimate the amount of memory a rowset's
deltas would use when undergoing merge compaction given the on-disk
size of all the rowset's deltas (a sample metrics query follows the list)
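For example, once a server has performed some rowset merge compactions, the
new histograms might be inspected via the server's /metrics web endpoint
(assuming the default web UI ports, 8050 for kudu-tserver and 8051 for
kudu-master; the hostname is a placeholder):
{noformat}
curl 'http://tserver.example.com:8050/metrics?metrics=compact_rs_mem_usage'
{noformat}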
This patch doesn't add a test, but I verified how the new functionality
works with real data from a case where merge rowset compaction would
take about 28GByte if not constrained by the memory limit. I'm planning
to add a test in a follow-up changelist based on the following patch
once the latter appears in the git repository:
https://gerrit.cloudera.org/#/c/19278
Change-Id: I89c171284944831e95c45a993d85fbefe89048cf
Reviewed-on: http://gerrit.cloudera.org:8080/19281
Reviewed-by: Attila Bukor <[email protected]>
Tested-by: Kudu Jenkins
> CompactRowSetsOp can allocate much more memory than specified by the hard
> memory limit
> --------------------------------------------------------------------------------------
>
> Key: KUDU-3406
> URL: https://issues.apache.org/jira/browse/KUDU-3406
> Project: Kudu
> Issue Type: Bug
> Components: master, tserver
> Affects Versions: 1.13.0, 1.14.0, 1.15.0, 1.16.0
> Reporter: Alexey Serbin
> Assignee: Ashwani Raina
> Priority: Critical
> Labels: compaction, stability
> Attachments: 270.svg, 283.svg, 296.svg, 308.svg, 332.svg, 344.svg,
> fs_list.before
>
>
> In some scenarios, rowsets can accumulate a lot of data, so {{kudu-master}}
> and {{kudu-tserver}} processes grow far beyond the hard memory limit
> (controlled by the {{\-\-memory_limit_hard_bytes}} flag) when running
> CompactRowSetsOp. In some cases, a Kudu server process consumes all the
> available memory, so that the OS might invoke the OOM killer.
> At this point I'm not yet sure about the exact versions affected, and what
> leads to accumulating so much data in flushed rowsets, but I know that 1.13,
> 1.14, 1.15 and 1.16 are affected. It's also not clear whether the actual
> regression is in allowing the flushed rowsets to grow that big.
> There is a reproduction scenario for this bug with {{kudu-master}} using the
> real data from the field. With that data, {{kudu fs list}} reveals a rowset
> with many UNDOs: see the attached {{fs_list.before}} file. When starting
> {{kudu-master}} with the data, the process memory usage eventually peaked
> at about 25GByte of RSS while running CompactRowSetsOp, and then the RSS
> eventually subsided to about 200MByte once the CompactRowSetsOp
> completed.
> I also attached several SVG files generated by TCMalloc's pprof from the
> memory profile snapshots output by {{kudu-master}} when configured to dump
> allocation stats every 512 MBytes. I generated the SVG reports for profiles
> attributed to the highest memory usage:
> {noformat}
> Dumping heap profile to /opt/tmp/master/nn1/profile.0270.heap (24573 MB
> currently in use)
> Dumping heap profile to /opt/tmp/master/nn1/profile.0283.heap (64594 MB
> allocated cumulatively, 13221 MB currently in use)
> Dumping heap profile to /opt/tmp/master/nn1/profile.0296.heap (77908 MB
> allocated cumulatively, 12110 MB currently in use)
> Dumping heap profile to /opt/tmp/master/nn1/profile.0308.heap (90197 MB
> allocated cumulatively, 12406 MB currently in use)
> Dumping heap profile to /opt/tmp/master/nn1/profile.0332.heap (114775 MB
> allocated cumulatively, 23884 MB currently in use)
> Dumping heap profile to /opt/tmp/master/nn1/profile.0344.heap (127064 MB
> allocated cumulatively, 12648 MB currently in use)
> {noformat}
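> For reference, the SVG reports above can be regenerated from such heap
> snapshots with gperftools' pprof roughly as follows (the binary path is a
> placeholder):
> {noformat}
> pprof --svg /path/to/kudu-master /opt/tmp/master/nn1/profile.0270.heap > 270.svg
> {noformat}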
> The report from the compaction doesn't look like anything extraordinary
> (except for the duration):
> {noformat}
> I20221012 10:45:49.684247 101750 maintenance_manager.cc:603] P
> 68dbea0ec022440d9fc282099a8656cb:
> CompactRowSetsOp(00000000000000000000000000000000) complete. Timing: real
> 522.617s user 471.783s sys 46.588s Metrics:
> {"bytes_written":1665145,"cfile_cache_hit":846,"cfile_cache_hit_bytes":14723646,"cfile_cache_miss":1786556,"cfile_cache_miss_bytes":4065589152,"cfile_init":7,"delta_iterators_relevant":1558,"dirs.queue_time_us":220086,"dirs.run_cpu_time_us":89219,"dirs.run_wall_time_us":89163,"drs_written":1,"fdatasync":15,"fdatasync_us":150709,"lbm_read_time_us":11120726,"lbm_reads_1-10_ms":1,"lbm_reads_lt_1ms":1786583,"lbm_write_time_us":14120016,"lbm_writes_1-10_ms":3,"lbm_writes_lt_1ms":894069,"mutex_wait_us":108,"num_input_rowsets":5,"rows_written":4043,"spinlock_wait_cycles":14720,"thread_start_us":741,"threads_started":9,"wal-append.queue_time_us":307}
> {noformat}