[
https://issues.apache.org/jira/browse/KUDU-3406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17677822#comment-17677822
]
ASF subversion and git services commented on KUDU-3406:
-------------------------------------------------------
Commit ef2a5c39dff75736c11ae51a9b86ba943796e873 in kudu's branch
refs/heads/master from Alexey Serbin
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=ef2a5c39d ]
KUDU-3406 corrected estimate for ancient UNDO delta size
When looking into micro-benchmark results produced by the
$KUDU_HOME/src/kudu/scripts/benchmarks.sh script, I noticed that
dense_node-itest showed a 9-fold increase in the number of blocks under
management. Even though the test disables GC of ancient UNDO deltas
(i.e. it runs with --enable_undo_delta_block_gc=false), that's not the
expected behavior. It turned out the issue was in how
DeltaTracker::EstimateBytesInPotentiallyAncientUndoDeltas() operated: it
always treated a delta as ancient if no stats were present. So, if a
delta file was lazily loaded without its stats being read, DeltaTracker
assumed all its deltas were ancient. With the new behavior introduced
in 1556a353e, this led to rowset merge compactions skipping the newly
generated UNDO deltas, since the estimate reported 100% of those deltas
as ancient.
While this was not detected by prior testing with various real-world
scenarios involving a tangible amount of written data, tracking the
history of stats emitted by dense_node-itest made it possible to spot
the issue.
This patch addresses the issue by introducing different estimate modes
for the method mentioned above and using the proper mode in each context.
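Below is a minimal, self-contained C++ sketch of the idea behind those
modes; the names used here (DeltaStoreInfo, EstimateMode,
EstimateAncientUndoBytes) are hypothetical illustrations, not the actual
Kudu identifiers. A delta store whose stats have not been loaded is
counted as ancient only when the caller asks for an upper-bound estimate,
so a compaction code path no longer sees 100% of freshly written UNDO
deltas reported as ancient.
{noformat}
#include <cstdint>
#include <iostream>
#include <optional>
#include <vector>

// Hypothetical stand-in for a delta store: stats may not be loaded yet.
struct DeltaStoreInfo {
  std::optional<int64_t> max_timestamp;  // absent if stats were lazily skipped
  int64_t size_bytes;
};

// Hypothetical analogue of the "estimate mode" idea.
enum class EstimateMode {
  // Count stores with missing stats as ancient: a safe upper bound,
  // e.g. for deciding whether ancient-UNDO GC might reclaim anything.
  kOverestimate,
  // Count only stores whose stats prove they are ancient: a lower bound,
  // e.g. so merge compactions don't skip freshly written UNDO deltas.
  kUnderestimate,
};

int64_t EstimateAncientUndoBytes(const std::vector<DeltaStoreInfo>& stores,
                                 int64_t ancient_history_mark,
                                 EstimateMode mode) {
  int64_t total = 0;
  for (const auto& s : stores) {
    if (!s.max_timestamp.has_value()) {
      // Before the fix, this case was unconditionally treated as "ancient",
      // inflating the estimate to 100% for lazily loaded delta files.
      if (mode == EstimateMode::kOverestimate) total += s.size_bytes;
      continue;
    }
    if (*s.max_timestamp < ancient_history_mark) total += s.size_bytes;
  }
  return total;
}

int main() {
  std::vector<DeltaStoreInfo> stores = {
      {std::nullopt, 1000},  // stats not loaded yet
      {5, 2000},             // provably ancient (older than the mark)
      {50, 3000},            // recent
  };
  const int64_t mark = 10;
  std::cout << EstimateAncientUndoBytes(stores, mark,
                                        EstimateMode::kOverestimate)
            << "\n";  // 3000: includes the store with unknown stats
  std::cout << EstimateAncientUndoBytes(stores, mark,
                                        EstimateMode::kUnderestimate)
            << "\n";  // 2000: only provably ancient bytes
  return 0;
}
{noformat}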
I verified that with this patch applied, the benchmark based on
dense_node-itest again reports the number of blocks it has been
reporting for most of its history, so I'm not adding any new tests
with this patch.
This is a follow-up to 1556a353e60c5d555996347cbd46d5e5a6661266.
Change-Id: I17bddae86f84792caf14fb1e11a6e1c0d7a92b56
Reviewed-on: http://gerrit.cloudera.org:8080/19413
Tested-by: Kudu Jenkins
Reviewed-by: Attila Bukor <[email protected]>
> CompactRowSetsOp can allocate much more memory than specified by the hard
> memory limit
> --------------------------------------------------------------------------------------
>
> Key: KUDU-3406
> URL: https://issues.apache.org/jira/browse/KUDU-3406
> Project: Kudu
> Issue Type: Bug
> Components: master, tserver
> Affects Versions: 1.10.0, 1.10.1, 1.11.0, 1.12.0, 1.11.1, 1.13.0, 1.14.0,
> 1.15.0, 1.16.0
> Reporter: Alexey Serbin
> Assignee: Ashwani Raina
> Priority: Critical
> Labels: compaction, stability
> Fix For: 1.17.0
>
> Attachments: 270.svg, 283.svg, 296.svg, 308.svg, 332.svg, 344.svg,
> fs_list.before
>
>
> In some scenarios, rowsets can accumulate a lot of data, so {{kudu-master}}
> and {{kudu-tserver}} processes grow far beyond the hard memory limit
> (controlled by the {{\-\-memory_limit_hard_bytes}} flag) when running
> CompactRowSetsOp. In some cases, a Kudu server process consumes all the
> available memory, so that the OS might invoke the OOM killer.
> At this point I'm not yet sure about the exact versions affected, or about
> what leads to accumulating so much data in flushed rowsets, but I know that
> 1.13, 1.14, 1.15 and 1.16 are affected. It's also not clear whether the
> actual regression is in allowing the flushed rowsets to grow that big.
> There is a reproduction scenario for this bug with {{kudu-master}} using
> real data from the field. With that data, {{kudu fs list}} reveals a rowset
> with many UNDOs: see the attached {{fs_list.before}} file. When starting
> {{kudu-master}} with that data, the process memory usage peaked at about
> 25 GBytes of RSS while running CompactRowSetsOp, and then the RSS subsided
> to about 200 MBytes once the CompactRowSetsOp completed.
> I also attached several SVG files generated by TCMalloc's pprof from the
> memory profile snapshots output by {{kudu-master}} when configured to dump
> allocation stats every 512 MBytes. I generated the SVG reports for the
> profiles with the highest memory usage:
> {noformat}
> Dumping heap profile to /opt/tmp/master/nn1/profile.0270.heap (24573 MB
> currently in use)
> Dumping heap profile to /opt/tmp/master/nn1/profile.0283.heap (64594 MB
> allocated cumulatively, 13221 MB currently in use)
> Dumping heap profile to /opt/tmp/master/nn1/profile.0296.heap (77908 MB
> allocated cumulatively, 12110 MB currently in use)
> Dumping heap profile to /opt/tmp/master/nn1/profile.0308.heap (90197 MB
> allocated cumulatively, 12406 MB currently in use)
> Dumping heap profile to /opt/tmp/master/nn1/profile.0332.heap (114775 MB
> allocated cumulatively, 23884 MB currently in use)
> Dumping heap profile to /opt/tmp/master/nn1/profile.0344.heap (127064 MB
> allocated cumulatively, 12648 MB currently in use)
> {noformat}
> The report from the compaction doesn't show anything extraordinary
> (except for the duration):
> {noformat}
> I20221012 10:45:49.684247 101750 maintenance_manager.cc:603] P
> 68dbea0ec022440d9fc282099a8656cb:
> CompactRowSetsOp(00000000000000000000000000000000) complete. Timing: real
> 522.617s user 471.783s sys 46.588s Metrics:
> {"bytes_written":1665145,"cfile_cache_hit":846,"cfile_cache_hit_bytes":14723646,"cfile_cache_miss":1786556,"cfile_cache_miss_bytes":4065589152,"cfile_init":7,"delta_iterators_relevant":1558,"dirs.queue_time_us":220086,"dirs.run_cpu_time_us":89219,"dirs.run_wall_time_us":89163,"drs_written":1,"fdatasync":15,"fdatasync_us":150709,"lbm_read_time_us":11120726,"lbm_reads_1-10_ms":1,"lbm_reads_lt_1ms":1786583,"lbm_write_time_us":14120016,"lbm_writes_1-10_ms":3,"lbm_writes_lt_1ms":894069,"mutex_wait_us":108,"num_input_rowsets":5,"rows_written":4043,"spinlock_wait_cycles":14720,"thread_start_us":741,"threads_started":9,"wal-append.queue_time_us":307}
> {noformat}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)