[
https://issues.apache.org/jira/browse/KUDU-3195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexey Serbin updated KUDU-3195:
--------------------------------
Code Review: https://gerrit.cloudera.org/#/c/16581/
> Make DMS flush policy more robust when maintenance threads are idle
> -------------------------------------------------------------------
>
> Key: KUDU-3195
> URL: https://issues.apache.org/jira/browse/KUDU-3195
> Project: Kudu
> Issue Type: Improvement
> Components: tserver
> Affects Versions: 1.13.0
> Reporter: Alexey Serbin
> Priority: Major
>
> In one scenario I observed very long bootstrap times of tablet servers
> (somewhere between 45 and 60 minutes) even though the tablet servers had
> a relatively small amount of data under management (~80 GByte). It turned out
> the time was spent replaying WAL segments, with {{kudu cluster ksck}}
> reporting something like the following throughout the bootstrap:
> {noformat}
> b0a20b117a1242ae9fc15620a6f7a524 (tserver-6.local.site:7050): not running
> State: BOOTSTRAPPING
> Data state: TABLET_DATA_READY
> Last status: Bootstrap replaying log segment 21/37 (2.28M/7.85M this segment, stats: ops{read=27374 overwritten=0 applied=25016 ignored=657} inserts{seen=5949247 ignored=0} mutations{seen=0 ignored=0} orphaned_commits=7)
> {noformat}
> The workload I had run before shutting down the tablet servers consisted of many
> small UPSERT operations, but the cluster had been idle for a long time (a few
> hours or so) after the workload was terminated. The workload was generated by:
> {noformat}
> kudu perf loadgen \
> --table_name=$TABLE_NAME \
> --num_rows_per_thread=800000000 \
> --num_threads=4 \
> --use_upsert \
> --use_random_pk \
> $MASTER_ADDR
> {noformat}
> The table that the UPSERT workload was running against had been pre-populated
> by the following:
> {noformat}
> kudu perf loadgen --table_num_replicas=3 --keep_auto_table \
> --table_num_hash_partitions=5 --table_num_range_partitions=5 \
> --num_rows_per_thread=800000000 --num_threads=4 $MASTER_ADDR
> {noformat}
> As it turned out, the tablet servers had accumulated a huge number of DMSes
> that required flushing/compaction, but after the memory pressure subsided, the
> compaction policy was scheduling just one flush operation per tablet every 120
> seconds (that interval is controlled by {{\-\-flush_threshold_secs}}).
> In fact, the tablet servers could have flushed those rowsets non-stop, since the
> maintenance threads were otherwise completely idle and there was no active
> workload running against the cluster. Those DMSes had been around for a long
> time (much longer than 120 seconds) and were anchoring a lot of WAL segments,
> so the operations from the WAL had to be replayed once I restarted the
> tablet servers.
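> To put rough numbers on it (a back-of-envelope sketch only; the function names
> below are made up for illustration, and it assumes each scheduled flush retires
> roughly one DMS): with at most one flush per tablet every 120 seconds, a tablet
> with dozens of aged DMSes needs on the order of an hour of wall-clock time to
> drain its backlog, and the oldest unflushed DMS keeps its WAL segments anchored
> for that entire time:
> {noformat}
> #include <algorithm>
> #include <cstdint>
>
> // Hypothetical illustration only (not Kudu code): if each scheduled
> // FlushDeltaMemStoresOp retires roughly one DMS and at most one such op is
> // scheduled per tablet every flush_threshold_secs, a backlog of N unflushed
> // DMSes in a tablet takes about N * flush_threshold_secs to drain, even on
> // an otherwise idle server.
> int64_t SecondsToDrainDmsBacklog(int64_t num_unflushed_dms,
>                                  int64_t flush_threshold_secs) {
>   return num_unflushed_dms * flush_threshold_secs;
> }
>
> // A WAL segment cannot be GCed while it still contains operations that an
> // unflushed DMS depends on, so the oldest unflushed DMS anchors every segment
> // from the one holding its earliest un-persisted op up to the newest one.
> int64_t AnchoredWalSegments(int64_t newest_segment_idx,
>                             int64_t oldest_anchored_segment_idx) {
>   return std::max<int64_t>(
>       0, newest_segment_idx - oldest_anchored_segment_idx + 1);
> }
> {noformat}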
> It would be great to update the flushing/compaction policy to allow tablet
> servers to run {{FlushDeltaMemStoresOp}} as soon as a DMS becomes older than
> the threshold specified by {{\-\-flush_threshold_secs}} when the maintenance
> threads are not otherwise busy.
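> One possible shape for such a policy (a minimal sketch only, not the actual
> maintenance-manager code; the function and parameter names are invented for
> illustration): rather than treating {{\-\-flush_threshold_secs}} purely as a
> pacing interval, let the perf-improvement score of {{FlushDeltaMemStoresOp}}
> keep growing with the age of the oldest DMS once it crosses the threshold, so
> an otherwise idle maintenance pool keeps picking these ops back-to-back until
> the backlog is gone:
> {noformat}
> #include <cstdint>
>
> // Hypothetical scoring sketch (not Kudu's actual heuristic): returns a
> // perf-improvement score for flushing a tablet's delta memstores. The score
> // is zero below the age threshold and ramps linearly with age above it, so
> // aged DMSes keep winning the scheduling race whenever the maintenance
> // threads have spare capacity. Assumes flush_threshold_secs > 0.
> double DmsFlushPerfScore(int64_t oldest_dms_age_secs,
>                          int64_t flush_threshold_secs) {
>   if (oldest_dms_age_secs < flush_threshold_secs) {
>     return 0.0;
>   }
>   // 1.0 right at the threshold, growing as the oldest DMS gets older.
>   return static_cast<double>(oldest_dms_age_secs) / flush_threshold_secs;
> }
> {noformat}
> With a score like this, a server that goes idle after a burst of UPSERTs would
> flush its aged DMSes continuously instead of once per tablet per 120 seconds,
> releasing the anchored WAL segments long before the next restart.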
--
This message was sent by Atlassian Jira
(v8.3.4#803005)