Mike Percy has submitted this change and it was merged.

Change subject: KUDU-1601. Delete ancient UNDO delta blocks in the background
......................................................................


KUDU-1601. Delete ancient UNDO delta blocks in the background

This patch adds a maintenance manager background task that deletes
"ancient" UNDO delta blocks, that is, blocks whose data is considered
no longer reachable and is therefore a candidate for garbage
collection. The task deletes only entire blocks, so it does not cause
write amplification.

This maintenance task operates in the following way:

1. UpdateStats() returns the maximum potentially gc'able bytes of undos
   in the rowset, which is the sum of all undo delta store sizes up
   until an initialized one with max_timestamp > the AHM (ancient
   history mark). The estimate becomes more accurate over time and is
   exact in a steady state, because Perform() initializes undo delta
   blocks as it runs.

2. Perform() initializes undo delta stores for the tablet for some
   budgeted amount of time. Per rowset it initializes undo delta stores
   with its budget until it finds the earliest one with max_timestamp >
   AHM. That makes the next UpdateStats() call more accurate. Once it
   has exhausted its time budget, or has initialized all ancient undo
   blocks, it garbage-collects all of the known ancient undo delta
   blocks in the tablet.
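
The two steps above can be sketched for a single rowset as follows. This
is only an illustration under stated assumptions, not the code in this
patch: UndoStore, EstimateGcableBytes(), InitAndCollect(), and the
load_max_ts callback are hypothetical names, and the stores are assumed
to be ordered oldest first.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical stand-in for an UNDO delta store (not Kudu's class).
struct UndoStore {
  int64_t size_bytes;
  // Empty until the store has been initialized.
  std::optional<int64_t> max_timestamp;
};

using Clock = std::chrono::steady_clock;

// Step 1: upper bound on GC-able bytes. Uninitialized stores are counted
// because their max_timestamp is not known yet. Assumes oldest-first order.
int64_t EstimateGcableBytes(const std::vector<UndoStore>& stores, int64_t ahm) {
  int64_t bytes = 0;
  for (const UndoStore& s : stores) {
    if (s.max_timestamp && *s.max_timestamp > ahm) break;
    bytes += s.size_bytes;
  }
  return bytes;
}

// Step 2: initialize stores until the deadline passes or the first
// non-ancient store is found, then drop every store known to be ancient.
// load_max_ts(i) is a hypothetical callback that reads store i's
// max_timestamp. Returns the number of bytes freed.
int64_t InitAndCollect(std::vector<UndoStore>& stores, int64_t ahm,
                       Clock::time_point deadline,
                       const std::function<int64_t(std::size_t)>& load_max_ts) {
  assert(static_cast<bool>(load_max_ts));
  std::size_t num_ancient = 0;
  for (std::size_t i = 0; i < stores.size(); ++i) {
    UndoStore& s = stores[i];
    if (!s.max_timestamp) {
      if (Clock::now() >= deadline) break;  // time budget exhausted
      s.max_timestamp = load_max_ts(i);
    }
    if (*s.max_timestamp > ahm) break;  // earliest non-ancient store
    ++num_ancient;
  }
  int64_t freed = 0;
  for (std::size_t i = 0; i < num_ancient; ++i) freed += stores[i].size_bytes;
  stores.erase(stores.begin(),
               stores.begin() + static_cast<std::ptrdiff_t>(num_ancient));
  return freed;
}
```

In this sketch, a store that was initialized but found to be non-ancient
stops the next estimate early, which is how the estimate tightens after
each pass.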

To avoid starvation of performance improvement maintenance ops, a new
flag named --data_gc_prioritization_prob has been introduced that
incorporates some randomness into the scheduler at the maintenance
manager level. The flag controls the fraction of the time that the
scheduler considers data GC ops higher priority than performance
improvement ops.
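
One way to picture the randomized prioritization is the sketch below. It
is an assumption about the general technique, not the scheduler code in
this patch: OpClass and PickPreferredClass() are hypothetical names, and
the probability argument stands in for the flag value.

```cpp
#include <cassert>
#include <random>

// Hypothetical classification of maintenance ops (illustration only).
enum class OpClass { kDataGc, kPerfImprovement };

// Decides which class of op is preferred for one scheduling decision.
// With probability data_gc_prioritization_prob, data GC ops win;
// otherwise performance improvement ops win.
template <typename Rng>
OpClass PickPreferredClass(double data_gc_prioritization_prob, Rng& rng) {
  assert(data_gc_prioritization_prob >= 0.0 &&
         data_gc_prioritization_prob <= 1.0);
  std::bernoulli_distribution prefer_data_gc(data_gc_prioritization_prob);
  return prefer_data_gc(rng) ? OpClass::kDataGc : OpClass::kPerfImprovement;
}
```

Drawing a fresh random value per decision means that neither class of op
can permanently shut out the other unless the probability is set to
exactly 0 or 1.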

This patch includes the following:

* New UNDO delta block GC MM task
* New UNDO delta block GC metrics (at the tablet level only)
* Flags to enable / disable the GC task as well as flags to throttle it
* A few minor improvements in the maintenance manager
* Fixes for a few preexisting clang-tidy lint complaints

Notable implementation details:

* When performing undo delta GC in Tablet::DeleteAncientUndoDeltas(), we
  only flush the tablet metadata after making the metadata changes
  across all rowsets. This is safe because we are not actually modifying
  any data; we are simply removing references to blocks that are no
  longer reachable by new scanners. The code path that handles the
  metadata update for compactions and ancient history data GC,
  DeltaTracker::CommitDeltaStoreMetadataUpdate(), has a DCHECK in place
  to ensure that it is never called without specifying blocks to remove.
  This guarantees that the DeltaMemStore flush code path located in
  DeltaTracker::FlushDMS(), the only delta-related code path that
  modifies user-visible data, does not utilize that routine for its
  flush. This fact was also verified by inspection -- FlushDMS()
  contains its own flush code path.
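
The "update every rowset, then flush once" pattern described above can be
sketched as follows. All types and functions here (RowSetMeta, TabletMeta,
DeleteAncientUndoBlocks()) are hypothetical simplifications, not Kudu's
classes; the assert plays the role of the DCHECK mentioned above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using BlockId = int64_t;

// Hypothetical per-rowset metadata: just the list of undo block ids.
struct RowSetMeta {
  std::vector<BlockId> undo_blocks;

  // Removes references to the given blocks. Like the DCHECK described
  // above, this must never be called with nothing to remove.
  void RemoveUndoBlocks(const std::vector<BlockId>& to_remove) {
    assert(!to_remove.empty());
    undo_blocks.erase(
        std::remove_if(undo_blocks.begin(), undo_blocks.end(),
                       [&](BlockId b) {
                         return std::find(to_remove.begin(), to_remove.end(),
                                          b) != to_remove.end();
                       }),
        undo_blocks.end());
  }
};

// Hypothetical tablet metadata that counts how often it is flushed.
struct TabletMeta {
  std::vector<RowSetMeta> rowsets;
  int flush_count = 0;
  void Flush() { ++flush_count; }
};

// ancient[i] lists the block ids to drop from rowset i. Metadata for all
// rowsets is updated in memory first and flushed a single time at the end.
void DeleteAncientUndoBlocks(
    TabletMeta& tablet, const std::vector<std::vector<BlockId>>& ancient) {
  assert(ancient.size() == tablet.rowsets.size());
  bool changed = false;
  for (std::size_t i = 0; i < ancient.size(); ++i) {
    if (ancient[i].empty()) continue;  // nothing to remove for this rowset
    tablet.rowsets[i].RemoveUndoBlocks(ancient[i]);
    changed = true;
  }
  if (changed) tablet.Flush();
}
```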

Includes the following tests:

* RowSet-level unit test in diskrowset-test
* Tablet-level functional test in tablet_history_gc-test
* Tablet-level concurrency test in mt-tablet-test
* Integration test utilizing the tserver-level MM task in
  tablet_history_gc-itest
* Incorporated into RandomizedTabletHistoryGcITest in
  tablet_history_gc-itest

Manual testing:

* I ran 300 iterations of TabletHistoryGcITest.TestUndoDeltaBlockGc on
  the dist-test cluster under TSAN with 12 stress threads:
  http://dist-test.cloudera.org/job?job_id=mpercy.1487901212.3733

* I also ran YCSB on a 10-node cluster on a table with 200 tablets with
  mostly default parameters except for --tablet_history_max_age_sec=60.
  YCSB was configured like so:

    recordcount=100000
    operationcount=6000000
    updateproportion=1.0
    requestdistribution=zipfian
    threadcount=10
    kudu_pre_split_num_tablets=200
    kudu_sync_ops=true

  This workload took 839 seconds to run and I did not observe an average
  update latency increase over time (there was a mild sawtooth pattern),
  which indicated to me that the compaction operations were keeping up
  with the updates. The undo delta GC operations were also keeping pace
  and garbage was being collected aggressively, with generally only tens
  of MB, or less, of reclaimable data per tablet being present at any
  given time. It seems the current defaults are reasonable, although
  additional performance testing is likely warranted.

Change-Id: I0309bf7acfb6d018860c80f354012c3500da5c68
Reviewed-on: http://gerrit.cloudera.org:8080/4363
Tested-by: Mike Percy <mpe...@apache.org>
Reviewed-by: David Ribeiro Alves <dral...@apache.org>
---
M src/kudu/integration-tests/tablet_history_gc-itest.cc
M src/kudu/master/master.cc
M src/kudu/tablet/delta_tracker.cc
M src/kudu/tablet/delta_tracker.h
M src/kudu/tablet/deltafile.cc
M src/kudu/tablet/deltafile.h
M src/kudu/tablet/diskrowset-test.cc
M src/kudu/tablet/diskrowset.cc
M src/kudu/tablet/diskrowset.h
M src/kudu/tablet/memrowset.h
M src/kudu/tablet/mock-rowsets.h
M src/kudu/tablet/mt-tablet-test.cc
M src/kudu/tablet/rowset.h
M src/kudu/tablet/rowset_metadata.cc
M src/kudu/tablet/rowset_metadata.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_history_gc-test.cc
M src/kudu/tablet/tablet_metrics.cc
M src/kudu/tablet/tablet_metrics.h
M src/kudu/tablet/tablet_mm_ops.cc
M src/kudu/tablet/tablet_mm_ops.h
M src/kudu/tserver/tablet_server.cc
M src/kudu/util/maintenance_manager-test.cc
M src/kudu/util/maintenance_manager.cc
M src/kudu/util/maintenance_manager.h
26 files changed, 1,352 insertions(+), 143 deletions(-)

Approvals:
  David Ribeiro Alves: Looks good to me, approved
  Mike Percy: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/4363

Gerrit-MessageType: merged
Gerrit-Change-Id: I0309bf7acfb6d018860c80f354012c3500da5c68
Gerrit-PatchSet: 21
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dral...@apache.org>
Gerrit-Reviewer: Jean-Daniel Cryans <jdcry...@apache.org>
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Tidy Bot
