Re: [PR] HDDS-10611. Design document for MPU GC Optimization [ozone]

via GitHub Sat, 21 Feb 2026 01:17:50 -0800


ivandika3 commented on code in PR #9793:
URL: https://github.com/apache/ozone/pull/9793#discussion_r2835983826



##########
hadoop-hdds/docs/content/design/mpu-gc-optimization.md:
##########
@@ -0,0 +1,336 @@
+---
+title: Multipart Upload GC Pressure Optimizations
+summary: Change Multipart Upload Logic to improve OM GC Pressure
+date: 2026-02-19
+jira: HDDS-10611
+status: implemented
+author: Abhishek Pal, Rakesh Radhakrishnan
+---
+<!--
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+# Ozone MPU Optimization - Design Doc
+
+
+## Table of Contents
+1. [Motivation](#1-motivation)
+2. [Proposal](#2-proposal)
+* [Backward Compatibility](#backward-compatibility)
+* [Split-table design (V1)](#split-table-design-v1)
+* [Comparison: V0 (legacy) vs V1](#comparison-v0-legacy-vs-v1)
+* [2.1 Approach-1: Reuse multipartInfoTable with empty part 
list](#21-approach-1--reuse-multipartinfotable-with-empty-part-list)
+* [2.2 Approach-2: Introduce new 
multipartMetadataTable](#22-approach-2--introduce-new-multipartmetadatatable)
+* [2.3 Summary](#23-summary)
+* [2.4 Chosen Approach: Approach-1](#24-chosen-approach-approach-1)
+3. [Upgrades](#3-upgrades)
+4. [Benchmarking and Performance](#4-benchmarking-and-performance)
+5. [Open Questions](#5-open-questions)
+
+---
+
+## 1. Motivation
+Presently Ozone has several overheads when uploading large files via Multipart 
upload (MPU). This document presents a detailed design for optimizing the MPU 
storage layout to reduce these overheads.
+
+### Problem with the current MPU schema
+**Current design:**
+* One row per MPU: `key = /{vol}/{bucket}/{key}/{uploadId}`
+* Value = full `OmMultipartKeyInfo` with all parts inline.
+
+**Implications:**
+1. Each MPU part commit reads the full `OmMultipartKeyInfo`, deserializes it, 
adds one part, serializes it, and writes it back (HDDS-10611).
+2. RocksDB WAL logs each full write → WAL growth (HDDS-8238).
+3. GC pressure grows with the size of the object (HDDS-10611).
+
+#### a) Deserialization overhead
+| Operation     | Current                                                 |
+|:--------------|:--------------------------------------------------------|
+| Commit part N | Read + deserialize whole OmMultipartKeyInfo (N-1 parts) |
+
+#### b) WAL overhead
+Assuming one MPU part info object takes ~1.5KB.
+
+| Scenario    | Current WAL                     |
+|:------------|:--------------------------------|
+| 1,000 parts | ~733 MB (1+2+...+1000) × 1.5 KB |
+
+#### c) GC pressure
+Current: Large short-lived objects per part commit.
+
+#### Existing Storage Layout Overview
+```protobuf
+MultipartKeyInfo {
+  uploadID : string
+  creationTime : uint64
+  type : ReplicationType
+  factor : ReplicationFactor (optional)
+  partKeyInfoList : repeated PartKeyInfo ← grows with each part
+  objectID : uint64 (optional)
+  updateID : uint64 (optional)
+  parentID : uint64 (optional)
+  ecReplicationConfig : optional
+}
+```
+
+---
+
+## 2. Proposal
+The idea is to split the content of `MultipartInfoTable`. Part information 
will be stored separately in a flattened schema (one row per part) instead of 
one giant object.

Review Comment:
   We can add additional sections at the end regarding other systems (industry 
practice) that having a flattened schema on RocksDB is a common practice in 
RocksDB. For example, in the MVCC setup in CockroachDB or TiKV, usually the key 
has the suffix about the timestamp or version.
   
   In the future, what we learnt from this feature can be applied when we want 
to support S3 versioning (which should also have a flat schema).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] HDDS-10611. Design document for MPU GC Optimization [ozone]

Reply via email to