devabhishekpal commented on code in PR #9793:
URL: https://github.com/apache/ozone/pull/9793#discussion_r2854616371

##########
hadoop-hdds/docs/content/design/mpu-gc-optimization.md:
##########
@@ -0,0 +1,640 @@
+---
+title: Multipart Upload GC Pressure Optimizations
+summary: Change Multipart Upload Logic to improve OM GC Pressure
+date: 2026-02-19
+jira: HDDS-10611
+status: proposed
+author: Abhishek Pal, Rakesh Radhakrishnan
+---
+<!--
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+# Ozone MPU Optimization - Design Doc
+
+
+## Table of Contents
+1. [Motivation](#1-motivation)
+2. [Proposal](#2-proposal)
+    * [Split-table design (V2)](#split-table-design-v2)
+    * [Comparison: V1 (legacy) vs V2](#comparison-v1-legacy-vs-v2)
+    * [2.1 Data Layout Changes](#21-data-layout-changes)
+    * [2.2 MPU Flow Changes](#22-mpu-flow-changes)
+    * [2.3 Summary and Trade-offs](#23-summary-and-trade-offs)
+3. [Upgrades](#3-upgrades)
+4. [Industry Patterns](#4-industry-patterns-flattened-keys-in-lsmrocksdb-systems)
+---
+
+## 1. Motivation
+Presently Ozone has several overheads when uploading large files via Multipart upload (MPU). This document presents a detailed design for optimizing the MPU storage layout to reduce these overheads.
+
+### Problem with the current MPU schema
+**Current design:**
+* One row per MPU: `key = /{vol}/{bucket}/{key}/{uploadId}`
+* Value = full `OmMultipartKeyInfo` with all parts inline.
+
+**Implications:**
+1. Each MPU part commit reads the full `OmMultipartKeyInfo`, deserializes it, adds one part, serializes it, and writes it back (HDDS-10611).
+2. RocksDB WAL logs each full write → WAL growth (HDDS-8238).
+3. GC pressure grows with the size of the object (HDDS-10611).
+
+#### a) Deserialization overhead
+| Operation     | Current                                                 |
+|:--------------|:--------------------------------------------------------|
+| Commit part N | Read + deserialize whole OmMultipartKeyInfo (N-1 parts) |
+
+#### b) WAL overhead
+Assuming one MPU part info object takes ~1.5KB.
+
+| Scenario    | Current WAL                     |
+|:------------|:--------------------------------|
+| 1,000 parts | ~733 MB (1+2+...+1000) × 1.5 KB |
+
+#### c) GC pressure
+Current: Large short-lived objects per part commit.
+
+#### Existing Storage Layout Overview
+```protobuf
+MultipartKeyInfo {
+  uploadID : string
+  creationTime : uint64
+  type : ReplicationType
+  factor : ReplicationFactor (optional)
+  partKeyInfoList : repeated PartKeyInfo ← grows with each part
+  objectID : uint64 (optional)
+  updateID : uint64 (optional)
+  parentID : uint64 (optional)
+  ecReplicationConfig : optional
+}
+```
+
+---
+
+## 2. Proposal
+The idea is to split the content of `MultipartInfoTable`. Part information will be stored separately in a flattened schema (one row per part) instead of one giant object.
+
+### Split-table design (V2)
+Split MPU metadata into:
+* **Metadata table:** Lightweight per-MPU metadata (no part list).
+* **Parts table:** One row per part (flat structure).
+
+**New MultipartPartInfo Structure:**

Review Comment:
   I think the only possible candidate for deprecation here is `partKeyInfoList`. I have marked it as deprecated in the latest commit, and also pulled the common metadata out of the parts information message (MultipartPartInfo).

   I think the checksum and file encryption info are required on every individual part, since the S3 spec requires each part to carry encryption info (especially for customer-provided keys, where encryption must be provided for every part.
   Default server-side encryption is the only case where S3 sets this information in the create request and the same is applied to every UploadPart request). Similarly, the checksum is also required on every part.

   The `dataSize` is used in the overwritten-part scenario, since any part can be overwritten. In that case we need to know the current part size, subtract it, and then add the new part size. If we pull this up to the metadata, we lose the size of the existing part, which we need when that part is overwritten.
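
   The overwrite accounting described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class `MpuSizeSketch`, the method `commitPart`, and the idea of keeping a running total next to a per-part size map are hypothetical names and structure, not taken from the Ozone code base.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of per-part size accounting; not the actual OM code. */
class MpuSizeSketch {
  // partNumber -> dataSize of the currently committed version of that part
  private final Map<Integer, Long> partSizes = new HashMap<>();
  // Assumed running total of all committed part sizes for this upload
  private long totalSize;

  /** Commits (or overwrites) a part and returns the updated running total. */
  long commitPart(int partNumber, long newSize) {
    Long oldSize = partSizes.put(partNumber, newSize);
    if (oldSize != null) {
      // Overwrite: the replaced part's size must be known to subtract it.
      totalSize -= oldSize;
    }
    totalSize += newSize;
    return totalSize;
  }

  long totalSize() {
    return totalSize;
  }

  public static void main(String[] args) {
    MpuSizeSketch mpu = new MpuSizeSketch();
    mpu.commitPart(1, 5L);
    mpu.commitPart(2, 7L);
    mpu.commitPart(1, 3L); // re-upload of part 1 with a smaller payload
    System.out.println(mpu.totalSize()); // prints 10
  }
}
```

   Without a size stored per part, the subtraction in `commitPart` would have nothing to work from, which is the reason for keeping `dataSize` on the part row rather than only in the per-upload metadata.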
