the-other-tim-brown commented on code in PR #18013:
URL: https://github.com/apache/hudi/pull/18013#discussion_r2771806450


##########
rfc/rfc-100/rfc-100.md:
##########
@@ -63,153 +63,129 @@ column that already exists.
 
 ### Building on Existing Foundation
 
-This RFC leverages two key foundation pieces:
+This RFC leverages the following foundation piece:
 
-1. **RFC-80 Column Groups**: Provides the mechanism to split file groups 
across different column groups, enabling efficient storage of different data 
types within the same logical file group.
-
-2. **RFC-99 BLOB Types**: Introduces BINARY and LARGE_BINARY types to the Hudi 
type system, providing the type foundation for unstructured data storage.
+1. **RFC-99 BLOB Types**: Introduces BLOB types to the Hudi type system, providing the type foundation for unstructured data storage.
 
 ## Requirements 
 
 Below are the high-level requirements for this feature.
 
 1. Users must be able to define tables with a mix of structured (current 
types) and unstructured (blob type)
    columns
-2. Records are distributed across file groups like regular Hudi storage layout 
into file groups. But within each
-   file group, structured and unstructured columns are split into different 
column groups. This way the table can
-   also scalably grow in terms of number of columns.
-3. Unstructured data can be stored inline (e.g small images right inside the 
column group) or out-of-line (e.g
-   pointer to a multi-GB video file someplace). This decision should be made 
dynamically during write/storage time.
+2. Unstructured data can be stored inline (e.g. small images right inside the file) or out-of-line (e.g. a
+   pointer to a multi-GB video file elsewhere). Out-of-line references can also include a position within the file,
+   which allows multiple blobs to be stored within a single file to reduce the number of files in storage.
+   The inline/out-of-line decision should be made dynamically at write/storage time.
 3. All table life-cycle operations and table services work seamlessly across both column types. For example, cleaning
-   the file slices should reclaim both inline and out-of-line blob data. 
Clustering should be able re-organize
-   records across file groups or even redistribute columns across column 
groups within the same file group.
-4. Storage should support different column group distributions i.e different 
membership of columns
-   across column groups, across file groups, to ensure users or table services 
can flexibly reconfigure all this as
-   table grows, without re-writing all of the data.
-5. Hudi should expose controls at the writer level, to control whether new 
columns are written to new column
-   groups or expand an existing column group within a file group.
+   the file slices should reclaim out-of-line blob data when the reference is managed by Hudi.
+   Clustering should be able to re-organize records across file groups or even repack blobs if required.
+4. Hudi should expose writer-level controls that determine whether blobs are stored inline or out-of-line,
+   based on size thresholds.
+5. Query engines like Spark should be able to read the unstructured data and 
materialize the values lazily to reduce memory pressure during shuffles.
 
 
 ## High-Level Design
 
-The design introduces a hybrid storage model where each file group can contain 
multiple column groups with different file formats optimized for their data 
types. Structured columns continue using 
-Parquet format, while unstructured columns can use specialized formats like 
Lance or optimized Parquet configurations or HFile for random-access.
-
-### 1. Mixed Base File Format Support
-
-**Per-Column Group Format Selection**: Each column group within a file group 
can use different base file formats:
-- **Structured Column Groups**: Continue using Parquet with standard 
optimizations
-- **Unstructured Column Groups**: Use Lance format for vector/embedding data 
or specially configured Parquet for BLOB storage
-
-**Format Configuration**: File format is determined at column group creation 
time based on (per the current RFC-80). 
-But, ideally all these configurations should be automatic and Hudi should 
auto-generate colum group names and mappings.
-
-
-```sql
-CREATE TABLE multimedia_catalog (
-  id BIGINT,
-  product_name STRING,
-  category STRING,
-  image BINARY,
-  video LARGE_BINARY,
-  embeddings ARRAY<FLOAT>
-) USING HUDI
-TBLPROPERTIES (
-  'hoodie.table.type' = 'MERGE_ON_READ',
-  'hoodie.bucket.index.hash.field' = 'id',
-  'hoodie.columngroup.structured' = 'id,product_name,category;id',
-  'hoodie.columngroup.images' = 'id,image;id',
-  'hoodie.columngroup.videos' = 'id,video;id',
-  'hoodie.columngroup.ml' = 'id,embeddings;id',
-  'hoodie.columngroup.images.format' = 'parquet',
-  'hoodie.columngroup.videos.format' = 'lance',
-  'hoodie.columngroup.ml.format' = 'hfile'
-)
-```
-
-### 2. Dynamic Inline/Out-of-Line Storage
+The design introduces an abstraction that allows inline and out-of-line storage of byte arrays and works seamlessly for the end user. Structured columns continue using
+Parquet format, while unstructured data can use specialized formats such as Lance, optimized Parquet configurations, or HFile for random access.
 
-**Storage Decision Logic**: During write time, Hudi determines storage 
strategy based on:
-- **Inline Storage**: BLOB data < 1MB stored directly in the column group 
file, to avoid excessive cloud storage API calls.
-- **Out-of-Line Storage**: Large BLOB data stored in dedicated object store 
locations with pointers in the main file, to avoid write amplification during 
updates.
+### 1. Storage Abstraction
+We will add a Blob type to the HoodieSchema that encapsulates both inline and out-of-line storage strategies, allowing users to mix the two seamlessly.
 
-
-**Storage Pointer Schema**:
+**Storage Schema**:
 ```json
 {
   "type": "record",
-  "name": "BlobPointer",
+  "name": "Blob",
   "fields": [
     {"name": "storage_type", "type": "string"},
-    {"name": "size", "type": "long"},
-    {"name": "checksum", "type": "string"},
-    {"name": "external_path", "type": ["null", "string"]},
-    {"name": "compression", "type": ["null", "string"]}
+    {"name": "data", "type": ["null", "bytes"]},
+    {"name": "reference", "type": ["null", {
+      "type": "record",
+      "name": "BlobReference",
+      "fields": [
+        {"name": "external_path", "type": "string"},
+        {"name": "position", "type": "long"},
+        {"name": "size", "type": "long"},
+        {"name": "managed", "type": "boolean"}
+      ]
+    }]}
   ]
 }
 ```
+The `managed` flag will be used by the cleaner to determine if an out-of-line 
blob should be deleted when cleaning up old file slices. This allows users to 
point to existing files without Hudi deleting them.
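+
+For illustration, an inline blob and an unmanaged out-of-line reference could be represented as follows (values are hypothetical; `data` is shown as base64):
+```json
+{"storage_type": "inline", "data": "iVBORw0KGgo=", "reference": null}
+{"storage_type": "out_of_line", "data": null,
+ "reference": {"external_path": "s3://bucket/videos/intro.mp4",
+               "position": 0, "size": 734003200, "managed": false}}
+```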
+
+### 2. Reader
+Readers will be updated to allow for lazy loading of the blob data, even when 
it is inline. This will help reduce memory pressure during shuffles in 
distributed query engines like Spark.
+The readers will return a reference to the blob data in the form of a path, position, and size. This applies to both inline and out-of-line storage.
+
+The user of the reader will then apply a UserDefinedFunction (UDF), or a similar engine-specific abstraction, to read the blob data from the reference when needed, as sketched below.
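+
+A minimal sketch of such a UDF for Spark follows; the class name and the assumption that references resolve to Hadoop-compatible paths are illustrative, not a final API:
+```java
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.Path;
+import org.apache.spark.sql.api.java.UDF3;
+
+// Hypothetical UDF: materializes blob bytes from the (path, position, size)
+// reference returned by the reader; works for inline and out-of-line storage.
+public class ReadBlobUdf implements UDF3<String, Long, Long, byte[]> {
+  @Override
+  public byte[] call(String path, Long position, Long size) throws Exception {
+    Path p = new Path(path);
+    try (FSDataInputStream in = p.getFileSystem(new Configuration()).open(p)) {
+      byte[] bytes = new byte[Math.toIntExact(size)];
+      in.readFully(position, bytes); // positioned read of just the blob's byte range
+      return bytes;
+    }
+  }
+}
+```
+It could then be registered, e.g. via `spark.udf().register("read_blob", new ReadBlobUdf(), DataTypes.BinaryType)`, and applied only when the bytes are actually needed.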
+
+### 3. Writer
+#### Phase 1: Basic Blob Support
+The writer will be updated to support writing blob data in both inline and 
out-of-line formats. 
+For out-of-line storage, the assumption is that the user will provide the external path, position, and size of the blob data, and these references will not be managed by Hudi (i.e. they are not removed by the cleaner).
+In this phase, we will not implement dynamic inline/out-of-line storage based 
on size thresholds.
+
+The writer should be flexible enough to allow the user to pass in a dataset 
with blob data as simple byte arrays or records matching the Blob schema 
defined above.
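+
+For illustration, the two input shapes could be constructed in Spark as follows (column names and the `storage_type` literal are hypothetical):
+```java
+import static org.apache.spark.sql.functions.*;
+
+import org.apache.spark.sql.Column;
+
+public class BlobInputShapes {
+  // Shape A: the blob column is plain bytes; the writer treats them as blob data.
+  static Column imageAsBytes() {
+    return col("image_bytes").alias("image");
+  }
+
+  // Shape B: the blob column is a record matching the Blob schema above, here
+  // an unmanaged out-of-line reference to a file the user already owns.
+  static Column videoAsReference() {
+    return struct(
+        lit("out_of_line").alias("storage_type"),
+        lit(null).cast("binary").alias("data"),
+        struct(
+            col("video_path").alias("external_path"),
+            lit(0L).alias("position"),
+            col("video_size").alias("size"),
+            lit(false).alias("managed")).alias("reference")).alias("video");
+  }
+}
+```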
+
+#### Phase 2: Dynamic Inline/Out-of-Line Storage
+In this phase, the writer will be updated to support dynamic inline/out-of-line storage based on user-configured size thresholds. The user will still be able to provide their own external path for out-of-line storage if desired.
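+
+A sketch of how this could surface in the writer API (the threshold option name is a placeholder, not a finalized config):
+```java
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SaveMode;
+
+public class DynamicBlobWrite {
+  // Hypothetical Phase 2 write: blobs at or below the threshold are stored
+  // inline; larger blobs are written out-of-line with Hudi-managed references,
+  // unless the user supplied an external reference of their own.
+  static void write(Dataset<Row> df, String basePath) {
+    df.write().format("hudi")
+        .option("hoodie.table.name", "multimedia_catalog")
+        .option("hoodie.blob.inline.max.size", String.valueOf(1024 * 1024))
+        .mode(SaveMode.Append)
+        .save(basePath);
+  }
+}
+```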

Review Comment:
   yes


