This is an automated email from the ASF dual-hosted git repository.

timbrown pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
     new 762fbea864e9 feat(blob): update approach to remove reliance on column 
groups, break down plan (#18013)
762fbea864e9 is described below

commit 762fbea864e9f79a814b0595b3445136484da507
Author: Tim Brown <[email protected]>
AuthorDate: Thu Feb 12 09:48:27 2026 -0500

    feat(blob): update approach to remove reliance on column groups, break down 
plan (#18013)
    
    * update approach to remove reliance on column groups, break down plan
    
    * fix graph, add 3rd approach for reference tracking
    
    * update based on feedback
    
    * update phases, lay out milestones
    
    * clarifications
    
    * update milestones to map more clearly back to sections, update sql 
function name, add WIP for compaction
    
    * lance limitations, link issues
    
    * add restriction
    
    * cleanup, update name
---
 rfc/rfc-100/rfc-100.md | 242 +++++++++++++++++++++++++++----------------------
 1 file changed, 134 insertions(+), 108 deletions(-)

diff --git a/rfc/rfc-100/rfc-100.md b/rfc/rfc-100/rfc-100.md
index 0802f2f67935..5c129d763727 100644
--- a/rfc/rfc-100/rfc-100.md
+++ b/rfc/rfc-100/rfc-100.md
@@ -63,11 +63,9 @@ column that already exists.
 
 ### Building on Existing Foundation
 
-This RFC leverages two key foundation pieces:
+This RFC builds on the following foundation:
 
-1. **RFC-80 Column Groups**: Provides the mechanism to split file groups 
across different column groups, enabling efficient storage of different data 
types within the same logical file group.
-
-2. **RFC-99 BLOB Types**: Introduces BINARY and LARGE_BINARY types to the Hudi 
type system, providing the type foundation for unstructured data storage.
+**RFC-99 BLOB Types**: Introduces BLOB types to the Hudi type system, 
providing the type foundation for unstructured data storage.
 
 ## Requirements 
 
@@ -75,141 +73,151 @@ Below are the high-level requirements for this feature.
 
 1. Users must be able to define tables with a mix of structured (current 
types) and unstructured (blob type)
    columns
-2. Records are distributed across file groups like regular Hudi storage layout 
into file groups. But within each
-   file group, structured and unstructured columns are split into different 
column groups. This way the table can
-   also scalably grow in terms of number of columns.
-3. Unstructured data can be stored inline (e.g small images right inside the 
column group) or out-of-line (e.g
-   pointer to a multi-GB video file someplace). This decision should be made 
dynamically during write/storage time.
+2. Unstructured data can be stored inline (e.g. small images right inside the file) or out-of-line (e.g. a
+   pointer to a multi-GB video file elsewhere). Out-of-line references can also include a (position, length) within the file, which
+   allows multiple blobs to be stored within a single file to reduce the number of files in storage.
+   This decision should be made dynamically at write/storage time.
 3. All table life-cycle operations and table services work seamlessly across both column types. For example, cleaning
-   the file slices should reclaim both inline and out-of-line blob data. 
Clustering should be able re-organize
-   records across file groups or even redistribute columns across column 
groups within the same file group.
-4. Storage should support different column group distributions i.e different 
membership of columns
-   across column groups, across file groups, to ensure users or table services 
can flexibly reconfigure all this as
-   table grows, without re-writing all of the data.
-5. Hudi should expose controls at the writer level, to control whether new 
columns are written to new column
-   groups or expand an existing column group within a file group.
+   the file slices should reclaim out-of-line blob data when the referenced blob is managed by Hudi.
+   Clustering should be able to re-organize records across file groups or even repack blobs if required.
+4. Hudi should expose controls at the writer level to control whether to store blobs inline or out-of-line
+   based on size thresholds. Sane defaults should be provided for an easy out-of-the-box experience, e.g. blobs < 1 MB are stored inline and blobs > 16 MB are always stored out-of-line.
+5. Query engines like Spark should be able to read the unstructured data and materialize the values lazily to reduce memory pressure and avoid massive data exchange volumes during shuffles.
 
 
 ## High-Level Design
 
-The design introduces a hybrid storage model where each file group can contain 
multiple column groups with different file formats optimized for their data 
types. Structured columns continue using 
-Parquet format, while unstructured columns can use specialized formats like 
Lance or optimized Parquet configurations or HFile for random-access.
+The design introduces an abstraction that allows inline and out-of-line storage of byte arrays and works seamlessly for the end user. Structured columns continue using
+any of the supported base-file formats, while unstructured data can use specialized formats like Lance, optimized Parquet configurations, or simply a pointer to a byte range within a file.
+
+### 1. Storage Abstraction
+We will add a `blob` type to the HoodieSchema that encapsulates both inline and out-of-line storage strategies. This will allow the user to seamlessly mix the two strategies.
+
+**Storage Schema**:
+```json
+{
+  "type": "record",
+  "name": "Blob",
+  "fields": [
+    {"name": "type", "type": {"type": "enum", "name": "BlobType", "symbols": ["INLINE", "OUT_OF_LINE"]}},
+    {"name": "data", "type": ["null", "bytes"]},
+    {"name": "reference", "type": ["null", {
+      "type": "record",
+      "name": "BlobReference",
+      "fields": [
+        {"name": "external_path", "type": "string"},
+        {"name": "offset", "type": "long"},
+        {"name": "length", "type": "long"},
+        {"name": "managed", "type": "boolean"}
+      ]
+    }]}
+  ]
+}
+```
+The `managed` flag will be used by the cleaner to determine if an out-of-line 
blob should be deleted when cleaning up old file slices. This allows users to 
point to existing files without Hudi deleting them.
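+
+As a rough illustration, a writer could populate the record above using Avro's generic API. This is only a sketch; `blobSchemaJson` (the schema text above) and `smallImageBytes` are placeholders, not part of the proposed API.
+```java
+import java.nio.ByteBuffer;
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+
+Schema blobSchema = new Schema.Parser().parse(blobSchemaJson);
+// Non-null branch of the ["null", BlobReference] union.
+Schema refSchema = blobSchema.getField("reference").schema().getTypes().get(1);
+
+// Inline blob: the bytes travel directly in the record.
+GenericRecord inlineBlob = new GenericData.Record(blobSchema);
+inlineBlob.put("type", new GenericData.EnumSymbol(blobSchema.getField("type").schema(), "INLINE"));
+inlineBlob.put("data", ByteBuffer.wrap(smallImageBytes));
+inlineBlob.put("reference", null);
+
+// Out-of-line blob: a pointer to a byte range in a user-owned external file.
+GenericRecord reference = new GenericData.Record(refSchema);
+reference.put("external_path", "s3://my-bucket/raw/videos/intro.mp4"); // hypothetical location
+reference.put("offset", 0L);
+reference.put("length", 734003200L);
+reference.put("managed", false); // user-owned file, so the cleaner must not delete it
+
+GenericRecord outOfLineBlob = new GenericData.Record(blobSchema);
+outOfLineBlob.put("type", new GenericData.EnumSymbol(blobSchema.getField("type").schema(), "OUT_OF_LINE"));
+outOfLineBlob.put("data", null);
+outOfLineBlob.put("reference", reference);
+```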
+
+#### Restrictions
+We will not support adding blobs as Map values or Array elements in the initial implementation, to reduce the complexity of reading and managing blob references.
+Blobs can still be nested within Structs/Records to allow for complex schemas.
 
-### 1. Mixed Base File Format Support
+### 2. Reader
+Readers will be updated to allow for lazy loading of the blob data, even when 
it is inline. This will help reduce memory pressure during shuffles in 
distributed query engines like Spark.
+The readers will return a reference to the blob data in the form of a path, position, and size. This applies to both inline and out-of-line storage.
+The reference returned for a row will be the latest one according to the table's defined merge mode. Similarly, when merging log and base files for compaction or clustering, the merge mode will determine which blob reference is returned for that row, just as it does for the other columns.
 
-**Per-Column Group Format Selection**: Each column group within a file group 
can use different base file formats:
-- **Structured Column Groups**: Continue using Parquet with standard 
optimizations
-- **Unstructured Column Groups**: Use Lance format for vector/embedding data 
or specially configured Parquet for BLOB storage
+We will provide the user with a prebuilt transform to efficiently read the out-of-line blob data. For example, for Spark datasets we will leverage a map-partitions operation to batch reads of blob data when rows reference ranges within the same file.
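+
+A hedged sketch of what such a transform might look like with the Spark Java API is shown below; `df`, `outputSchema`, and the column names (`id`, `blob_path`, `blob_offset`, `blob_length`) are assumptions for illustration, not the final API.
+```java
+import java.util.*;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.spark.api.java.function.MapPartitionsFunction;
+import org.apache.spark.sql.*;
+import org.apache.spark.sql.catalyst.encoders.RowEncoder;
+
+Dataset<Row> withBytes = df.mapPartitions((MapPartitionsFunction<Row, Row>) rows -> {
+  // Group the rows in this partition by the file holding their blob bytes so each
+  // file is opened once and served with positional reads.
+  Map<String, List<Row>> byFile = new HashMap<>();
+  rows.forEachRemaining(r ->
+      byFile.computeIfAbsent(r.getString(r.fieldIndex("blob_path")), k -> new ArrayList<>()).add(r));
+
+  FileSystem fs = FileSystem.get(new Configuration());
+  List<Row> out = new ArrayList<>();
+  for (Map.Entry<String, List<Row>> entry : byFile.entrySet()) {
+    try (FSDataInputStream in = fs.open(new Path(entry.getKey()))) {
+      for (Row r : entry.getValue()) {
+        byte[] buf = new byte[(int) r.getLong(r.fieldIndex("blob_length"))];
+        in.readFully(r.getLong(r.fieldIndex("blob_offset")), buf, 0, buf.length);
+        out.add(RowFactory.create(r.getString(r.fieldIndex("id")), buf));
+      }
+    }
+  }
+  return out.iterator();
+}, RowEncoder.apply(outputSchema)); // outputSchema: id STRING, image_bytes BINARY
+```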
 
-**Format Configuration**: File format is determined at column group creation 
time based on (per the current RFC-80). 
-But, ideally all these configurations should be automatic and Hudi should 
auto-generate colum group names and mappings.
+For Spark SQL we will provide a function that the user can leverage to 
materialize the bytes from the blobs. Example syntax:
+```sql
+SELECT id, url, read_blob(image_blob) as image_bytes FROM my_table;
+```
 
+### 3. Writer
+#### Phase 1: External Blob Support
+The writer will be updated to support writing blob data as out-of-line 
references. 
+For out-of-line storage, the assumption is that the user will provide the 
external path, position, and size of the blob data.
+In this phase, we will not implement inline storage or dynamic 
inline/out-of-line storage based on size thresholds.
 
+Users will also be able to create tables with Spark SQL through custom DDL support that allows them to specify a column as a BLOB type. Example syntax:
 ```sql
-CREATE TABLE multimedia_catalog (
-  id BIGINT,
-  product_name STRING,
-  category STRING,
-  image BINARY,
-  video LARGE_BINARY,
-  embeddings ARRAY<FLOAT>
-) USING HUDI
+CREATE TABLE my_table (
+    id STRING,
+    url STRING,
+    image_blob BLOB
+) USING hudi
 TBLPROPERTIES (
   'hoodie.table.type' = 'MERGE_ON_READ',
-  'hoodie.bucket.index.hash.field' = 'id',
-  'hoodie.columngroup.structured' = 'id,product_name,category;id',
-  'hoodie.columngroup.images' = 'id,image;id',
-  'hoodie.columngroup.videos' = 'id,video;id',
-  'hoodie.columngroup.ml' = 'id,embeddings;id',
-  'hoodie.columngroup.images.format' = 'parquet',
-  'hoodie.columngroup.videos.format' = 'lance',
-  'hoodie.columngroup.ml.format' = 'hfile'
+  'primaryKey' = 'id'
 )
 ```
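+
+As an illustration of a Phase 1 write, the blob column could be supplied as a struct matching the Blob schema, carrying the user-provided external path, offset, and length. The sketch below is hedged: the Hudi options are the standard datasource options, while `df`, `basePath`, and the dataframe layout are assumptions.
+```java
+// df columns: id STRING, url STRING,
+// image_blob STRUCT<type, data, reference(external_path, offset, length, managed)>
+// where the reference fields are provided by the user and managed = false.
+df.write()
+  .format("hudi")
+  .option("hoodie.table.name", "my_table")
+  .option("hoodie.datasource.write.recordkey.field", "id")
+  .option("hoodie.datasource.write.operation", "upsert")
+  .mode(SaveMode.Append)
+  .save(basePath);
+```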
 
-### 2. Dynamic Inline/Out-of-Line Storage
+#### Phase 2: Inline Support
+The writer will be updated to support writing blob data as inline byte arrays. 
These byte arrays will be stored directly in the base file format configured 
for the table.
+The supported file formats will be optimized for inline blob storage by setting the proper configurations for these columns, such as disabling compression, choosing the proper encoding, and disabling column-level statistics.
 
-**Storage Decision Logic**: During write time, Hudi determines storage 
strategy based on:
-- **Inline Storage**: BLOB data < 1MB stored directly in the column group 
file, to avoid excessive cloud storage API calls.
-- **Out-of-Line Storage**: Large BLOB data stored in dedicated object store 
locations with pointers in the main file, to avoid write amplification during 
updates.
+The writer should be flexible enough to allow the user to pass in a dataset 
with blob data as simple byte arrays or records matching the Blob schema 
defined above.
 
+#### Phase 3: Dynamic Inline/Out-of-Line Storage
+In this phase, the writer will be updated to support dynamic inline/out-of-line storage based on user-configured size thresholds. The user will still be able to provide their own external path for out-of-line storage if desired.
+When the user provides blob data in the form of byte arrays, the writer will take arrays larger than the configured threshold and write them to files. The user can configure the file format used for this storage (e.g. Parquet, HFile, Lance).
+The writer will then generate the appropriate BlobReference for the 
out-of-line storage and write that to the main file.
+Multiple blobs can be written to the same file to reduce the number of files 
created in storage. All of these blobs will belong to the same file group for 
ease of management.
 
-**Storage Pointer Schema**:
-```json
-{
-  "type": "record",
-  "name": "BlobPointer",
-  "fields": [
-    {"name": "storage_type", "type": "string"},
-    {"name": "size", "type": "long"},
-    {"name": "checksum", "type": "string"},
-    {"name": "external_path", "type": ["null", "string"]},
-    {"name": "compression", "type": ["null", "string"]}
-  ]
-}
-```
+**Configurations**: 
+- `hoodie.storage.blob.inline.threshold`: Size threshold in bytes for inline 
vs out-of-line storage.
+- `hoodie.storage.blob.outofline.container.maxElementSize`: Size threshold in 
bytes for blobs that can be stored within a container file. Blobs larger than 
this threshold will be stored in their own individual files.
+- `hoodie.storage.blob.outofline.container.maxFileSize`: Size threshold in 
bytes for maximum size of an out-of-line blob container file.
+- `hoodie.storage.blob.outofline.format`: File format to use for out-of-line 
blob storage.
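+
+For example, with the configurations above, a write could be tuned as follows (the threshold values are illustrative only, and `df`/`basePath` are placeholders):
+```java
+// Blobs up to 1 MB are inlined; larger blobs are packed into container files capped at
+// 512 MB, except blobs over 64 MB, which are written to their own files.
+df.write()
+  .format("hudi")
+  .option("hoodie.storage.blob.inline.threshold", String.valueOf(1024 * 1024))
+  .option("hoodie.storage.blob.outofline.container.maxElementSize", String.valueOf(64L * 1024 * 1024))
+  .option("hoodie.storage.blob.outofline.container.maxFileSize", String.valueOf(512L * 1024 * 1024))
+  .option("hoodie.storage.blob.outofline.format", "parquet")
+  .mode(SaveMode.Append)
+  .save(basePath);
+```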
 
 **External Storage Layout**:
 ```
-{table_path}/.hoodie/blobs/{partition}/{file_group_id}/{column_group}/{instant}/{blob_id}
+{table_path}/.hoodie/blobs/{partition}/{column_name}/{instant}/{blob_id}
 ```
-Alternatively, User should be able to specify external storage location per 
BLOB during writes, as needed.
-
-### 3. Parquet Optimization for BLOB Storage
 
+#### Writer Optimizations for Blob Storage
+##### Parquet
 For unstructured columns using Parquet (a configuration sketch follows this list):
 - **Disable Compression**: Avoid double compression of already compressed 
media files
 - **Plain Encoding**: Use PLAIN encoding instead of dictionary encoding for 
BLOB columns
 - **Large Page Sizes**: Configure larger page sizes to optimize for sequential 
BLOB access
 - **Metadata Index**: Maintain BLOB metadata in Hudi metadata table for 
efficient retrieval of a single blob value.
 - **Disable stats**: Not very useful for BLOB columns
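+
+A hedged sketch of how these settings map onto the standard parquet-hadoop configuration keys; the exact wiring inside the Hudi writer is an implementation detail:
+```java
+import org.apache.hadoop.conf.Configuration;
+
+Configuration parquetConf = new Configuration();
+// Media payloads are typically already compressed, so skip codec compression.
+parquetConf.set("parquet.compression", "UNCOMPRESSED");
+// Dictionary encoding is useless for large, mostly unique byte arrays; fall back to PLAIN.
+parquetConf.setBoolean("parquet.enable.dictionary", false);
+// Larger pages favor sequential reads of big values.
+parquetConf.setInt("parquet.page.size", 8 * 1024 * 1024);
+// Column statistics carry no pruning value for blob bytes; the writer would additionally
+// skip collecting min/max statistics for these columns.
+```
+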
-
-### 4. Lance Format Integration
-
-**Lance Advantages for Unstructured Data**:
+##### Lance
 - Native support for high-dimensional vectors and embeddings
 - Efficient columnar storage for mixed structured/unstructured data
 - Better compression for certain unstructured data types
 
-Supporting Lance, working across Hudi + Lance communities will help users 
unlock benefits of both currently supported 
-file formats in Hudi (parquet, orc), along with benefits of Lance. Over time, 
we could also incorporate newer emerging 
-file formats in the space and other well-established unstructured file formats.
-
-### 5. Enhanced Table Services
-
-**Cleaning Service Extensions**:
-- Track external BLOB references in metadata table
-- Implement cascading deletion of external BLOB files during cleaning
-- Add BLOB-specific retention policies, using reference counting to reclaim 
out-of-line blobs.
-
-**Compaction Service Extensions**:
-- Support cross-format compaction (merge Lance and Parquet column groups)
-- Implement BLOB deduplication during major compaction
-- Optimize external BLOB consolidation
-
-**Clustering Service Extensions**:
-- Enable redistribution of BLOB data across file groups
-- Support column group reconfiguration during clustering
-- Implement BLOB-aware data skipping strategies
-
-### 6. Flexible Column Group Management
-
-**Dynamic Column Group Creation**:
-```java
-// Writer API extensions
-HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
-  .withColumnGroupStrategy(ADAPTIVE) // AUTO, FIXED, ADAPTIVE
-  .withNewColumnGroupThreshold(100_000_000L) // 100MB
-  .withBlobStorageThreshold(1_048_576L) // 1MB
-  .build();
-```
+Track Lance Integration progress in this issue: 
https://github.com/apache/hudi/issues/14127
+
+Current Limitations in Lance:
+- Cannot lazily read back the blob references in the Java reader: 
https://github.com/lance-format/lance/issues/5167
+- No ability to write the Lance format to Java input/output streams for log blocks
+
+Over time, we could also incorporate newer emerging file formats in the space 
and other well-established unstructured file formats.
+
+### 4. Table Services
+#### Cleaning
+The cleaning service will be updated to identify the out-of-line blob 
references that are managed by Hudi and no longer referenced by any active file 
slices.
+To identify these references, we have three options:
+1. Scan all active file slices to build a set of referenced blob IDs and then 
scan the file slices being removed to identify references in the removed slices 
that are not in the active set.
+2. Maintain metadata on the blob references contained in the file in the 
footer or metadata section of each base and log file. The cleaner can then read 
this metadata to identify blob references in the removed slices and check if 
they are still referenced in active slices.
+3. Maintain an index in the metadata table that tracks all blob references and 
their reference counts. The cleaner can then use this index to identify 
unreferenced blobs.
 
-**Column Group Reconfiguration**:
-- Support splitting existing column groups when they grow too large
-- Enable merging small column groups during maintenance operations
-- Allow migration of columns between column groups
+**Option 1 will be implemented in milestone 1.**
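+
+A minimal sketch of option 1, expressed over collections of `BlobReference` values already extracted from the active and to-be-removed file slices (the extraction step is not shown, and the accessor names assume an Avro-generated `BlobReference` class):
+```java
+import java.util.*;
+
+// Managed, out-of-line references that are safe to delete: present in the removed
+// slices but no longer referenced by any active slice.
+static List<BlobReference> findUnreferencedBlobs(List<BlobReference> activeRefs,
+                                                 List<BlobReference> removedRefs) {
+  Set<String> active = new HashSet<>();
+  for (BlobReference ref : activeRefs) {
+    active.add(ref.getExternalPath() + ":" + ref.getOffset() + ":" + ref.getLength());
+  }
+  List<BlobReference> deletable = new ArrayList<>();
+  for (BlobReference ref : removedRefs) {
+    String key = ref.getExternalPath() + ":" + ref.getOffset() + ":" + ref.getLength();
+    if (ref.getManaged() && !active.contains(key)) {
+      deletable.add(ref);
+    }
+  }
+  return deletable;
+}
+```
+Note that deleting an underlying container file additionally requires that none of the other blobs packed into that file are still referenced.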
 
-### 7. Query Engine Integration
+**Note**: This is only required for out-of-line blobs that are managed by 
Hudi. Out-of-line blobs that are not managed by Hudi will not be deleted by the 
cleaner. This will be part of `Phase 2` of the writer implementation.
+
+#### [WIP] Blob Compaction
+We will introduce a new form of compaction that allows for repacking of 
out-of-line blobs managed by Hudi to reduce the number of files in storage.
+The compaction will scan the out-of-line blob references that are active within the file group and repack them into new container files based on a user-configured target file size.
+The repacking will also order the blobs according to the ordering of rows in the base file to improve read locality.
+
+#### Clustering
+Since the out-of-line blobs are part of the file group, managed references 
will need to be updated as part of the clustering operation.
+The clustering will need to read the blob references from the source file 
groups and rewrite them to new target file groups. 
+These new files will be created in the same manner as in the write path, using the configured inline/out-of-line storage strategy.
+
+### 5. Query Engine Integration
 
 **Spark Integration**:
 - Extend DataSource API to handle mixed column group formats
@@ -222,12 +230,30 @@ HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
 - Efficient BLOB streaming for distributed ML workloads
 - Integration with Ray's object store for large BLOB caching
 
-### 8. Metadata Table Extensions
+### 6. Potential Optimizations and Future Work
+
+- Store & maintain indexes for out-of-line blob storage for faster lookups
+- Optimize query planning to minimize the number of out-of-line blob reads during query execution
+
+## Development Plan
+
+### Milestone 1: External Blob Support
+At the end of milestone 1, the user will be able to store blobs as references 
to files and read the data back from those references through Spark dataframes 
or SQL as described in the Reader support above. More details can be found in 
Write Phase 1 and Read Support sections above. 
+If the blob is marked as managed, the cleaner will clean it up when all 
references to that blob are removed. We will use the first approach described 
above in this milestone.
+
+### Milestone 2: Inline Blob Support
+At the end of milestone 2, the user will be able to store blobs as byte arrays 
directly in the base file format. This will include optimizations at the base 
file writer to better handle these large byte array fields. The developer 
experience should be improved to allow the user to pass the byte arrays 
directly without having to construct the full blob struct defined above. More 
details can be found in Write Phase 2 above.
+
+### Milestone 3: Dynamic Inline/Out-of-Line Storage
+At the end of milestone 3, Hudi will be able to pack blobs into container 
files for the user based on the configured thresholds. The user will also be 
able to configure the file format used for out-of-line blob storage. The 
cleaner support added in milestone 1 will support cleaning these container 
files when the blobs within them are no longer referenced.
+As part of this milestone, a new blob compaction service will be added that 
will allow for repacking of out-of-line blobs into new container files to 
reduce the number of files in storage.
+More details can be found in Write Phase 3 and the Table Services section 
above.
 
-- Track BLOB references for garbage collection
-- Store maintain indexes for parquet based blob storage
-- Maintain size statistics for storage optimization
-- Support BLOB-based query optimization
+### Milestone 4: Optimization
+After the foundational work is done, there will be opportunities for further 
optimizations such as:
+- Lazily reading inline blob data by returning a reference to that blob's data 
within the base file.
+- Ensuring that the Spark SQL plan is optimized to minimize the number of blob values that need to be materialized during query execution. For example, if a filter is applied on a non-blob column, we should try to apply that filter before materializing any blob values.
+- Implementing a metadata index for blob references to speed up blob retrieval 
and cleaning
 
 ## Rollout/Adoption Plan
 
