vinothchandar commented on code in PR #18013:
URL: https://github.com/apache/hudi/pull/18013#discussion_r2790735444
##########
rfc/rfc-100/rfc-100.md:
##########
@@ -63,153 +63,151 @@ column that already exists.
### Building on Existing Foundation
-This RFC leverages two key foundation pieces:
+This RFC leverages the following foundation:
-1. **RFC-80 Column Groups**: Provides the mechanism to split file groups
across different column groups, enabling efficient storage of different data
types within the same logical file group.
-
-2. **RFC-99 BLOB Types**: Introduces BINARY and LARGE_BINARY types to the Hudi
type system, providing the type foundation for unstructured data storage.
+**RFC-99 BLOB Types**: Introduces BLOB types to the Hudi type system,
providing the type foundation for unstructured data storage.
## Requirements
Below are the high-level requirements for this feature.
1. Users must be able to define tables with a mix of structured (current
types) and unstructured (blob type)
columns
-2. Records are distributed across file groups like regular Hudi storage layout
into file groups. But within each
- file group, structured and unstructured columns are split into different
column groups. This way the table can
- also scalably grow in terms of number of columns.
-3. Unstructured data can be stored inline (e.g small images right inside the
column group) or out-of-line (e.g
- pointer to a multi-GB video file someplace). This decision should be made
dynamically during write/storage time.
+2. Unstructured data can be stored inline (e.g small images right inside the
file) or out-of-line (e.g
+ pointer to a multi-GB video file someplace). Out-of-line references can
also include a (position, length) within the file which
+ allows multiple blobs to be stored within a single file to reduce the
number of files in storage.
+ This decision should be made dynamically during write/storage time.
3. All table life-cycle operations and table services work seamlessly across
both column types. For e.g., cleaning
- the file slices should reclaim both inline and out-of-line blob data.
Clustering should be able re-organize
- records across file groups or even redistribute columns across column
groups within the same file group.
-4. Storage should support different column group distributions i.e different
membership of columns
- across column groups, across file groups, to ensure users or table services
can flexibly reconfigure all this as
- table grows, without re-writing all of the data.
-5. Hudi should expose controls at the writer level, to control whether new
columns are written to new column
- groups or expand an existing column group within a file group.
+ the file slices should reclaim out-of-line blob data when the referred blob
is managed by Hudi.
+ Clustering should be able to re-organize records across file groups or even
repack blobs if required.
+4. Hudi should expose controls at the writer level, to control whether to
store blobs inline or out-of-line
 based on size thresholds. Sane defaults should be supported for an easy
out-of-the-box experience, e.g. < 1 MB is stored inline; > 16 MB is always stored
out-of-line.
+5. Query engines like Spark should be able to read the unstructured data and
materialize the values lazily to reduce memory pressure and massive data
exchange volumes during shuffles.
## High-Level Design
-The design introduces a hybrid storage model where each file group can contain
multiple column groups with different file formats optimized for their data
types. Structured columns continue using
-Parquet format, while unstructured columns can use specialized formats like
Lance or optimized Parquet configurations or HFile for random-access.
+The design introduces an abstraction that allows inline and out-of-line
storage of byte arrays that work seamlessly for the end user. Structured
columns continue using
+Parquet format, while unstructured data can use specialized formats like Lance
or optimized Parquet configurations or HFile for random-access.
-### 1. Mixed Base File Format Support
+### 1. Storage Abstraction
+We will add a `blob` type to the HoodieSchema that encapsulates both inline
and out-of-line storage strategies. This will allow the user to use a mix of
storage strategies seamlessly.
-**Per-Column Group Format Selection**: Each column group within a file group
can use different base file formats:
-- **Structured Column Groups**: Continue using Parquet with standard
optimizations
-- **Unstructured Column Groups**: Use Lance format for vector/embedding data
or specially configured Parquet for BLOB storage
+**Storage Schema**:
+```json
+{
+ "type": "record",
+ "name": "Blob",
+ "fields": [
+ {"name": "storage_type", "type": "string"},
+ {"name": "data", "type": ["null", "bytes"]},
+ {"name": "reference", "type": ["null", {
+ "type": "record",
+ "name": "BlobReference",
+ "fields": [
+ {"name": "external_path", "type": "string"},
+ {"name": "offset", "type": "long"},
+ {"name": "length", "type": "long"},
+ {"name": "managed", "type": "boolean"}
+ ]
+ }]}
+ ]
+}
+```
+The `managed` flag will be used by the cleaner to determine if an out-of-line
blob should be deleted when cleaning up old file slices. This allows users to
point to existing files without Hudi deleting them.
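To make the schema concrete, here is an illustrative sketch using plain Python dicts shaped like the proposed Avro `Blob` record. The `storage_type` string values and the helper names are assumptions for illustration, not defined by the RFC.

```python
def inline_blob(data):
    # Small payloads travel inside the row itself.
    return {"storage_type": "inline", "data": data, "reference": None}

def out_of_line_blob(path, offset, length, managed):
    # Large payloads stay external; the row carries only a reference.
    return {
        "storage_type": "out_of_line",
        "data": None,
        "reference": {
            "external_path": path,
            "offset": offset,
            "length": length,
            # Only managed blobs may be deleted by the cleaner.
            "managed": managed,
        },
    }

small = inline_blob(b"\x89PNG\r\n")
big = out_of_line_blob("s3://bucket/videos/v1.mp4", 0, 3_000_000_000, managed=False)
```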
-**Format Configuration**: File format is determined at column group creation
time based on (per the current RFC-80).
-But, ideally all these configurations should be automatic and Hudi should
auto-generate colum group names and mappings.
+### 2. Reader
+Readers will be updated to allow for lazy loading of the blob data, even when
it is inline. This will help reduce memory pressure during shuffles in
distributed query engines like Spark.
+The readers will return a reference to the blob data in the form of a path,
position, and size. This applies for both inline and out-of-line storage.
+We will provide the user with a prebuilt transform to effectively read the
out-of-line blob data. For example, for Spark datasets we will leverage a
Map-Partitions to batch requests to read blob data when the rows correspond to
ranges within the same file.
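The batching described above could look like the following sketch, where local files stand in for cloud storage and the function name is hypothetical: references are grouped by file so each file is opened once and its ranges are read in position order.

```python
import os
import tempfile
from collections import defaultdict

def read_blob_ranges(references):
    """Batch out-of-line reads: group (path, position, size) references by
    file, then serve each file's ranges through one handle in position order."""
    by_file = defaultdict(list)
    for path, position, size in references:
        by_file[path].append((position, size))
    results = {}
    for path, ranges in by_file.items():
        with open(path, "rb") as f:
            for position, size in sorted(ranges):
                f.seek(position)
                results[(path, position, size)] = f.read(size)
    return results

# Demo with a local "container" file standing in for cloud storage.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello world blobs")
    container = tmp.name
out = read_blob_ranges([(container, 0, 5), (container, 6, 5)])
print(out[(container, 0, 5)])  # b'hello'
os.remove(container)
```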
+For Spark SQL we will provide a function that the user can leverage to
materialize the bytes from the blobs. Example syntax:
```sql
-CREATE TABLE multimedia_catalog (
- id BIGINT,
- product_name STRING,
- category STRING,
- image BINARY,
- video LARGE_BINARY,
- embeddings ARRAY<FLOAT>
-) USING HUDI
+SELECT id, url, resolve_blob(image_blob) as image_bytes FROM my_table;
+```
+
+### 3. Writer
+#### Phase 1: External Blob Support
Review Comment:
how do phases relate to milestones below?
##########
rfc/rfc-100/rfc-100.md:
##########
@@ -63,153 +63,129 @@ column that already exists.
### Building on Existing Foundation
-This RFC leverages two key foundation pieces:
+This RFC leverages the following foundation:
-1. **RFC-80 Column Groups**: Provides the mechanism to split file groups
across different column groups, enabling efficient storage of different data
types within the same logical file group.
-
-2. **RFC-99 BLOB Types**: Introduces BINARY and LARGE_BINARY types to the Hudi
type system, providing the type foundation for unstructured data storage.
+1. **RFC-99 BLOB Types**: Introduces BLOB types to the Hudi type system,
providing the type foundation for unstructured data storage.
## Requirements
Below are the high-level requirements for this feature.
1. Users must be able to define tables with a mix of structured (current
types) and unstructured (blob type)
columns
-2. Records are distributed across file groups like regular Hudi storage layout
into file groups. But within each
- file group, structured and unstructured columns are split into different
column groups. This way the table can
- also scalably grow in terms of number of columns.
-3. Unstructured data can be stored inline (e.g small images right inside the
column group) or out-of-line (e.g
- pointer to a multi-GB video file someplace). This decision should be made
dynamically during write/storage time.
+2. Unstructured data can be stored inline (e.g small images right inside the
file) or out-of-line (e.g
+ pointer to a multi-GB video file someplace). Out-of-line references can
also include a position within the file which
+ allows multiple blobs to be stored within a single file to reduce the
number of files in storage.
+ This decision should be made dynamically during write/storage time.
3. All table life-cycle operations and table services work seamlessly across
both column types. For e.g., cleaning
- the file slices should reclaim both inline and out-of-line blob data.
Clustering should be able re-organize
- records across file groups or even redistribute columns across column
groups within the same file group.
-4. Storage should support different column group distributions i.e different
membership of columns
- across column groups, across file groups, to ensure users or table services
can flexibly reconfigure all this as
- table grows, without re-writing all of the data.
-5. Hudi should expose controls at the writer level, to control whether new
columns are written to new column
- groups or expand an existing column group within a file group.
+ the file slices should reclaim out-of-line blob data when the reference is
managed by Hudi.
+ Clustering should be able to re-organize records across file groups or even
repack blobs if required.
+4. Hudi should expose controls at the writer level, to control whether to
store blobs inline or out-of-line
+ based on size thresholds.
+5. Query engines like Spark should be able to read the unstructured data and
materialize the values lazily to reduce memory pressure during shuffles.
## High-Level Design
-The design introduces a hybrid storage model where each file group can contain
multiple column groups with different file formats optimized for their data
types. Structured columns continue using
-Parquet format, while unstructured columns can use specialized formats like
Lance or optimized Parquet configurations or HFile for random-access.
-
-### 1. Mixed Base File Format Support
-
-**Per-Column Group Format Selection**: Each column group within a file group
can use different base file formats:
-- **Structured Column Groups**: Continue using Parquet with standard
optimizations
-- **Unstructured Column Groups**: Use Lance format for vector/embedding data
or specially configured Parquet for BLOB storage
-
-**Format Configuration**: File format is determined at column group creation
time based on (per the current RFC-80).
-But, ideally all these configurations should be automatic and Hudi should
auto-generate colum group names and mappings.
-
-
-```sql
-CREATE TABLE multimedia_catalog (
- id BIGINT,
- product_name STRING,
- category STRING,
- image BINARY,
- video LARGE_BINARY,
- embeddings ARRAY<FLOAT>
-) USING HUDI
-TBLPROPERTIES (
- 'hoodie.table.type' = 'MERGE_ON_READ',
- 'hoodie.bucket.index.hash.field' = 'id',
- 'hoodie.columngroup.structured' = 'id,product_name,category;id',
- 'hoodie.columngroup.images' = 'id,image;id',
- 'hoodie.columngroup.videos' = 'id,video;id',
- 'hoodie.columngroup.ml' = 'id,embeddings;id',
- 'hoodie.columngroup.images.format' = 'parquet',
- 'hoodie.columngroup.videos.format' = 'lance',
- 'hoodie.columngroup.ml.format' = 'hfile'
-)
-```
-
-### 2. Dynamic Inline/Out-of-Line Storage
+The design introduces an abstraction that allows inline and out-of-line
storage of byte arrays that work seamlessly for the end user. Structured
columns continue using
+Parquet format, while unstructured data can use specialized formats like Lance
or optimized Parquet configurations or HFile for random-access.
Review Comment:
okay my question is different. Above we stated splitting structured and
unstructured columns presumably within the same table? In this case, would a
file slice have two base file formats?
(or)
What you intended was around the "container" or "packed" files approach to
store blobs alone differently?
##########
rfc/rfc-100/rfc-100.md:
##########
@@ -63,153 +63,129 @@ column that already exists.
### Building on Existing Foundation
-This RFC leverages two key foundation pieces:
+This RFC leverages the following foundation:
-1. **RFC-80 Column Groups**: Provides the mechanism to split file groups
across different column groups, enabling efficient storage of different data
types within the same logical file group.
-
-2. **RFC-99 BLOB Types**: Introduces BINARY and LARGE_BINARY types to the Hudi
type system, providing the type foundation for unstructured data storage.
+1. **RFC-99 BLOB Types**: Introduces BLOB types to the Hudi type system,
providing the type foundation for unstructured data storage.
## Requirements
Below are the high-level requirements for this feature.
1. Users must be able to define tables with a mix of structured (current
types) and unstructured (blob type)
columns
-2. Records are distributed across file groups like regular Hudi storage layout
into file groups. But within each
- file group, structured and unstructured columns are split into different
column groups. This way the table can
- also scalably grow in terms of number of columns.
-3. Unstructured data can be stored inline (e.g small images right inside the
column group) or out-of-line (e.g
- pointer to a multi-GB video file someplace). This decision should be made
dynamically during write/storage time.
+2. Unstructured data can be stored inline (e.g small images right inside the
file) or out-of-line (e.g
+ pointer to a multi-GB video file someplace). Out-of-line references can
also include a position within the file which
+ allows multiple blobs to be stored within a single file to reduce the
number of files in storage.
+ This decision should be made dynamically during write/storage time.
3. All table life-cycle operations and table services work seamlessly across
both column types. For e.g., cleaning
- the file slices should reclaim both inline and out-of-line blob data.
Clustering should be able re-organize
- records across file groups or even redistribute columns across column
groups within the same file group.
-4. Storage should support different column group distributions i.e different
membership of columns
- across column groups, across file groups, to ensure users or table services
can flexibly reconfigure all this as
- table grows, without re-writing all of the data.
-5. Hudi should expose controls at the writer level, to control whether new
columns are written to new column
- groups or expand an existing column group within a file group.
+ the file slices should reclaim out-of-line blob data when the reference is
managed by Hudi.
+ Clustering should be able to re-organize records across file groups or even
repack blobs if required.
+4. Hudi should expose controls at the writer level, to control whether to
store blobs inline or out-of-line
+ based on size thresholds.
+5. Query engines like Spark should be able to read the unstructured data and
materialize the values lazily to reduce memory pressure during shuffles.
## High-Level Design
-The design introduces a hybrid storage model where each file group can contain
multiple column groups with different file formats optimized for their data
types. Structured columns continue using
-Parquet format, while unstructured columns can use specialized formats like
Lance or optimized Parquet configurations or HFile for random-access.
-
-### 1. Mixed Base File Format Support
-
-**Per-Column Group Format Selection**: Each column group within a file group
can use different base file formats:
-- **Structured Column Groups**: Continue using Parquet with standard
optimizations
-- **Unstructured Column Groups**: Use Lance format for vector/embedding data
or specially configured Parquet for BLOB storage
-
-**Format Configuration**: File format is determined at column group creation
time based on (per the current RFC-80).
-But, ideally all these configurations should be automatic and Hudi should
auto-generate colum group names and mappings.
-
-
-```sql
-CREATE TABLE multimedia_catalog (
- id BIGINT,
- product_name STRING,
- category STRING,
- image BINARY,
- video LARGE_BINARY,
- embeddings ARRAY<FLOAT>
-) USING HUDI
-TBLPROPERTIES (
- 'hoodie.table.type' = 'MERGE_ON_READ',
- 'hoodie.bucket.index.hash.field' = 'id',
- 'hoodie.columngroup.structured' = 'id,product_name,category;id',
- 'hoodie.columngroup.images' = 'id,image;id',
- 'hoodie.columngroup.videos' = 'id,video;id',
- 'hoodie.columngroup.ml' = 'id,embeddings;id',
- 'hoodie.columngroup.images.format' = 'parquet',
- 'hoodie.columngroup.videos.format' = 'lance',
- 'hoodie.columngroup.ml.format' = 'hfile'
-)
-```
-
-### 2. Dynamic Inline/Out-of-Line Storage
+The design introduces an abstraction that allows inline and out-of-line
storage of byte arrays that work seamlessly for the end user. Structured
columns continue using
+Parquet format, while unstructured data can use specialized formats like Lance
or optimized Parquet configurations or HFile for random-access.
-**Storage Decision Logic**: During write time, Hudi determines storage
strategy based on:
-- **Inline Storage**: BLOB data < 1MB stored directly in the column group
file, to avoid excessive cloud storage API calls.
-- **Out-of-Line Storage**: Large BLOB data stored in dedicated object store
locations with pointers in the main file, to avoid write amplification during
updates.
+### 1. Storage Abstraction
+We will add a Blob type to the HoodieSchema that encapsulates both inline and
out-of-line storage strategies. This will allow the user to use a mix of
storage strategies seamlessly.
-
-**Storage Pointer Schema**:
+**Storage Schema**:
```json
{
"type": "record",
- "name": "BlobPointer",
+ "name": "Blob",
"fields": [
{"name": "storage_type", "type": "string"},
- {"name": "size", "type": "long"},
- {"name": "checksum", "type": "string"},
- {"name": "external_path", "type": ["null", "string"]},
- {"name": "compression", "type": ["null", "string"]}
+ {"name": "data", "type": ["null", "bytes"]},
+ {"name": "reference", "type": ["null", {
+ "type": "record",
+ "name": "BlobReference",
+ "fields": [
+ {"name": "external_path", "type": "string"},
+ {"name": "position", "type": "long"},
+ {"name": "size", "type": "long"},
+ {"name": "managed", "type": "boolean"}
+ ]
+ }]}
]
}
```
+The `managed` flag will be used by the cleaner to determine if an out-of-line
blob should be deleted when cleaning up old file slices. This allows users to
point to existing files without Hudi deleting them.
+
+### 2. Reader
+Readers will be updated to allow for lazy loading of the blob data, even when
it is inline. This will help reduce memory pressure during shuffles in
distributed query engines like Spark.
+The readers will return a reference to the blob data in the form of a path,
position, and size. This applies for both inline and out-of-line storage.
Review Comment:
okay. we are focussing on CoW for lance, for now. I take thats the
resolution here for now
##########
rfc/rfc-100/rfc-100.md:
##########
@@ -63,153 +63,152 @@ column that already exists.
### Building on Existing Foundation
-This RFC leverages two key foundation pieces:
+This RFC leverages the following foundation:
-1. **RFC-80 Column Groups**: Provides the mechanism to split file groups
across different column groups, enabling efficient storage of different data
types within the same logical file group.
-
-2. **RFC-99 BLOB Types**: Introduces BINARY and LARGE_BINARY types to the Hudi
type system, providing the type foundation for unstructured data storage.
+**RFC-99 BLOB Types**: Introduces BLOB types to the Hudi type system,
providing the type foundation for unstructured data storage.
## Requirements
Below are the high-level requirements for this feature.
1. Users must be able to define tables with a mix of structured (current
types) and unstructured (blob type)
columns
-2. Records are distributed across file groups like regular Hudi storage layout
into file groups. But within each
- file group, structured and unstructured columns are split into different
column groups. This way the table can
- also scalably grow in terms of number of columns.
-3. Unstructured data can be stored inline (e.g small images right inside the
column group) or out-of-line (e.g
- pointer to a multi-GB video file someplace). This decision should be made
dynamically during write/storage time.
+2. Unstructured data can be stored inline (e.g small images right inside the
file) or out-of-line (e.g
+ pointer to a multi-GB video file someplace). Out-of-line references can
also include a (position, length) within the file which
+ allows multiple blobs to be stored within a single file to reduce the
number of files in storage.
+ This decision should be made dynamically during write/storage time.
3. All table life-cycle operations and table services work seamlessly across
both column types. For e.g., cleaning
- the file slices should reclaim both inline and out-of-line blob data.
Clustering should be able re-organize
- records across file groups or even redistribute columns across column
groups within the same file group.
-4. Storage should support different column group distributions i.e different
membership of columns
- across column groups, across file groups, to ensure users or table services
can flexibly reconfigure all this as
- table grows, without re-writing all of the data.
-5. Hudi should expose controls at the writer level, to control whether new
columns are written to new column
- groups or expand an existing column group within a file group.
+ the file slices should reclaim out-of-line blob data when the referred blob
is managed by Hudi.
+ Clustering should be able to re-organize records across file groups or even
repack blobs if required.
+4. Hudi should expose controls at the writer level, to control whether to
store blobs inline or out-of-line
 based on size thresholds. Sane defaults should be supported for an easy
out-of-the-box experience, e.g. < 1 MB is stored inline; > 16 MB is always stored
out-of-line.
+5. Query engines like Spark should be able to read the unstructured data and
materialize the values lazily to reduce memory pressure and massive data
exchange volumes during shuffles.
## High-Level Design
-The design introduces a hybrid storage model where each file group can contain
multiple column groups with different file formats optimized for their data
types. Structured columns continue using
-Parquet format, while unstructured columns can use specialized formats like
Lance or optimized Parquet configurations or HFile for random-access.
+The design introduces an abstraction that allows inline and out-of-line
storage of byte arrays that work seamlessly for the end user. Structured
columns continue using
+Parquet format, while unstructured data can use specialized formats like Lance
or optimized Parquet configurations or HFile for random-access.
-### 1. Mixed Base File Format Support
+### 1. Storage Abstraction
+We will add a `blob` type to the HoodieSchema that encapsulates both inline
and out-of-line storage strategies. This will allow the user to use a mix of
storage strategies seamlessly.
-**Per-Column Group Format Selection**: Each column group within a file group
can use different base file formats:
-- **Structured Column Groups**: Continue using Parquet with standard
optimizations
-- **Unstructured Column Groups**: Use Lance format for vector/embedding data
or specially configured Parquet for BLOB storage
+**Storage Schema**:
+```json
+{
+ "type": "record",
+ "name": "Blob",
+ "fields": [
+ {"name": "storage_type", "type": "string"},
+ {"name": "data", "type": ["null", "bytes"]},
+ {"name": "reference", "type": ["null", {
+ "type": "record",
+ "name": "BlobReference",
+ "fields": [
+ {"name": "external_path", "type": "string"},
+ {"name": "offset", "type": "long"},
+ {"name": "length", "type": "long"},
+ {"name": "managed", "type": "boolean"}
+ ]
+ }]}
+ ]
+}
+```
+The `managed` flag will be used by the cleaner to determine if an out-of-line
blob should be deleted when cleaning up old file slices. This allows users to
point to existing files without Hudi deleting them.
-**Format Configuration**: File format is determined at column group creation
time based on (per the current RFC-80).
-But, ideally all these configurations should be automatic and Hudi should
auto-generate colum group names and mappings.
+### 2. Reader
+Readers will be updated to allow for lazy loading of the blob data, even when
it is inline. This will help reduce memory pressure during shuffles in
distributed query engines like Spark.
+The readers will return a reference to the blob data in the form of a path,
position, and size. This applies for both inline and out-of-line storage.
+The reference will be the latest for that row based on the table's defined
merge mode. Similarly, when merging log and base files for compaction or
clustering, the merge mode will define which blob reference is returned for
that row just like the other columns.
+We will provide the user with a prebuilt transform to effectively read the
out-of-line blob data. For example, for Spark datasets we will leverage a
Map-Partitions to batch requests to read blob data when the rows correspond to
ranges within the same file.
+For Spark SQL we will provide a function that the user can leverage to
materialize the bytes from the blobs. Example syntax:
```sql
-CREATE TABLE multimedia_catalog (
- id BIGINT,
- product_name STRING,
- category STRING,
- image BINARY,
- video LARGE_BINARY,
- embeddings ARRAY<FLOAT>
-) USING HUDI
+SELECT id, url, resolve_blob(image_blob) as image_bytes FROM my_table;
+```
+
+### 3. Writer
+#### Phase 1: External Blob Support
+The writer will be updated to support writing blob data as out-of-line
references.
+For out-of-line storage, the assumption is that the user will provide the
external path, position, and size of the blob data.
+In this phase, we will not implement inline storage or dynamic
inline/out-of-line storage based on size thresholds.
+
+Users will be able to create tables with Spark SQL as well by defining custom
DDL that allows them to specify a column as a BLOB type. Example syntax:
+```sql
+CREATE TABLE my_table (
+ id STRING,
+ url STRING,
+ image_blob BLOB
+) USING hudi
TBLPROPERTIES (
'hoodie.table.type' = 'MERGE_ON_READ',
- 'hoodie.bucket.index.hash.field' = 'id',
- 'hoodie.columngroup.structured' = 'id,product_name,category;id',
- 'hoodie.columngroup.images' = 'id,image;id',
- 'hoodie.columngroup.videos' = 'id,video;id',
- 'hoodie.columngroup.ml' = 'id,embeddings;id',
- 'hoodie.columngroup.images.format' = 'parquet',
- 'hoodie.columngroup.videos.format' = 'lance',
- 'hoodie.columngroup.ml.format' = 'hfile'
  'primaryKey' = 'id'
)
```
-### 2. Dynamic Inline/Out-of-Line Storage
+### Phase 2: Inline Support
+The writer will be updated to support writing blob data as inline byte arrays.
These byte arrays will be stored directly in the base file format configured
for the table.
+The supported file formats will be optimized for inline blob storage by
setting the proper configurations for these columns such as removing
compression, setting the proper encoding, and disabling the column level
statistics.
-**Storage Decision Logic**: During write time, Hudi determines storage
strategy based on:
-- **Inline Storage**: BLOB data < 1MB stored directly in the column group
file, to avoid excessive cloud storage API calls.
-- **Out-of-Line Storage**: Large BLOB data stored in dedicated object store
locations with pointers in the main file, to avoid write amplification during
updates.
+The writer should be flexible enough to allow the user to pass in a dataset
with blob data as simple byte arrays or records matching the Blob schema
defined above.
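A minimal sketch of that flexibility, assuming a hypothetical normalization helper inside the writer that accepts either raw bytes or a record already matching the Blob schema:

```python
def normalize_blob_input(value):
    """Accept raw bytes or a dict already shaped like the Blob schema.
    Helper name is hypothetical; field names follow the schema above."""
    if isinstance(value, (bytes, bytearray)):
        # Plain byte arrays default to inline representation.
        return {"storage_type": "inline", "data": bytes(value), "reference": None}
    if isinstance(value, dict) and {"storage_type", "data", "reference"} <= value.keys():
        # Already a Blob-shaped record; pass through unchanged.
        return value
    raise TypeError(f"unsupported blob input: {type(value).__name__}")

row = normalize_blob_input(b"\xffjpeg-bytes")
print(row["storage_type"])  # inline
```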
+#### Phase 3: Dynamic Inline/Out-of-Line Storage
+In this phase, the writer will be updated to support dynamic
inline/out-of-line storage based on user configured size thresholds. The user
will still be able to provide their own external path for out-of-line storage
if desired.
+When the user provides blob data in the form of byte arrays, the writer will
take arrays larger than the configured threshold and write them to files. The
user can configure the file type used for this storage (e.g. Parquet, HFile, Lance, etc.).
+The writer will then generate the appropriate BlobReference for the
out-of-line storage and write that to the main file.
+Multiple blobs can be written to the same file to reduce the number of files
created in storage. All of these blobs will belong to the same file group for
ease of management.
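The dynamic decision could be sketched as follows, assuming the thresholds suggested in the requirements (1 MB inline, mid-sized blobs packed into a shared container file); class and function names are hypothetical and the container here only buffers in memory:

```python
INLINE_THRESHOLD = 1 * 1024 * 1024          # inline vs out-of-line cutoff
CONTAINER_ELEMENT_MAX = 16 * 1024 * 1024    # largest blob allowed in a container

class ContainerFile:
    """Packs several mid-sized blobs into one file; returns (offset, length)."""
    def __init__(self, path):
        self.path = path
        self.buf = bytearray()
    def append(self, data):
        offset = len(self.buf)
        self.buf.extend(data)
        return offset, len(data)

def store_blob(data, container):
    if len(data) <= INLINE_THRESHOLD:
        return {"storage_type": "inline", "data": data, "reference": None}
    if len(data) <= CONTAINER_ELEMENT_MAX:
        offset, length = container.append(data)
        path = container.path
    else:
        # Oversized blobs get a dedicated file (actual write elided here).
        path, offset, length = container.path + "-solo", 0, len(data)
    return {"storage_type": "out_of_line", "data": None,
            "reference": {"external_path": path, "offset": offset,
                          "length": length, "managed": True}}

c = ContainerFile("blobs-0001.bin")
a = store_blob(b"x" * 100, c)                # inline
b = store_blob(b"y" * (2 * 1024 * 1024), c)  # packed into the container
```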
-**Storage Pointer Schema**:
-```json
-{
- "type": "record",
- "name": "BlobPointer",
- "fields": [
- {"name": "storage_type", "type": "string"},
- {"name": "size", "type": "long"},
- {"name": "checksum", "type": "string"},
- {"name": "external_path", "type": ["null", "string"]},
- {"name": "compression", "type": ["null", "string"]}
- ]
-}
-```
+**Configurations**:
+- `hoodie.storage.blob.inline.threshold`: Size threshold in bytes for inline
vs out-of-line storage.
+- `hoodie.storage.blob.outofline.container.maxElementSize`: Size threshold in
bytes for blobs that can be stored within a container file. Blobs larger than
this threshold will be stored in their own individual files.
+- `hoodie.storage.blob.outofline.container.maxFileSize`: Size threshold in
bytes for maximum size of an out-of-line blob container file.
+- `hoodie.storage.blob.outofline.format`: File format to use for out-of-line
blob storage.
**External Storage Layout**:
```
-{table_path}/.hoodie/blobs/{partition}/{file_group_id}/{column_group}/{instant}/{blob_id}
+{table_path}/.hoodie/blobs/{partition}/{column_name}/{instant}/{blob_id}
```
-Alternatively, User should be able to specify external storage location per
BLOB during writes, as needed.
-
-### 3. Parquet Optimization for BLOB Storage
+#### Writer Optimizations for Blob Storage
+##### Parquet
For unstructured column groups using Parquet:
- **Disable Compression**: Avoid double compression of already compressed
media files
- **Plain Encoding**: Use PLAIN encoding instead of dictionary encoding for
BLOB columns
- **Large Page Sizes**: Configure larger page sizes to optimize for sequential
BLOB access
- **Metadata Index**: Maintain BLOB metadata in Hudi metadata table for
efficient retrieval of a single blob value.
- **Disable stats**: Not very useful for BLOB columns
-
-### 4. Lance Format Integration
-
-**Lance Advantages for Unstructured Data**:
+##### Lance
- Native support for high-dimensional vectors and embeddings
- Efficient columnar storage for mixed structured/unstructured data
- Better compression for certain unstructured data types
-Supporting Lance, working across Hudi + Lance communities will help users
unlock benefits of both currently supported
-file formats in Hudi (parquet, orc), along with benefits of Lance. Over time,
we could also incorporate newer emerging
+Supporting Lance, working across Hudi + Lance communities will help users
unlock benefits of both currently supported
+file formats in Hudi (parquet, orc), along with benefits of Lance. Over time,
we could also incorporate newer emerging
file formats in the space and other well-established unstructured file formats.
-### 5. Enhanced Table Services
-
-**Cleaning Service Extensions**:
-- Track external BLOB references in metadata table
-- Implement cascading deletion of external BLOB files during cleaning
-- Add BLOB-specific retention policies, using reference counting to reclaim
out-of-line blobs.
-
-**Compaction Service Extensions**:
-- Support cross-format compaction (merge Lance and Parquet column groups)
-- Implement BLOB deduplication during major compaction
-- Optimize external BLOB consolidation
-
-**Clustering Service Extensions**:
-- Enable redistribution of BLOB data across file groups
-- Support column group reconfiguration during clustering
-- Implement BLOB-aware data skipping strategies
-
-### 6. Flexible Column Group Management
-
-**Dynamic Column Group Creation**:
-```java
-// Writer API extensions
-HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
- .withColumnGroupStrategy(ADAPTIVE) // AUTO, FIXED, ADAPTIVE
- .withNewColumnGroupThreshold(100_000_000L) // 100MB
- .withBlobStorageThreshold(1_048_576L) // 1MB
- .build();
-```
-**Column Group Reconfiguration**:
-- Support splitting existing column groups when they grow too large
-- Enable merging small column groups during maintenance operations
-- Allow migration of columns between column groups
+### 4. Table Services
+#### Cleaning
+The cleaning service will be updated to identify the out-of-line blob
references that are managed by Hudi and no longer referenced by any active file
slices.
+To identify these references, we have three options:
+1. Scan all active file slices to build a set of referenced blob IDs and then
scan the file slices being removed to identify references in the removed slices
that are not in the active set.
+2. Maintain metadata on the blob references contained in the file in the
footer or metadata section of each base and log file. The cleaner can then read
this metadata to identify blob references in the removed slices and check if
they are still referenced in active slices.
+3. Maintain an index in the metadata table that tracks all blob references and
their reference counts. The cleaner can then use this index to identify
unreferenced blobs.
-### 7. Query Engine Integration
+**Note**: This is only required for out-of-line blobs that are managed by
Hudi. Out-of-line blobs that are not managed by Hudi will not be deleted by the
cleaner. This will be part of `Phase 2` of the writer implementation.
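
Option 1 above amounts to a set difference over blob references. A minimal sketch, where each slice is modeled as a plain list of referenced blob paths (a hypothetical stand-in for extracting `BlobReference` entries from base and log files):

```python
def find_unreferenced_blobs(active_slices, removed_slices):
    """Option 1: full scan. Collect references from active and removed
    slices, then keep only the references that appear exclusively in
    removed slices. The cleaner would additionally check the `managed`
    flag before deleting anything."""
    active = {ref for s in active_slices for ref in s}
    removed = {ref for s in removed_slices for ref in s}
    return removed - active

stale = find_unreferenced_blobs(
    active_slices=[["blobs/b.bin"]],
    removed_slices=[["blobs/a.bin", "blobs/b.bin"]],
)
# "b.bin" is still referenced by an active slice, so only "a.bin" is stale.
```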
+
+#### Blob Compaction
Review Comment:
If so, lets add a [WIP]. on the header.. so we know and can land this PR.
##########
rfc/rfc-100/rfc-100.md:
##########
@@ -63,153 +63,151 @@ column that already exists.
### Building on Existing Foundation
-This RFC leverages two key foundation pieces:
+This RFC leverages foundation pieces:
-1. **RFC-80 Column Groups**: Provides the mechanism to split file groups
across different column groups, enabling efficient storage of different data
types within the same logical file group.
-
-2. **RFC-99 BLOB Types**: Introduces BINARY and LARGE_BINARY types to the Hudi
type system, providing the type foundation for unstructured data storage.
+**RFC-99 BLOB Types**: Introduces BLOB types to the Hudi type system,
providing the type foundation for unstructured data storage.
## Requirements
Below are the high-level requirements for this feature.
1. Users must be able to define tables with a mix of structured (current
types) and unstructured (blob type)
columns
-2. Records are distributed across file groups like regular Hudi storage layout
into file groups. But within each
- file group, structured and unstructured columns are split into different
column groups. This way the table can
- also scalably grow in terms of number of columns.
-3. Unstructured data can be stored inline (e.g small images right inside the
column group) or out-of-line (e.g
- pointer to a multi-GB video file someplace). This decision should be made
dynamically during write/storage time.
+2. Unstructured data can be stored inline (e.g small images right inside the
file) or out-of-line (e.g
+ pointer to a multi-GB video file someplace). Out-of-line references can
also include a (position, length) within the file which
+ allows multiple blobs to be stored within a single file to reduce the
number of files in storage.
+ This decision should be made dynamically during write/storage time.
3. All table life-cycle operations and table services work seamlessly across both column types, e.g. cleaning
- the file slices should reclaim both inline and out-of-line blob data.
Clustering should be able re-organize
- records across file groups or even redistribute columns across column
groups within the same file group.
-4. Storage should support different column group distributions i.e different
membership of columns
- across column groups, across file groups, to ensure users or table services
can flexibly reconfigure all this as
- table grows, without re-writing all of the data.
-5. Hudi should expose controls at the writer level, to control whether new
columns are written to new column
- groups or expand an existing column group within a file group.
+ the file slices should reclaim out-of-line blob data when the referred blob
is managed by Hudi.
+   Clustering should be able to reorganize records across file groups or even repack blobs if required.
+4. Hudi should expose controls at the writer level, to control whether to store blobs inline or out-of-line
+   based on size thresholds. Sane defaults should be supported for an easy out-of-the-box experience, e.g. blobs < 1MB are stored inline and blobs > 16MB are always stored out-of-line.
+5. Query engines like Spark should be able to read the unstructured data and
materialize the values lazily to reduce memory pressure and massive data
exchange volumes during shuffles.
## High-Level Design
-The design introduces a hybrid storage model where each file group can contain
multiple column groups with different file formats optimized for their data
types. Structured columns continue using
-Parquet format, while unstructured columns can use specialized formats like
Lance or optimized Parquet configurations or HFile for random-access.
+The design introduces an abstraction that allows inline and out-of-line
storage of byte arrays that work seamlessly for the end user. Structured
columns continue using
+Parquet format, while unstructured data can use specialized formats like Lance
or optimized Parquet configurations or HFile for random-access.
-### 1. Mixed Base File Format Support
+### 1. Storage Abstraction
+We will add a `blob` type to the HoodieSchema that encapsulates both inline
and out-of-line storage strategies. This will allow the user to use a mix of
storage strategies seamlessly.
-**Per-Column Group Format Selection**: Each column group within a file group
can use different base file formats:
-- **Structured Column Groups**: Continue using Parquet with standard
optimizations
-- **Unstructured Column Groups**: Use Lance format for vector/embedding data
or specially configured Parquet for BLOB storage
+**Storage Schema**:
+```json
+{
+ "type": "record",
+ "name": "Blob",
+ "fields": [
+ {"name": "storage_type", "type": "string"},
Review Comment:
int/enum? to preserve space.?
##########
rfc/rfc-100/rfc-100.md:
##########
@@ -63,153 +63,129 @@ column that already exists.
### Building on Existing Foundation
-This RFC leverages two key foundation pieces:
+This RFC leverages foundation pieces:
-1. **RFC-80 Column Groups**: Provides the mechanism to split file groups
across different column groups, enabling efficient storage of different data
types within the same logical file group.
-
-2. **RFC-99 BLOB Types**: Introduces BINARY and LARGE_BINARY types to the Hudi
type system, providing the type foundation for unstructured data storage.
+**RFC-99 BLOB Types**: Introduces BLOB types to the Hudi type system, providing the type foundation for unstructured data storage.
## Requirements
Below are the high-level requirements for this feature.
1. Users must be able to define tables with a mix of structured (current
types) and unstructured (blob type)
columns
-2. Records are distributed across file groups like regular Hudi storage layout
into file groups. But within each
- file group, structured and unstructured columns are split into different
column groups. This way the table can
- also scalably grow in terms of number of columns.
-3. Unstructured data can be stored inline (e.g small images right inside the
column group) or out-of-line (e.g
- pointer to a multi-GB video file someplace). This decision should be made
dynamically during write/storage time.
+2. Unstructured data can be stored inline (e.g small images right inside the
file) or out-of-line (e.g
+ pointer to a multi-GB video file someplace). Out-of-line references can
also include a position within the file which
+ allows multiple blobs to be stored within a single file to reduce the
number of files in storage.
+ This decision should be made dynamically during write/storage time.
3. All table life-cycle operations and table services work seamlessly across both column types, e.g. cleaning
- the file slices should reclaim both inline and out-of-line blob data.
Clustering should be able re-organize
- records across file groups or even redistribute columns across column
groups within the same file group.
-4. Storage should support different column group distributions i.e different
membership of columns
- across column groups, across file groups, to ensure users or table services
can flexibly reconfigure all this as
- table grows, without re-writing all of the data.
-5. Hudi should expose controls at the writer level, to control whether new
columns are written to new column
- groups or expand an existing column group within a file group.
+ the file slices should reclaim out-of-line blob data when the reference is
managed by Hudi.
+   Clustering should be able to reorganize records across file groups or even repack blobs if required.
+4. Hudi should expose controls at the writer level, to control whether to
store blobs inline or out-of-line
+ based on size thresholds.
+5. Query engines like Spark should be able to read the unstructured data and
materialize the values lazily to reduce memory pressure during shuffles.
## High-Level Design
-The design introduces a hybrid storage model where each file group can contain
multiple column groups with different file formats optimized for their data
types. Structured columns continue using
-Parquet format, while unstructured columns can use specialized formats like
Lance or optimized Parquet configurations or HFile for random-access.
-
-### 1. Mixed Base File Format Support
-
-**Per-Column Group Format Selection**: Each column group within a file group
can use different base file formats:
-- **Structured Column Groups**: Continue using Parquet with standard
optimizations
-- **Unstructured Column Groups**: Use Lance format for vector/embedding data
or specially configured Parquet for BLOB storage
-
-**Format Configuration**: File format is determined at column group creation
time based on (per the current RFC-80).
-But, ideally all these configurations should be automatic and Hudi should
auto-generate colum group names and mappings.
-
-
-```sql
-CREATE TABLE multimedia_catalog (
- id BIGINT,
- product_name STRING,
- category STRING,
- image BINARY,
- video LARGE_BINARY,
- embeddings ARRAY<FLOAT>
-) USING HUDI
-TBLPROPERTIES (
- 'hoodie.table.type' = 'MERGE_ON_READ',
- 'hoodie.bucket.index.hash.field' = 'id',
- 'hoodie.columngroup.structured' = 'id,product_name,category;id',
- 'hoodie.columngroup.images' = 'id,image;id',
- 'hoodie.columngroup.videos' = 'id,video;id',
- 'hoodie.columngroup.ml' = 'id,embeddings;id',
- 'hoodie.columngroup.images.format' = 'parquet',
- 'hoodie.columngroup.videos.format' = 'lance',
- 'hoodie.columngroup.ml.format' = 'hfile'
-)
-```
-
-### 2. Dynamic Inline/Out-of-Line Storage
+The design introduces an abstraction that allows inline and out-of-line
storage of byte arrays that work seamlessly for the end user. Structured
columns continue using
+Parquet format, while unstructured data can use specialized formats like Lance
or optimized Parquet configurations or HFile for random-access.
-**Storage Decision Logic**: During write time, Hudi determines storage
strategy based on:
-- **Inline Storage**: BLOB data < 1MB stored directly in the column group
file, to avoid excessive cloud storage API calls.
-- **Out-of-Line Storage**: Large BLOB data stored in dedicated object store
locations with pointers in the main file, to avoid write amplification during
updates.
+### 1. Storage Abstraction
+We will add a Blob type to the HoodieSchema that encapsulates both inline and
out-of-line storage strategies. This will allow the user to use a mix of
storage strategies seamlessly.
-
-**Storage Pointer Schema**:
+**Storage Schema**:
```json
{
"type": "record",
- "name": "BlobPointer",
+ "name": "Blob",
"fields": [
{"name": "storage_type", "type": "string"},
- {"name": "size", "type": "long"},
- {"name": "checksum", "type": "string"},
- {"name": "external_path", "type": ["null", "string"]},
- {"name": "compression", "type": ["null", "string"]}
+ {"name": "data", "type": ["null", "bytes"]},
+ {"name": "reference", "type": ["null", {
+ "type": "record",
+ "name": "BlobReference",
+ "fields": [
+ {"name": "external_path", "type": "string"},
+ {"name": "position", "type": "long"},
+ {"name": "size", "type": "long"},
+ {"name": "managed", "type": "boolean"}
+ ]
+ }]}
]
}
```
+The `managed` flag will be used by the cleaner to determine if an out-of-line
blob should be deleted when cleaning up old file slices. This allows users to
point to existing files without Hudi deleting them.
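
To make the two shapes concrete, here is a sketch that builds records matching the schema above as plain Python dicts. The helper names and the `storage_type` values are illustrative assumptions, not part of the RFC.

```python
def inline_blob(data: bytes) -> dict:
    # Small payload carried directly in the column value.
    return {"storage_type": "inline", "data": data, "reference": None}

def out_of_line_blob(external_path: str, position: int, size: int,
                     managed: bool) -> dict:
    # Pointer into an external file; `managed` tells the cleaner whether
    # Hudi owns (and may eventually delete) the referenced bytes.
    return {
        "storage_type": "out_of_line",
        "data": None,
        "reference": {
            "external_path": external_path,
            "position": position,
            "size": size,
            "managed": managed,
        },
    }

thumb = inline_blob(b"\x89PNG small thumbnail")
video = out_of_line_blob("s3://bucket/videos/v1.mp4", position=0,
                         size=3_000_000_000, managed=False)
```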
+
+### 2. Reader
+Readers will be updated to allow for lazy loading of the blob data, even when
it is inline. This will help reduce memory pressure during shuffles in
distributed query engines like Spark.
+The readers will return a reference to the blob data in the form of a path,
position, and size. This applies for both inline and out-of-line storage.
+
+The user will then use a UserDefinedFunction (UDF), or a similar abstraction depending on the engine, to read the blob data from the reference when needed.
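
A minimal sketch of such a materialization function, assuming local file access for simplicity (a real UDF would resolve the path through the table's storage scheme):

```python
import os
import tempfile

def materialize(path: str, position: int, size: int) -> bytes:
    # Resolve a (path, position, size) reference into the blob bytes.
    # Invoked lazily, only when the value is actually needed.
    with open(path, "rb") as f:
        f.seek(position)
        return f.read(size)

# Usage: two blobs packed into one file; materialize only the second.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"firstblob" + b"hello")
payload = materialize(path, position=9, size=5)
os.remove(path)
```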
+
+### 3. Writer
+#### Phase 1: Basic Blob Support
+The writer will be updated to support writing blob data in both inline and
out-of-line formats.
+For out-of-line storage, the assumption is that the user will provide the
external path, position, and size of the blob data and these references will
not be managed by Hudi (they are not removed by the cleaner).
+In this phase, we will not implement dynamic inline/out-of-line storage based
on size thresholds.
+
+The writer should be flexible enough to allow the user to pass in a dataset
with blob data as simple byte arrays or records matching the Blob schema
defined above.
+
+#### Phase 2: Dynamic Inline/Out-of-Line Storage
+In this phase, the writer will be updated to support dynamic
inline/out-of-line storage based on user configured size thresholds. The user
will still be able to provide their own external path for out-of-line storage
if desired.
+When the user provides blob data in the form of byte arrays, the writer will
take arrays larger than the configured threshold and write them to files. The
user can configure the file type used for this storage (e.g Parquet, HFile,
Lance, etc).
+The writer will then generate the appropriate BlobReference for the
out-of-line storage and write that to the main file.
+Multiple blobs can be written to the same file to reduce the number of files
created in storage. All of these blobs will belong to the same file group for
ease of management.
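
The packing step described above can be sketched as follows. The emitted dicts mirror the `BlobReference` record from the storage schema; the simple append-only packing policy is an assumption for illustration.

```python
import os
import tempfile

def pack_blobs(blobs, out_path, managed=True):
    """Append each blob to one shared file and emit a BlobReference per
    blob, so many blobs cost a single file in storage."""
    refs, pos = [], 0
    with open(out_path, "wb") as f:
        for b in blobs:
            f.write(b)
            refs.append({
                "external_path": out_path,
                "position": pos,
                "size": len(b),
                "managed": managed,  # written by Hudi, so cleaner-managed
            })
            pos += len(b)
    return refs

out_path = os.path.join(tempfile.mkdtemp(), "packed.blob")
refs = pack_blobs([b"abc", b"defgh"], out_path)
```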
+
+**Configurations**:
+- `hoodie.storage.blob.inline.threshold`: Size threshold in bytes for inline vs out-of-line storage
+- `hoodie.storage.blob.outofline.packing.threshold`: Size threshold in bytes
for blobs that can be packed together in a single out-of-line file
+- `hoodie.storage.blob.outofline.packing.maxFileSize`: Size threshold in bytes
for maximum size of an out-of-line blob file
+- `hoodie.storage.blob.outofline.format`: File format to use for out-of-line
blob storage
**External Storage Layout**:
```
-{table_path}/.hoodie/blobs/{partition}/{file_group_id}/{column_group}/{instant}/{blob_id}
+{table_path}/.hoodie/blobs/{partition}/{file_group_id}/{column_name}/{instant}/{blob_id}
```
-Alternatively, User should be able to specify external storage location per
BLOB during writes, as needed.
-
-### 3. Parquet Optimization for BLOB Storage
+#### Writer Optimizations for Blob Storage
+##### Parquet
For unstructured column groups using Parquet:
- **Disable Compression**: Avoid double compression of already compressed
media files
- **Plain Encoding**: Use PLAIN encoding instead of dictionary encoding for
BLOB columns
- **Large Page Sizes**: Configure larger page sizes to optimize for sequential
BLOB access
- **Metadata Index**: Maintain BLOB metadata in Hudi metadata table for
efficient retrieval of a single blob value.
- **Disable stats**: Not very useful for BLOB columns
-
-### 4. Lance Format Integration
-
-**Lance Advantages for Unstructured Data**:
+##### Lance
- Native support for high-dimensional vectors and embeddings
- Efficient columnar storage for mixed structured/unstructured data
- Better compression for certain unstructured data types
-Supporting Lance, working across Hudi + Lance communities will help users
unlock benefits of both currently supported
-file formats in Hudi (parquet, orc), along with benefits of Lance. Over time,
we could also incorporate newer emerging
+Supporting Lance, working across Hudi + Lance communities will help users
unlock benefits of both currently supported
+file formats in Hudi (parquet, orc), along with benefits of Lance. Over time,
we could also incorporate newer emerging
file formats in the space and other well-established unstructured file formats.
-### 5. Enhanced Table Services
-
-**Cleaning Service Extensions**:
-- Track external BLOB references in metadata table
-- Implement cascading deletion of external BLOB files during cleaning
-- Add BLOB-specific retention policies, using reference counting to reclaim
out-of-line blobs.
-
-**Compaction Service Extensions**:
-- Support cross-format compaction (merge Lance and Parquet column groups)
-- Implement BLOB deduplication during major compaction
-- Optimize external BLOB consolidation
-
-**Clustering Service Extensions**:
-- Enable redistribution of BLOB data across file groups
-- Support column group reconfiguration during clustering
-- Implement BLOB-aware data skipping strategies
-
-### 6. Flexible Column Group Management
-
-**Dynamic Column Group Creation**:
-```java
-// Writer API extensions
-HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
- .withColumnGroupStrategy(ADAPTIVE) // AUTO, FIXED, ADAPTIVE
- .withNewColumnGroupThreshold(100_000_000L) // 100MB
- .withBlobStorageThreshold(1_048_576L) // 1MB
- .build();
-```
-**Column Group Reconfiguration**:
-- Support splitting existing column groups when they grow too large
-- Enable merging small column groups during maintenance operations
-- Allow migration of columns between column groups
+### 4. Table Services
+#### Cleaning
+The cleaning service will be updated to identify the out-of-line blob
references that are managed by Hudi and no longer referenced by any active file
slices.
+To identify these references, we have two options:
+1. Scan all active file slices to build a set of referenced blob IDs and then
scan the file slices being removed to identify references in the removed slices
that are not in the active set.
+2. Maintain metadata on the blob references contained in the file in the
footer or metadata section of each base and log file. The cleaner can then read
this metadata to identify blob references in the removed slices and check if
they are still referenced in active slices.
Review Comment:
+1 on starting with option 1. it baselines us
##########
rfc/rfc-100/rfc-100.md:
##########
@@ -63,153 +63,129 @@ column that already exists.
### Building on Existing Foundation
-This RFC leverages two key foundation pieces:
+This RFC leverages foundation pieces:
-1. **RFC-80 Column Groups**: Provides the mechanism to split file groups
across different column groups, enabling efficient storage of different data
types within the same logical file group.
-
-2. **RFC-99 BLOB Types**: Introduces BINARY and LARGE_BINARY types to the Hudi
type system, providing the type foundation for unstructured data storage.
+**RFC-99 BLOB Types**: Introduces BLOB types to the Hudi type system, providing the type foundation for unstructured data storage.
## Requirements
Below are the high-level requirements for this feature.
1. Users must be able to define tables with a mix of structured (current
types) and unstructured (blob type)
columns
-2. Records are distributed across file groups like regular Hudi storage layout
into file groups. But within each
- file group, structured and unstructured columns are split into different
column groups. This way the table can
- also scalably grow in terms of number of columns.
-3. Unstructured data can be stored inline (e.g small images right inside the
column group) or out-of-line (e.g
- pointer to a multi-GB video file someplace). This decision should be made
dynamically during write/storage time.
+2. Unstructured data can be stored inline (e.g small images right inside the
file) or out-of-line (e.g
+ pointer to a multi-GB video file someplace). Out-of-line references can
also include a position within the file which
+ allows multiple blobs to be stored within a single file to reduce the
number of files in storage.
+ This decision should be made dynamically during write/storage time.
3. All table life-cycle operations and table services work seamlessly across both column types, e.g. cleaning
- the file slices should reclaim both inline and out-of-line blob data.
Clustering should be able re-organize
- records across file groups or even redistribute columns across column
groups within the same file group.
-4. Storage should support different column group distributions i.e different
membership of columns
- across column groups, across file groups, to ensure users or table services
can flexibly reconfigure all this as
- table grows, without re-writing all of the data.
-5. Hudi should expose controls at the writer level, to control whether new
columns are written to new column
- groups or expand an existing column group within a file group.
+ the file slices should reclaim out-of-line blob data when the reference is
managed by Hudi.
+   Clustering should be able to reorganize records across file groups or even repack blobs if required.
+4. Hudi should expose controls at the writer level, to control whether to
store blobs inline or out-of-line
+ based on size thresholds.
+5. Query engines like Spark should be able to read the unstructured data and
materialize the values lazily to reduce memory pressure during shuffles.
## High-Level Design
-The design introduces a hybrid storage model where each file group can contain
multiple column groups with different file formats optimized for their data
types. Structured columns continue using
-Parquet format, while unstructured columns can use specialized formats like
Lance or optimized Parquet configurations or HFile for random-access.
-
-### 1. Mixed Base File Format Support
-
-**Per-Column Group Format Selection**: Each column group within a file group
can use different base file formats:
-- **Structured Column Groups**: Continue using Parquet with standard
optimizations
-- **Unstructured Column Groups**: Use Lance format for vector/embedding data
or specially configured Parquet for BLOB storage
-
-**Format Configuration**: File format is determined at column group creation
time based on (per the current RFC-80).
-But, ideally all these configurations should be automatic and Hudi should
auto-generate colum group names and mappings.
-
-
-```sql
-CREATE TABLE multimedia_catalog (
- id BIGINT,
- product_name STRING,
- category STRING,
- image BINARY,
- video LARGE_BINARY,
- embeddings ARRAY<FLOAT>
-) USING HUDI
-TBLPROPERTIES (
- 'hoodie.table.type' = 'MERGE_ON_READ',
- 'hoodie.bucket.index.hash.field' = 'id',
- 'hoodie.columngroup.structured' = 'id,product_name,category;id',
- 'hoodie.columngroup.images' = 'id,image;id',
- 'hoodie.columngroup.videos' = 'id,video;id',
- 'hoodie.columngroup.ml' = 'id,embeddings;id',
- 'hoodie.columngroup.images.format' = 'parquet',
- 'hoodie.columngroup.videos.format' = 'lance',
- 'hoodie.columngroup.ml.format' = 'hfile'
-)
-```
-
-### 2. Dynamic Inline/Out-of-Line Storage
+The design introduces an abstraction that allows inline and out-of-line
storage of byte arrays that work seamlessly for the end user. Structured
columns continue using
+Parquet format, while unstructured data can use specialized formats like Lance
or optimized Parquet configurations or HFile for random-access.
-**Storage Decision Logic**: During write time, Hudi determines storage
strategy based on:
-- **Inline Storage**: BLOB data < 1MB stored directly in the column group
file, to avoid excessive cloud storage API calls.
-- **Out-of-Line Storage**: Large BLOB data stored in dedicated object store
locations with pointers in the main file, to avoid write amplification during
updates.
+### 1. Storage Abstraction
+We will add a Blob type to the HoodieSchema that encapsulates both inline and
out-of-line storage strategies. This will allow the user to use a mix of
storage strategies seamlessly.
-
-**Storage Pointer Schema**:
+**Storage Schema**:
```json
{
"type": "record",
- "name": "BlobPointer",
+ "name": "Blob",
"fields": [
{"name": "storage_type", "type": "string"},
- {"name": "size", "type": "long"},
- {"name": "checksum", "type": "string"},
- {"name": "external_path", "type": ["null", "string"]},
- {"name": "compression", "type": ["null", "string"]}
+ {"name": "data", "type": ["null", "bytes"]},
+ {"name": "reference", "type": ["null", {
+ "type": "record",
+ "name": "BlobReference",
+ "fields": [
+ {"name": "external_path", "type": "string"},
+ {"name": "position", "type": "long"},
+ {"name": "size", "type": "long"},
+ {"name": "managed", "type": "boolean"}
+ ]
+ }]}
]
}
```
+The `managed` flag will be used by the cleaner to determine if an out-of-line
blob should be deleted when cleaning up old file slices. This allows users to
point to existing files without Hudi deleting them.
+
+### 2. Reader
+Readers will be updated to allow for lazy loading of the blob data, even when
it is inline. This will help reduce memory pressure during shuffles in
distributed query engines like Spark.
+The readers will return a reference to the blob data in the form of a path,
position, and size. This applies for both inline and out-of-line storage.
+
+The user will then use a UserDefinedFunction (UDF), or a similar abstraction depending on the engine, to read the blob data from the reference when needed.
+
+### 3. Writer
+#### Phase 1: Basic Blob Support
+The writer will be updated to support writing blob data in both inline and
out-of-line formats.
+For out-of-line storage, the assumption is that the user will provide the
external path, position, and size of the blob data and these references will
not be managed by Hudi (they are not removed by the cleaner).
+In this phase, we will not implement dynamic inline/out-of-line storage based
on size thresholds.
+
+The writer should be flexible enough to allow the user to pass in a dataset
with blob data as simple byte arrays or records matching the Blob schema
defined above.
+
+#### Phase 2: Dynamic Inline/Out-of-Line Storage
+In this phase, the writer will be updated to support dynamic
inline/out-of-line storage based on user configured size thresholds. The user
will still be able to provide their own external path for out-of-line storage
if desired.
+When the user provides blob data in the form of byte arrays, the writer will
take arrays larger than the configured threshold and write them to files. The
user can configure the file type used for this storage (e.g Parquet, HFile,
Lance, etc).
+The writer will then generate the appropriate BlobReference for the
out-of-line storage and write that to the main file.
+Multiple blobs can be written to the same file to reduce the number of files
created in storage. All of these blobs will belong to the same file group for
ease of management.
+
+**Configurations**:
+- `hoodie.storage.blob.inline.threshold`: Size threshold in bytes for inline vs out-of-line storage
+- `hoodie.storage.blob.outofline.packing.threshold`: Size threshold in bytes
for blobs that can be packed together in a single out-of-line file
+- `hoodie.storage.blob.outofline.packing.maxFileSize`: Size threshold in bytes
for maximum size of an out-of-line blob file
+- `hoodie.storage.blob.outofline.format`: File format to use for out-of-line
blob storage
**External Storage Layout**:
```
-{table_path}/.hoodie/blobs/{partition}/{file_group_id}/{column_group}/{instant}/{blob_id}
+{table_path}/.hoodie/blobs/{partition}/{file_group_id}/{column_name}/{instant}/{blob_id}
```
-Alternatively, User should be able to specify external storage location per
BLOB during writes, as needed.
-
-### 3. Parquet Optimization for BLOB Storage
+#### Writer Optimizations for Blob Storage
+##### Parquet
For unstructured column groups using Parquet:
Review Comment:
anything here?
##########
rfc/rfc-100/rfc-100.md:
##########
@@ -222,13 +220,31 @@ HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
- Efficient BLOB streaming for distributed ML workloads
- Integration with Ray's object store for large BLOB caching
-### 8. Metadata Table Extensions
+### 6. Metadata Table Extensions
- Track BLOB references for garbage collection
- Maintain indexes for Parquet-based blob storage
- Maintain size statistics for storage optimization
- Support BLOB-based query optimization
+## Development Plan
+
+#### Milestone 1: External Blob Support
Review Comment:
I will look at the PR. the mapPartitions() approach sg. I did not intend per
record processing. that will be bad. I agree.
##########
rfc/rfc-100/rfc-100.md:
##########
@@ -63,153 +63,151 @@ column that already exists.
### Building on Existing Foundation
-This RFC leverages two key foundation pieces:
+This RFC leverages foundation pieces:
-1. **RFC-80 Column Groups**: Provides the mechanism to split file groups
across different column groups, enabling efficient storage of different data
types within the same logical file group.
-
-2. **RFC-99 BLOB Types**: Introduces BINARY and LARGE_BINARY types to the Hudi
type system, providing the type foundation for unstructured data storage.
+**RFC-99 BLOB Types**: Introduces BLOB types to the Hudi type system,
providing the type foundation for unstructured data storage.
## Requirements
Below are the high-level requirements for this feature.
1. Users must be able to define tables with a mix of structured (current
types) and unstructured (blob type)
columns
-2. Records are distributed across file groups like regular Hudi storage layout
into file groups. But within each
- file group, structured and unstructured columns are split into different
column groups. This way the table can
- also scalably grow in terms of number of columns.
-3. Unstructured data can be stored inline (e.g small images right inside the
column group) or out-of-line (e.g
- pointer to a multi-GB video file someplace). This decision should be made
dynamically during write/storage time.
+2. Unstructured data can be stored inline (e.g small images right inside the
file) or out-of-line (e.g
+ pointer to a multi-GB video file someplace). Out-of-line references can
also include a (position, length) within the file which
+ allows multiple blobs to be stored within a single file to reduce the
number of files in storage.
+ This decision should be made dynamically during write/storage time.
3. All table life-cycle operations and table services work seamlessly across both column types, e.g. cleaning
- the file slices should reclaim both inline and out-of-line blob data.
Clustering should be able re-organize
- records across file groups or even redistribute columns across column
groups within the same file group.
-4. Storage should support different column group distributions i.e different
membership of columns
- across column groups, across file groups, to ensure users or table services
can flexibly reconfigure all this as
- table grows, without re-writing all of the data.
-5. Hudi should expose controls at the writer level, to control whether new
columns are written to new column
- groups or expand an existing column group within a file group.
+ the file slices should reclaim out-of-line blob data when the referred blob
is managed by Hudi.
+   Clustering should be able to reorganize records across file groups or even repack blobs if required.
+4. Hudi should expose controls at the writer level, to control whether to
store blobs inline or out-of-line
+ based on size thresholds. Sane defaults should be supported for an easy
out-of-box experience, for e.g. blobs < 1MB are stored inline and blobs > 16MB
are always stored out-of-line.
+5. Query engines like Spark should be able to read the unstructured data and
materialize the values lazily to reduce memory pressure and massive data
exchange volumes during shuffles.
## High-Level Design
-The design introduces a hybrid storage model where each file group can contain
multiple column groups with different file formats optimized for their data
types. Structured columns continue using
-Parquet format, while unstructured columns can use specialized formats like
Lance or optimized Parquet configurations or HFile for random-access.
+The design introduces an abstraction that allows inline and out-of-line
storage of byte arrays that work seamlessly for the end user. Structured
columns continue using
+Parquet format, while unstructured data can use specialized formats like Lance
or optimized Parquet configurations or HFile for random-access.
-### 1. Mixed Base File Format Support
+### 1. Storage Abstraction
+We will add a `blob` type to the HoodieSchema that encapsulates both inline
and out-of-line storage strategies. This will allow the user to use a mix of
storage strategies seamlessly.
-**Per-Column Group Format Selection**: Each column group within a file group
can use different base file formats:
-- **Structured Column Groups**: Continue using Parquet with standard
optimizations
-- **Unstructured Column Groups**: Use Lance format for vector/embedding data
or specially configured Parquet for BLOB storage
+**Storage Schema**:
+```json
+{
+ "type": "record",
+ "name": "Blob",
+ "fields": [
+ {"name": "storage_type", "type": "string"},
+ {"name": "data", "type": ["null", "bytes"]},
+ {"name": "reference", "type": ["null", {
+ "type": "record",
+ "name": "BlobReference",
+ "fields": [
+ {"name": "external_path", "type": "string"},
+ {"name": "offset", "type": "long"},
+ {"name": "length", "type": "long"},
+ {"name": "managed", "type": "boolean"}
+ ]
+ }]}
+ ]
+}
+```
+The `managed` flag will be used by the cleaner to determine if an out-of-line
blob should be deleted when cleaning up old file slices. This allows users to
point to existing files without Hudi deleting them.
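The schema above can be exercised with plain records; the following sketch (illustrative only, not Hudi's API) shows inline vs out-of-line construction and the cleaner decision the `managed` flag enables:

```python
# Illustrative only (not Hudi's API): records shaped like the Blob schema
# above, plus the cleaner decision that the `managed` flag enables.

def inline_blob(data: bytes) -> dict:
    # Small payloads are embedded directly; no reference record is needed.
    return {"storage_type": "inline", "data": data, "reference": None}

def out_of_line_blob(path: str, offset: int, length: int, managed: bool) -> dict:
    # Large payloads become a (path, offset, length) reference.
    return {
        "storage_type": "out_of_line",
        "data": None,
        "reference": {
            "external_path": path,
            "offset": offset,
            "length": length,
            "managed": managed,
        },
    }

def cleaner_should_delete(blob: dict) -> bool:
    # Only Hudi-managed out-of-line files are reclaimed; user-provided
    # references are left untouched.
    ref = blob["reference"]
    return ref is not None and ref["managed"]

thumbnail = inline_blob(b"\x89PNG...")
video = out_of_line_blob("s3://bucket/videos/v1.bin", 0, 1 << 30, managed=True)
external = out_of_line_blob("s3://bucket/archive/old.bin", 0, 4096, managed=False)
```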
-**Format Configuration**: File format is determined at column group creation
time based on (per the current RFC-80).
-But, ideally all these configurations should be automatic and Hudi should
auto-generate colum group names and mappings.
+### 2. Reader
+Readers will be updated to allow for lazy loading of the blob data, even when
it is inline. This will help reduce memory pressure during shuffles in
distributed query engines like Spark.
+The readers will return a reference to the blob data in the form of a path,
position, and size. This applies for both inline and out-of-line storage.
+We will provide the user with a prebuilt transform to efficiently read the
out-of-line blob data. For example, for Spark datasets we will leverage a
mapPartitions transform to batch requests to read blob data when the rows
correspond to ranges within the same file.
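The batching idea is independent of Spark; a minimal sketch (hypothetical helper, with local file paths standing in for cloud storage):

```python
from collections import defaultdict

# Hypothetical helper sketching the batched read: group blob references by
# file so each file is opened once, then serve ranges in offset order.
def read_blob_batch(references):
    """references: iterable of (external_path, offset, length) tuples.
    Returns {(path, offset, length): bytes}."""
    by_file = defaultdict(list)
    for path, offset, length in references:
        by_file[path].append((offset, length))
    out = {}
    for path, ranges in by_file.items():
        with open(path, "rb") as f:
            for offset, length in sorted(ranges):  # sequential access per file
                f.seek(offset)
                out[(path, offset, length)] = f.read(length)
    return out
```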
+For Spark SQL we will provide a function that the user can leverage to
materialize the bytes from the blobs. Example syntax:
```sql
-CREATE TABLE multimedia_catalog (
- id BIGINT,
- product_name STRING,
- category STRING,
- image BINARY,
- video LARGE_BINARY,
- embeddings ARRAY<FLOAT>
-) USING HUDI
+SELECT id, url, resolve_blob(image_blob) as image_bytes FROM my_table;
Review Comment:
lets call this simply `read_blob()` or `read_blob_data()`
##########
rfc/rfc-100/rfc-100.md:
##########
@@ -63,153 +63,129 @@ column that already exists.
### Building on Existing Foundation
-This RFC leverages two key foundation pieces:
+This RFC leverages the following foundation piece:
-1. **RFC-80 Column Groups**: Provides the mechanism to split file groups
across different column groups, enabling efficient storage of different data
types within the same logical file group.
-
-2. **RFC-99 BLOB Types**: Introduces BINARY and LARGE_BINARY types to the Hudi
type system, providing the type foundation for unstructured data storage.
+**RFC-99 BLOB Types**: Introduces BLOB types to the Hudi type system,
providing the type foundation for unstructured data storage.
## Requirements
Below are the high-level requirements for this feature.
1. Users must be able to define tables with a mix of structured (current
types) and unstructured (blob type)
columns
-2. Records are distributed across file groups like regular Hudi storage layout
into file groups. But within each
- file group, structured and unstructured columns are split into different
column groups. This way the table can
- also scalably grow in terms of number of columns.
-3. Unstructured data can be stored inline (e.g small images right inside the
column group) or out-of-line (e.g
- pointer to a multi-GB video file someplace). This decision should be made
dynamically during write/storage time.
+2. Unstructured data can be stored inline (e.g small images right inside the
file) or out-of-line (e.g
+ pointer to a multi-GB video file someplace). Out-of-line references can
also include a position within the file which
+ allows multiple blobs to be stored within a single file to reduce the
number of files in storage.
+ This decision should be made dynamically during write/storage time.
3. All table life-cycle operations and table services work seamlessly across
both column types. For example, cleaning
- the file slices should reclaim both inline and out-of-line blob data.
Clustering should be able re-organize
- records across file groups or even redistribute columns across column
groups within the same file group.
-4. Storage should support different column group distributions i.e different
membership of columns
- across column groups, across file groups, to ensure users or table services
can flexibly reconfigure all this as
- table grows, without re-writing all of the data.
-5. Hudi should expose controls at the writer level, to control whether new
columns are written to new column
- groups or expand an existing column group within a file group.
+ the file slices should reclaim out-of-line blob data when the reference is
managed by Hudi.
+ Clustering should be able to re-organize records across file groups or even
repack blobs if required.
+4. Hudi should expose controls at the writer level, to control whether to
store blobs inline or out-of-line
+ based on size thresholds.
+5. Query engines like Spark should be able to read the unstructured data and
materialize the values lazily to reduce memory pressure during shuffles.
## High-Level Design
-The design introduces a hybrid storage model where each file group can contain
multiple column groups with different file formats optimized for their data
types. Structured columns continue using
-Parquet format, while unstructured columns can use specialized formats like
Lance or optimized Parquet configurations or HFile for random-access.
-
-### 1. Mixed Base File Format Support
-
-**Per-Column Group Format Selection**: Each column group within a file group
can use different base file formats:
-- **Structured Column Groups**: Continue using Parquet with standard
optimizations
-- **Unstructured Column Groups**: Use Lance format for vector/embedding data
or specially configured Parquet for BLOB storage
-
-**Format Configuration**: File format is determined at column group creation
time based on (per the current RFC-80).
-But, ideally all these configurations should be automatic and Hudi should
auto-generate colum group names and mappings.
-
-
-```sql
-CREATE TABLE multimedia_catalog (
- id BIGINT,
- product_name STRING,
- category STRING,
- image BINARY,
- video LARGE_BINARY,
- embeddings ARRAY<FLOAT>
-) USING HUDI
-TBLPROPERTIES (
- 'hoodie.table.type' = 'MERGE_ON_READ',
- 'hoodie.bucket.index.hash.field' = 'id',
- 'hoodie.columngroup.structured' = 'id,product_name,category;id',
- 'hoodie.columngroup.images' = 'id,image;id',
- 'hoodie.columngroup.videos' = 'id,video;id',
- 'hoodie.columngroup.ml' = 'id,embeddings;id',
- 'hoodie.columngroup.images.format' = 'parquet',
- 'hoodie.columngroup.videos.format' = 'lance',
- 'hoodie.columngroup.ml.format' = 'hfile'
-)
-```
-
-### 2. Dynamic Inline/Out-of-Line Storage
+The design introduces an abstraction that allows inline and out-of-line
storage of byte arrays that work seamlessly for the end user. Structured
columns continue using
+Parquet format, while unstructured data can use specialized formats like Lance
or optimized Parquet configurations or HFile for random-access.
-**Storage Decision Logic**: During write time, Hudi determines storage
strategy based on:
-- **Inline Storage**: BLOB data < 1MB stored directly in the column group
file, to avoid excessive cloud storage API calls.
-- **Out-of-Line Storage**: Large BLOB data stored in dedicated object store
locations with pointers in the main file, to avoid write amplification during
updates.
+### 1. Storage Abstraction
+We will add a Blob type to the HoodieSchema that encapsulates both inline and
out-of-line storage strategies. This will allow the user to use a mix of
storage strategies seamlessly.
-
-**Storage Pointer Schema**:
+**Storage Schema**:
```json
{
"type": "record",
- "name": "BlobPointer",
+ "name": "Blob",
"fields": [
{"name": "storage_type", "type": "string"},
- {"name": "size", "type": "long"},
- {"name": "checksum", "type": "string"},
- {"name": "external_path", "type": ["null", "string"]},
- {"name": "compression", "type": ["null", "string"]}
+ {"name": "data", "type": ["null", "bytes"]},
+ {"name": "reference", "type": ["null", {
+ "type": "record",
+ "name": "BlobReference",
+ "fields": [
+ {"name": "external_path", "type": "string"},
Review Comment:
fully qualified is okay for now. `blobBasePath` can be anywhere, not
necessarily sub folder.
##########
rfc/rfc-100/rfc-100.md:
##########
@@ -63,153 +63,129 @@ column that already exists.
### Building on Existing Foundation
-This RFC leverages two key foundation pieces:
+This RFC leverages the following foundation piece:
-1. **RFC-80 Column Groups**: Provides the mechanism to split file groups
across different column groups, enabling efficient storage of different data
types within the same logical file group.
-
-2. **RFC-99 BLOB Types**: Introduces BINARY and LARGE_BINARY types to the Hudi
type system, providing the type foundation for unstructured data storage.
+**RFC-99 BLOB Types**: Introduces BLOB types to the Hudi type system,
providing the type foundation for unstructured data storage.
## Requirements
Below are the high-level requirements for this feature.
1. Users must be able to define tables with a mix of structured (current
types) and unstructured (blob type)
columns
-2. Records are distributed across file groups like regular Hudi storage layout
into file groups. But within each
- file group, structured and unstructured columns are split into different
column groups. This way the table can
- also scalably grow in terms of number of columns.
-3. Unstructured data can be stored inline (e.g small images right inside the
column group) or out-of-line (e.g
- pointer to a multi-GB video file someplace). This decision should be made
dynamically during write/storage time.
+2. Unstructured data can be stored inline (e.g small images right inside the
file) or out-of-line (e.g
+ pointer to a multi-GB video file someplace). Out-of-line references can
also include a position within the file which
+ allows multiple blobs to be stored within a single file to reduce the
number of files in storage.
+ This decision should be made dynamically during write/storage time.
3. All table life-cycle operations and table services work seamlessly across
both column types. For example, cleaning
- the file slices should reclaim both inline and out-of-line blob data.
Clustering should be able re-organize
- records across file groups or even redistribute columns across column
groups within the same file group.
-4. Storage should support different column group distributions i.e different
membership of columns
- across column groups, across file groups, to ensure users or table services
can flexibly reconfigure all this as
- table grows, without re-writing all of the data.
-5. Hudi should expose controls at the writer level, to control whether new
columns are written to new column
- groups or expand an existing column group within a file group.
+ the file slices should reclaim out-of-line blob data when the reference is
managed by Hudi.
+ Clustering should be able to re-organize records across file groups or even
repack blobs if required.
+4. Hudi should expose controls at the writer level, to control whether to
store blobs inline or out-of-line
+ based on size thresholds.
+5. Query engines like Spark should be able to read the unstructured data and
materialize the values lazily to reduce memory pressure during shuffles.
## High-Level Design
-The design introduces a hybrid storage model where each file group can contain
multiple column groups with different file formats optimized for their data
types. Structured columns continue using
-Parquet format, while unstructured columns can use specialized formats like
Lance or optimized Parquet configurations or HFile for random-access.
-
-### 1. Mixed Base File Format Support
-
-**Per-Column Group Format Selection**: Each column group within a file group
can use different base file formats:
-- **Structured Column Groups**: Continue using Parquet with standard
optimizations
-- **Unstructured Column Groups**: Use Lance format for vector/embedding data
or specially configured Parquet for BLOB storage
-
-**Format Configuration**: File format is determined at column group creation
time based on (per the current RFC-80).
-But, ideally all these configurations should be automatic and Hudi should
auto-generate colum group names and mappings.
-
-
-```sql
-CREATE TABLE multimedia_catalog (
- id BIGINT,
- product_name STRING,
- category STRING,
- image BINARY,
- video LARGE_BINARY,
- embeddings ARRAY<FLOAT>
-) USING HUDI
-TBLPROPERTIES (
- 'hoodie.table.type' = 'MERGE_ON_READ',
- 'hoodie.bucket.index.hash.field' = 'id',
- 'hoodie.columngroup.structured' = 'id,product_name,category;id',
- 'hoodie.columngroup.images' = 'id,image;id',
- 'hoodie.columngroup.videos' = 'id,video;id',
- 'hoodie.columngroup.ml' = 'id,embeddings;id',
- 'hoodie.columngroup.images.format' = 'parquet',
- 'hoodie.columngroup.videos.format' = 'lance',
- 'hoodie.columngroup.ml.format' = 'hfile'
-)
-```
-
-### 2. Dynamic Inline/Out-of-Line Storage
+The design introduces an abstraction that allows inline and out-of-line
storage of byte arrays that work seamlessly for the end user. Structured
columns continue using
+Parquet format, while unstructured data can use specialized formats like Lance
or optimized Parquet configurations or HFile for random-access.
-**Storage Decision Logic**: During write time, Hudi determines storage
strategy based on:
-- **Inline Storage**: BLOB data < 1MB stored directly in the column group
file, to avoid excessive cloud storage API calls.
-- **Out-of-Line Storage**: Large BLOB data stored in dedicated object store
locations with pointers in the main file, to avoid write amplification during
updates.
+### 1. Storage Abstraction
+We will add a Blob type to the HoodieSchema that encapsulates both inline and
out-of-line storage strategies. This will allow the user to use a mix of
storage strategies seamlessly.
-
-**Storage Pointer Schema**:
+**Storage Schema**:
```json
{
"type": "record",
- "name": "BlobPointer",
+ "name": "Blob",
"fields": [
{"name": "storage_type", "type": "string"},
- {"name": "size", "type": "long"},
- {"name": "checksum", "type": "string"},
- {"name": "external_path", "type": ["null", "string"]},
- {"name": "compression", "type": ["null", "string"]}
+ {"name": "data", "type": ["null", "bytes"]},
+ {"name": "reference", "type": ["null", {
+ "type": "record",
+ "name": "BlobReference",
+ "fields": [
+ {"name": "external_path", "type": "string"},
+ {"name": "position", "type": "long"},
+ {"name": "size", "type": "long"},
+ {"name": "managed", "type": "boolean"}
+ ]
+ }]}
]
}
```
+The `managed` flag will be used by the cleaner to determine if an out-of-line
blob should be deleted when cleaning up old file slices. This allows users to
point to existing files without Hudi deleting them.
+
+### 2. Reader
+Readers will be updated to allow for lazy loading of the blob data, even when
it is inline. This will help reduce memory pressure during shuffles in
distributed query engines like Spark.
+The readers will return a reference to the blob data in the form of a path,
position, and size. This applies for both inline and out-of-line storage.
+
+The user will then use a UserDefinedFunction (UDF), or a similar
engine-specific abstraction, to read the blob data from the reference when
needed.
+
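The logic such a UDF wraps is small; a minimal sketch (names hypothetical, local paths standing in for the real storage scheme, field names per the BlobReference schema above):

```python
# Minimal sketch of the logic a blob-reading UDF would wrap (names are
# hypothetical; a real implementation would go through Hudi's storage layer).
# Field names follow the BlobReference schema above.
def materialize(reference: dict) -> bytes:
    with open(reference["external_path"], "rb") as f:
        f.seek(reference["position"])
        data = f.read(reference["size"])
    if len(data) != reference["size"]:
        raise IOError("short read: stale reference or truncated blob file")
    return data
```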
+### 3. Writer
+#### Phase 1: Basic Blob Support
+The writer will be updated to support writing blob data in both inline and
out-of-line formats.
+For out-of-line storage, the assumption is that the user will provide the
external path, position, and size of the blob data and these references will
not be managed by Hudi (they are not removed by the cleaner).
+In this phase, we will not implement dynamic inline/out-of-line storage based
on size thresholds.
+
+The writer should be flexible enough to allow the user to pass in a dataset
with blob data as simple byte arrays or records matching the Blob schema
defined above.
+
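The two accepted input shapes can be sketched as a small normalization step (hypothetical helper, not the actual writer API):

```python
# Sketch (hypothetical helper, not the actual writer API): accept either raw
# byte arrays or records already matching the Blob schema defined above.
def normalize_blob_input(value):
    if isinstance(value, (bytes, bytearray)):
        # Phase 1: raw byte arrays are simply stored inline.
        return {"storage_type": "inline", "data": bytes(value), "reference": None}
    if isinstance(value, dict) and "storage_type" in value:
        # Already Blob-shaped; user-supplied references stay unmanaged.
        return value
    raise TypeError("blob column values must be bytes or Blob-schema records")
```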
+#### Phase 2: Dynamic Inline/Out-of-Line Storage
+In this phase, the writer will be updated to support dynamic
inline/out-of-line storage based on user configured size thresholds. The user
will still be able to provide their own external path for out-of-line storage
if desired.
+When the user provides blob data in the form of byte arrays, the writer will
take arrays larger than the configured threshold and write them to files. The
user can configure the file type used for this storage (e.g Parquet, HFile,
Lance, etc).
+The writer will then generate the appropriate BlobReference for the
out-of-line storage and write that to the main file.
+Multiple blobs can be written to the same file to reduce the number of files
created in storage. All of these blobs will belong to the same file group for
ease of management.
+
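A sketch of the dynamic decision plus packing (thresholds passed as plain arguments here; a real implementation would read them from writer configuration):

```python
# Sketch of the dynamic decision plus packing (thresholds are plain arguments
# here; a real implementation would read them from writer configuration).
def plan_blob_storage(blob_sizes, inline_threshold, max_file_size):
    """Return ("inline", None) or ("out_of_line", packed_file_index) per blob,
    packing out-of-line blobs into shared files capped at max_file_size."""
    plan, file_index, current = [], 0, 0
    for size in blob_sizes:
        if size < inline_threshold:
            plan.append(("inline", None))
            continue
        if current and current + size > max_file_size:
            file_index += 1  # current packed file is full; start a new one
            current = 0
        plan.append(("out_of_line", file_index))
        current += size
    return plan
```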
+**Configurations**:
+- `hoodie.storage.blob.inline.threshold`: Size threshold in bytes for inline
vs out-of-line storage
+- `hoodie.storage.blob.outofline.packing.threshold`: Size threshold in bytes
for blobs that can be packed together in a single out-of-line file
+- `hoodie.storage.blob.outofline.packing.maxFileSize`: Size threshold in bytes
for maximum size of an out-of-line blob file
+- `hoodie.storage.blob.outofline.format`: File format to use for out-of-line
blob storage
**External Storage Layout**:
```
-{table_path}/.hoodie/blobs/{partition}/{file_group_id}/{column_group}/{instant}/{blob_id}
+{table_path}/.hoodie/blobs/{partition}/{file_group_id}/{column_name}/{instant}/{blob_id}
```
-Alternatively, User should be able to specify external storage location per
BLOB during writes, as needed.
-
-### 3. Parquet Optimization for BLOB Storage
+#### Writer Optimizations for Blob Storage
+##### Parquet
For unstructured column groups using Parquet:
- **Disable Compression**: Avoid double compression of already compressed
media files
- **Plain Encoding**: Use PLAIN encoding instead of dictionary encoding for
BLOB columns
- **Large Page Sizes**: Configure larger page sizes to optimize for sequential
BLOB access
- **Metadata Index**: Maintain BLOB metadata in Hudi metadata table for
efficient retrieval of a single blob value.
- **Disable stats**: Not very useful for BLOB columns
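As a concrete (assumed) mapping, the bullets above correspond to the following pyarrow `write_table` keyword arguments; the config names Hudi ultimately exposes are an implementation detail:

```python
# Writer options matching the bullets above, expressed as keyword arguments
# for pyarrow.parquet.write_table (an assumed mapping; Hudi's own writer may
# surface different config names).
BLOB_PARQUET_OPTIONS = {
    "compression": "NONE",               # media payloads are usually pre-compressed
    "use_dictionary": False,             # PLAIN encoding; dictionaries don't help blobs
    "data_page_size": 8 * 1024 * 1024,   # large pages for sequential blob access
    "write_statistics": False,           # min/max stats are useless on binary data
}
```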
-
-### 4. Lance Format Integration
-
-**Lance Advantages for Unstructured Data**:
+##### Lance
Review Comment:
Can we file the gaps you found as issues.. (if not done already)
##########
rfc/rfc-100/rfc-100.md:
##########
@@ -63,153 +63,129 @@ column that already exists.
### Building on Existing Foundation
-This RFC leverages two key foundation pieces:
+This RFC leverages the following foundation piece:
-1. **RFC-80 Column Groups**: Provides the mechanism to split file groups
across different column groups, enabling efficient storage of different data
types within the same logical file group.
-
-2. **RFC-99 BLOB Types**: Introduces BINARY and LARGE_BINARY types to the Hudi
type system, providing the type foundation for unstructured data storage.
+**RFC-99 BLOB Types**: Introduces BLOB types to the Hudi type system,
providing the type foundation for unstructured data storage.
## Requirements
Below are the high-level requirements for this feature.
1. Users must be able to define tables with a mix of structured (current
types) and unstructured (blob type)
columns
-2. Records are distributed across file groups like regular Hudi storage layout
into file groups. But within each
- file group, structured and unstructured columns are split into different
column groups. This way the table can
- also scalably grow in terms of number of columns.
-3. Unstructured data can be stored inline (e.g small images right inside the
column group) or out-of-line (e.g
- pointer to a multi-GB video file someplace). This decision should be made
dynamically during write/storage time.
+2. Unstructured data can be stored inline (e.g small images right inside the
file) or out-of-line (e.g
+ pointer to a multi-GB video file someplace). Out-of-line references can
also include a position within the file which
+ allows multiple blobs to be stored within a single file to reduce the
number of files in storage.
+ This decision should be made dynamically during write/storage time.
3. All table life-cycle operations and table services work seamlessly across
both column types. For example, cleaning
- the file slices should reclaim both inline and out-of-line blob data.
Clustering should be able re-organize
- records across file groups or even redistribute columns across column
groups within the same file group.
-4. Storage should support different column group distributions i.e different
membership of columns
- across column groups, across file groups, to ensure users or table services
can flexibly reconfigure all this as
- table grows, without re-writing all of the data.
-5. Hudi should expose controls at the writer level, to control whether new
columns are written to new column
- groups or expand an existing column group within a file group.
+ the file slices should reclaim out-of-line blob data when the reference is
managed by Hudi.
+ Clustering should be able to re-organize records across file groups or even
repack blobs if required.
+4. Hudi should expose controls at the writer level, to control whether to
store blobs inline or out-of-line
+ based on size thresholds.
+5. Query engines like Spark should be able to read the unstructured data and
materialize the values lazily to reduce memory pressure during shuffles.
## High-Level Design
-The design introduces a hybrid storage model where each file group can contain
multiple column groups with different file formats optimized for their data
types. Structured columns continue using
-Parquet format, while unstructured columns can use specialized formats like
Lance or optimized Parquet configurations or HFile for random-access.
-
-### 1. Mixed Base File Format Support
-
-**Per-Column Group Format Selection**: Each column group within a file group
can use different base file formats:
-- **Structured Column Groups**: Continue using Parquet with standard
optimizations
-- **Unstructured Column Groups**: Use Lance format for vector/embedding data
or specially configured Parquet for BLOB storage
-
-**Format Configuration**: File format is determined at column group creation
time based on (per the current RFC-80).
-But, ideally all these configurations should be automatic and Hudi should
auto-generate colum group names and mappings.
-
-
-```sql
-CREATE TABLE multimedia_catalog (
- id BIGINT,
- product_name STRING,
- category STRING,
- image BINARY,
- video LARGE_BINARY,
- embeddings ARRAY<FLOAT>
-) USING HUDI
-TBLPROPERTIES (
- 'hoodie.table.type' = 'MERGE_ON_READ',
- 'hoodie.bucket.index.hash.field' = 'id',
- 'hoodie.columngroup.structured' = 'id,product_name,category;id',
- 'hoodie.columngroup.images' = 'id,image;id',
- 'hoodie.columngroup.videos' = 'id,video;id',
- 'hoodie.columngroup.ml' = 'id,embeddings;id',
- 'hoodie.columngroup.images.format' = 'parquet',
- 'hoodie.columngroup.videos.format' = 'lance',
- 'hoodie.columngroup.ml.format' = 'hfile'
-)
-```
-
-### 2. Dynamic Inline/Out-of-Line Storage
+The design introduces an abstraction that allows inline and out-of-line
storage of byte arrays that work seamlessly for the end user. Structured
columns continue using
+Parquet format, while unstructured data can use specialized formats like Lance
or optimized Parquet configurations or HFile for random-access.
-**Storage Decision Logic**: During write time, Hudi determines storage
strategy based on:
-- **Inline Storage**: BLOB data < 1MB stored directly in the column group
file, to avoid excessive cloud storage API calls.
-- **Out-of-Line Storage**: Large BLOB data stored in dedicated object store
locations with pointers in the main file, to avoid write amplification during
updates.
+### 1. Storage Abstraction
+We will add a Blob type to the HoodieSchema that encapsulates both inline and
out-of-line storage strategies. This will allow the user to use a mix of
storage strategies seamlessly.
-
-**Storage Pointer Schema**:
+**Storage Schema**:
```json
{
"type": "record",
- "name": "BlobPointer",
+ "name": "Blob",
"fields": [
{"name": "storage_type", "type": "string"},
- {"name": "size", "type": "long"},
- {"name": "checksum", "type": "string"},
- {"name": "external_path", "type": ["null", "string"]},
- {"name": "compression", "type": ["null", "string"]}
+ {"name": "data", "type": ["null", "bytes"]},
+ {"name": "reference", "type": ["null", {
+ "type": "record",
+ "name": "BlobReference",
+ "fields": [
+ {"name": "external_path", "type": "string"},
+ {"name": "position", "type": "long"},
+ {"name": "size", "type": "long"},
+ {"name": "managed", "type": "boolean"}
+ ]
+ }]}
]
}
```
+The `managed` flag will be used by the cleaner to determine if an out-of-line
blob should be deleted when cleaning up old file slices. This allows users to
point to existing files without Hudi deleting them.
+
+### 2. Reader
+Readers will be updated to allow for lazy loading of the blob data, even when
it is inline. This will help reduce memory pressure during shuffles in
distributed query engines like Spark.
+The readers will return a reference to the blob data in the form of a path,
position, and size. This applies for both inline and out-of-line storage.
+
+The user will then use a UserDefinedFunction (UDF), or a similar
engine-specific abstraction, to read the blob data from the reference when
needed.
+
+### 3. Writer
+#### Phase 1: Basic Blob Support
+The writer will be updated to support writing blob data in both inline and
out-of-line formats.
+For out-of-line storage, the assumption is that the user will provide the
external path, position, and size of the blob data and these references will
not be managed by Hudi (they are not removed by the cleaner).
+In this phase, we will not implement dynamic inline/out-of-line storage based
on size thresholds.
+
+The writer should be flexible enough to allow the user to pass in a dataset
with blob data as simple byte arrays or records matching the Blob schema
defined above.
Review Comment:
ok
##########
rfc/rfc-100/rfc-100.md:
##########
@@ -63,153 +63,152 @@ column that already exists.
### Building on Existing Foundation
-This RFC leverages two key foundation pieces:
+This RFC leverages the following foundation piece:
-1. **RFC-80 Column Groups**: Provides the mechanism to split file groups
across different column groups, enabling efficient storage of different data
types within the same logical file group.
-
-2. **RFC-99 BLOB Types**: Introduces BINARY and LARGE_BINARY types to the Hudi
type system, providing the type foundation for unstructured data storage.
+**RFC-99 BLOB Types**: Introduces BLOB types to the Hudi type system,
providing the type foundation for unstructured data storage.
## Requirements
Below are the high-level requirements for this feature.
1. Users must be able to define tables with a mix of structured (current
types) and unstructured (blob type)
columns
-2. Records are distributed across file groups like regular Hudi storage layout
into file groups. But within each
- file group, structured and unstructured columns are split into different
column groups. This way the table can
- also scalably grow in terms of number of columns.
-3. Unstructured data can be stored inline (e.g small images right inside the
column group) or out-of-line (e.g
- pointer to a multi-GB video file someplace). This decision should be made
dynamically during write/storage time.
+2. Unstructured data can be stored inline (e.g small images right inside the
file) or out-of-line (e.g
+ pointer to a multi-GB video file someplace). Out-of-line references can
also include a (position, length) within the file which
+ allows multiple blobs to be stored within a single file to reduce the
number of files in storage.
+ This decision should be made dynamically during write/storage time.
3. All table life-cycle operations and table services work seamlessly across
both column types. For example, cleaning
- the file slices should reclaim both inline and out-of-line blob data.
Clustering should be able re-organize
- records across file groups or even redistribute columns across column
groups within the same file group.
-4. Storage should support different column group distributions i.e different
membership of columns
- across column groups, across file groups, to ensure users or table services
can flexibly reconfigure all this as
- table grows, without re-writing all of the data.
-5. Hudi should expose controls at the writer level, to control whether new
columns are written to new column
- groups or expand an existing column group within a file group.
+ the file slices should reclaim out-of-line blob data when the referred blob
is managed by Hudi.
+ Clustering should be able to re-organize records across file groups or even
repack blobs if required.
+4. Hudi should expose controls at the writer level, to control whether to
store blobs inline or out-of-line
+ based on size thresholds. Sane defaults should be supported for an easy
out-of-box experience, for e.g. blobs < 1MB are stored inline and blobs > 16MB
are always stored out-of-line.
+5. Query engines like Spark should be able to read the unstructured data and
materialize the values lazily to reduce memory pressure and massive data
exchange volumes during shuffles.
## High-Level Design
-The design introduces a hybrid storage model where each file group can contain
multiple column groups with different file formats optimized for their data
types. Structured columns continue using
-Parquet format, while unstructured columns can use specialized formats like
Lance or optimized Parquet configurations or HFile for random-access.
+The design introduces an abstraction that allows inline and out-of-line
storage of byte arrays that work seamlessly for the end user. Structured
columns continue using
+Parquet format, while unstructured data can use specialized formats like Lance
or optimized Parquet configurations or HFile for random-access.
-### 1. Mixed Base File Format Support
+### 1. Storage Abstraction
+We will add a `blob` type to the HoodieSchema that encapsulates both inline
and out-of-line storage strategies. This will allow the user to use a mix of
storage strategies seamlessly.
-**Per-Column Group Format Selection**: Each column group within a file group
can use different base file formats:
-- **Structured Column Groups**: Continue using Parquet with standard
optimizations
-- **Unstructured Column Groups**: Use Lance format for vector/embedding data
or specially configured Parquet for BLOB storage
+**Storage Schema**:
+```json
+{
+ "type": "record",
+ "name": "Blob",
+ "fields": [
+ {"name": "storage_type", "type": "string"},
+ {"name": "data", "type": ["null", "bytes"]},
+ {"name": "reference", "type": ["null", {
+ "type": "record",
+ "name": "BlobReference",
+ "fields": [
+ {"name": "external_path", "type": "string"},
+ {"name": "offset", "type": "long"},
+ {"name": "length", "type": "long"},
+ {"name": "managed", "type": "boolean"}
+ ]
+ }]}
+ ]
+}
+```
+The `managed` flag will be used by the cleaner to determine if an out-of-line
blob should be deleted when cleaning up old file slices. This allows users to
point to existing files without Hudi deleting them.
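To make the semantics of the schema concrete, here is an illustrative Python sketch (not Hudi code; the class and function names are assumptions for this example) mirroring the proposed record and showing how the `managed` flag would gate deletion:

```python
# Illustrative sketch only: dataclasses mirroring the proposed Blob schema,
# plus a helper showing how the `managed` flag gates cleaner deletion.
from dataclasses import dataclass
from typing import Optional

@dataclass
class BlobReference:
    external_path: str
    offset: int
    length: int
    managed: bool  # True only when Hudi owns the out-of-line file

@dataclass
class Blob:
    storage_type: str             # e.g. "inline" or "out_of_line"
    data: Optional[bytes] = None  # populated for inline storage
    reference: Optional[BlobReference] = None

def cleaner_may_delete(blob: Blob) -> bool:
    """The cleaner may only reclaim out-of-line files that Hudi manages."""
    return blob.reference is not None and blob.reference.managed

inline = Blob(storage_type="inline", data=b"\x89PNG...")
external = Blob(storage_type="out_of_line",
                reference=BlobReference("s3://bucket/img.bin", 0, 1024, managed=False))
print(cleaner_may_delete(inline))    # False: nothing out-of-line to delete
print(cleaner_may_delete(external))  # False: user-owned file, never deleted
```

A blob whose reference has `managed=True` would be the only case eligible for reclamation.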
-**Format Configuration**: File format is determined at column group creation
time based on (per the current RFC-80).
-But, ideally all these configurations should be automatic and Hudi should
auto-generate colum group names and mappings.
+### 2. Reader
+Readers will be updated to allow for lazy loading of the blob data, even when
it is inline. This will help reduce memory pressure during shuffles in
distributed query engines like Spark.
+The readers will return a reference to the blob data in the form of a path, position, and size. This applies to both inline and out-of-line storage.
+The reference will be the latest for that row based on the table's defined
merge mode. Similarly, when merging log and base files for compaction or
clustering, the merge mode will define which blob reference is returned for
that row just like the other columns.
+We will provide the user with a prebuilt transform to efficiently read the out-of-line blob data. For example, for Spark datasets we will leverage a mapPartitions transform to batch read requests when the rows correspond to ranges within the same file.
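The batching idea can be sketched as follows (illustrative Python only, not actual Hudi/Spark code; the function name and tuple shape are assumptions): references are grouped by file and sorted by offset so that one partition can issue a small number of ranged reads per file.

```python
# Illustrative sketch only: group blob references by file and sort by
# offset so adjacent ranges in the same file can be read in one batch.
from collections import defaultdict

def plan_batched_reads(refs):
    """refs: iterable of (path, offset, length) tuples, as returned by the reader."""
    by_file = defaultdict(list)
    for path, offset, length in refs:
        by_file[path].append((offset, length))
    # Sorting by offset lets sequential ranges be coalesced into one ranged GET.
    return {path: sorted(ranges) for path, ranges in by_file.items()}

refs = [("s3://b/blobs/f1.bin", 4096, 100),
        ("s3://b/blobs/f2.bin", 0, 50),
        ("s3://b/blobs/f1.bin", 0, 4096)]
plan = plan_batched_reads(refs)
print(plan["s3://b/blobs/f1.bin"])  # [(0, 4096), (4096, 100)]
```

In Spark, the same grouping would run per partition inside a mapPartitions call, so each task only opens each underlying blob file once.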
+For Spark SQL we will provide a function that the user can leverage to
materialize the bytes from the blobs. Example syntax:
```sql
-CREATE TABLE multimedia_catalog (
- id BIGINT,
- product_name STRING,
- category STRING,
- image BINARY,
- video LARGE_BINARY,
- embeddings ARRAY<FLOAT>
-) USING HUDI
+SELECT id, url, resolve_blob(image_blob) as image_bytes FROM my_table;
+```
+
+### 3. Writer
+#### Phase 1: External Blob Support
+The writer will be updated to support writing blob data as out-of-line
references.
+For out-of-line storage, the assumption is that the user will provide the
external path, position, and size of the blob data.
+In this phase, we will not implement inline storage or dynamic
inline/out-of-line storage based on size thresholds.
+
+Users will be able to create tables with Spark SQL as well by defining custom
DDL that allows them to specify a column as a BLOB type. Example syntax:
+```sql
+CREATE TABLE my_table (
+ id STRING,
+ url STRING,
+ image_blob BLOB
+) USING hudi
TBLPROPERTIES (
'hoodie.table.type' = 'MERGE_ON_READ',
- 'hoodie.bucket.index.hash.field' = 'id',
- 'hoodie.columngroup.structured' = 'id,product_name,category;id',
- 'hoodie.columngroup.images' = 'id,image;id',
- 'hoodie.columngroup.videos' = 'id,video;id',
- 'hoodie.columngroup.ml' = 'id,embeddings;id',
- 'hoodie.columngroup.images.format' = 'parquet',
- 'hoodie.columngroup.videos.format' = 'lance',
- 'hoodie.columngroup.ml.format' = 'hfile'
+  'primaryKey' = 'id'
)
```
-### 2. Dynamic Inline/Out-of-Line Storage
+#### Phase 2: Inline Support
+The writer will be updated to support writing blob data as inline byte arrays.
These byte arrays will be stored directly in the base file format configured
for the table.
+The supported file formats will be optimized for inline blob storage by setting the proper configurations for these columns, such as disabling compression, choosing an appropriate encoding, and disabling column-level statistics.
-**Storage Decision Logic**: During write time, Hudi determines storage
strategy based on:
-- **Inline Storage**: BLOB data < 1MB stored directly in the column group
file, to avoid excessive cloud storage API calls.
-- **Out-of-Line Storage**: Large BLOB data stored in dedicated object store
locations with pointers in the main file, to avoid write amplification during
updates.
+The writer should be flexible enough to allow the user to pass in a dataset
with blob data as simple byte arrays or records matching the Blob schema
defined above.
+#### Phase 3: Dynamic Inline/Out-of-Line Storage
+In this phase, the writer will be updated to support dynamic
inline/out-of-line storage based on user configured size thresholds. The user
will still be able to provide their own external path for out-of-line storage
if desired.
+When the user provides blob data in the form of byte arrays, the writer will take arrays larger than the configured threshold and write them to files. The user can configure the file type used for this storage (e.g. Parquet, HFile, Lance, etc.).
+The writer will then generate the appropriate BlobReference for the
out-of-line storage and write that to the main file.
+Multiple blobs can be written to the same file to reduce the number of files
created in storage. All of these blobs will belong to the same file group for
ease of management.
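The storage decision described above can be sketched as follows (illustrative Python only; the default threshold values are assumptions for this example, drawn from the sample defaults in the requirements, not final defaults):

```python
# Illustrative sketch only: the dynamic inline/out-of-line decision,
# driven by the two size thresholds. Values below are example defaults.
INLINE_THRESHOLD = 1 * 1024 * 1024    # hoodie.storage.blob.inline.threshold
MAX_ELEMENT_SIZE = 16 * 1024 * 1024   # hoodie.storage.blob.outofline.container.maxElementSize

def choose_storage(blob_size: int) -> str:
    if blob_size < INLINE_THRESHOLD:
        return "inline"       # stored directly in the base file
    if blob_size <= MAX_ELEMENT_SIZE:
        return "container"    # packed with other blobs in a shared file
    return "individual"       # very large blob gets its own file

print(choose_storage(512 * 1024))        # inline
print(choose_storage(4 * 1024 * 1024))   # container
print(choose_storage(64 * 1024 * 1024))  # individual
```

The "container" branch is what allows many mid-sized blobs to share a single out-of-line file, reducing the file count in storage.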
-**Storage Pointer Schema**:
-```json
-{
- "type": "record",
- "name": "BlobPointer",
- "fields": [
- {"name": "storage_type", "type": "string"},
- {"name": "size", "type": "long"},
- {"name": "checksum", "type": "string"},
- {"name": "external_path", "type": ["null", "string"]},
- {"name": "compression", "type": ["null", "string"]}
- ]
-}
-```
+**Configurations**:
+- `hoodie.storage.blob.inline.threshold`: Size threshold in bytes for inline
vs out-of-line storage.
+- `hoodie.storage.blob.outofline.container.maxElementSize`: Size threshold in
bytes for blobs that can be stored within a container file. Blobs larger than
this threshold will be stored in their own individual files.
+- `hoodie.storage.blob.outofline.container.maxFileSize`: Size threshold in
bytes for maximum size of an out-of-line blob container file.
+- `hoodie.storage.blob.outofline.format`: File format to use for out-of-line
blob storage.
**External Storage Layout**:
```
-{table_path}/.hoodie/blobs/{partition}/{file_group_id}/{column_group}/{instant}/{blob_id}
+{table_path}/.hoodie/blobs/{partition}/{column_name}/{instant}/{blob_id}
```
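A small sketch of constructing a path from this layout template (illustrative only; the component values are placeholders, and the helper name is an assumption):

```python
# Illustrative sketch only: build an out-of-line blob path from the
# layout template above. All argument values below are placeholders.
def blob_path(table_path: str, partition: str, column_name: str,
              instant: str, blob_id: str) -> str:
    return f"{table_path}/.hoodie/blobs/{partition}/{column_name}/{instant}/{blob_id}"

print(blob_path("s3://bucket/tbl", "2024-01-01", "image_blob",
                "20240101120000", "blob-0001"))
# s3://bucket/tbl/.hoodie/blobs/2024-01-01/image_blob/20240101120000/blob-0001
```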
-Alternatively, User should be able to specify external storage location per
BLOB during writes, as needed.
-
-### 3. Parquet Optimization for BLOB Storage
+#### Writer Optimizations for Blob Storage
+##### Parquet
For blob columns stored using Parquet:
- **Disable Compression**: Avoid double compression of already compressed
media files
- **Plain Encoding**: Use PLAIN encoding instead of dictionary encoding for
BLOB columns
- **Large Page Sizes**: Configure larger page sizes to optimize for sequential
BLOB access
- **Metadata Index**: Maintain BLOB metadata in Hudi metadata table for
efficient retrieval of a single blob value.
- **Disable stats**: Not very useful for BLOB columns
-
-### 4. Lance Format Integration
-
-**Lance Advantages for Unstructured Data**:
+##### Lance
- Native support for high-dimensional vectors and embeddings
- Efficient columnar storage for mixed structured/unstructured data
- Better compression for certain unstructured data types
-Supporting Lance, working across Hudi + Lance communities will help users
unlock benefits of both currently supported
-file formats in Hudi (parquet, orc), along with benefits of Lance. Over time,
we could also incorporate newer emerging
+Supporting Lance, working across Hudi + Lance communities will help users
unlock benefits of both currently supported
+file formats in Hudi (parquet, orc), along with benefits of Lance. Over time,
we could also incorporate newer emerging
file formats in the space and other well-established unstructured file formats.
-### 5. Enhanced Table Services
-
-**Cleaning Service Extensions**:
-- Track external BLOB references in metadata table
-- Implement cascading deletion of external BLOB files during cleaning
-- Add BLOB-specific retention policies, using reference counting to reclaim
out-of-line blobs.
-
-**Compaction Service Extensions**:
-- Support cross-format compaction (merge Lance and Parquet column groups)
-- Implement BLOB deduplication during major compaction
-- Optimize external BLOB consolidation
-
-**Clustering Service Extensions**:
-- Enable redistribution of BLOB data across file groups
-- Support column group reconfiguration during clustering
-- Implement BLOB-aware data skipping strategies
-
-### 6. Flexible Column Group Management
-
-**Dynamic Column Group Creation**:
-```java
-// Writer API extensions
-HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
- .withColumnGroupStrategy(ADAPTIVE) // AUTO, FIXED, ADAPTIVE
- .withNewColumnGroupThreshold(100_000_000L) // 100MB
- .withBlobStorageThreshold(1_048_576L) // 1MB
- .build();
-```
-**Column Group Reconfiguration**:
-- Support splitting existing column groups when they grow too large
-- Enable merging small column groups during maintenance operations
-- Allow migration of columns between column groups
+### 4. Table Services
+#### Cleaning
+The cleaning service will be updated to identify the out-of-line blob
references that are managed by Hudi and no longer referenced by any active file
slices.
+To identify these references, we have three options:
+1. Scan all active file slices to build a set of referenced blob IDs and then
scan the file slices being removed to identify references in the removed slices
that are not in the active set.
+2. Maintain metadata on the blob references contained in the file in the
footer or metadata section of each base and log file. The cleaner can then read
this metadata to identify blob references in the removed slices and check if
they are still referenced in active slices.
+3. Maintain an index in the metadata table that tracks all blob references and
their reference counts. The cleaner can then use this index to identify
unreferenced blobs.
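Option 1 amounts to a set difference, sketched here in illustrative Python (not Hudi code; names and the blob-ID representation are assumptions):

```python
# Illustrative sketch only of option 1: blobs referenced by removed file
# slices but by no active slice are candidates; only Hudi-managed blobs
# among them may actually be deleted.
def unreferenced_managed_blobs(active_refs: set, removed_refs: set,
                               managed: set) -> set:
    """All arguments are sets of blob IDs; `managed` marks Hudi-managed blobs."""
    return (removed_refs - active_refs) & managed

active = {"b1", "b2"}
removed = {"b2", "b3", "b4"}
managed = {"b3"}          # b4 points at a user-owned external file
print(unreferenced_managed_blobs(active, removed, managed))  # {'b3'}
```

Options 2 and 3 compute the same candidate set, but avoid the full scan of active slices by consulting per-file footer metadata or a metadata-table index, respectively.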
-### 7. Query Engine Integration
+**Note**: This is only required for out-of-line blobs that are managed by
Hudi. Out-of-line blobs that are not managed by Hudi will not be deleted by the
cleaner. This will be part of `Phase 2` of the writer implementation.
+
+#### Blob Compaction
Review Comment:
this is all subject to change I guess. since we have not sketched out the
storage changes for packing etc..