Hi Josep,

Thanks for providing more details. Extending the metadata table schema to 
include your custom metadata is a reasonable approach and aligns with Hudi’s 
design principles for storing metadata and indexes efficiently.

Before extending the schema, there are a few things we need to consider:

1. The structure you mentioned suggests a zonemap-like index, but at a finer 
granularity: per block within a file. Could you clarify whether "blocks" refers 
to Parquet row groups or to something specific to Qbeast?

2. Would a more fine-grained column stats index (i.e., extending the column 
stats payload) suffice, or should we consider introducing a dedicated 
block-level index, such as a `BlockStatsIndex`? Either way, Hudi’s metadata 
table supports schema evolution, so introducing new fields would not break 
existing users.

3. How do you envision updating the block-level stats? The 
`HoodieBackedTableMetadataWriter` class provides APIs for updating the metadata 
table, so the primary consideration would be designing the update algorithm for 
your index.
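
To make the zonemap idea in point 1 concrete, here is a minimal sketch of how 
per-block min/max stats could be used to skip blocks at read time. It is only 
an illustration under assumptions: `BlockStats` and `prune_blocks` are 
hypothetical names, not Hudi or Qbeast APIs, and the fields simply mirror the 
`blocks` struct from your proposed schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BlockStats:
    """Hypothetical per-block stats, mirroring the proposed `blocks` fields."""
    id: int
    min: int
    max: int
    element_count: int


def prune_blocks(blocks: List[BlockStats], lo: int, hi: int) -> List[int]:
    """Return ids of blocks whose [min, max] range overlaps the query range [lo, hi].

    Blocks that do not overlap cannot contain matching values and can be skipped.
    """
    return [b.id for b in blocks if b.max >= lo and b.min <= hi]


# Illustrative stats for one file (made-up values).
blocks = [
    BlockStats(id=0, min=0, max=99, element_count=1000),
    BlockStats(id=1, min=100, max=199, element_count=1000),
    BlockStats(id=2, min=150, max=400, element_count=500),
]

# A filter such as "value BETWEEN 120 AND 160" only needs blocks 1 and 2.
candidates = prune_blocks(blocks, lo=120, hi=160)
print(candidates)  # [1, 2]
```

Whether this lookup is served by an extended column stats payload or by a 
dedicated index, the read-side check would stay the same; only the way the 
stats are stored and updated differs.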

I would highly encourage you to open an RFC, especially if you believe this 
index could serve as a general-purpose block-level zonemap and benefit the 
wider Hudi community. Your insights would be valuable, particularly if this 
aligns with broader indexing enhancements in Hudi.

Regards,  
Sagar

On 2025/01/20 16:24:30 Josep Sampé wrote:
> Hi Sivabalan,
> 
> Thanks for your response. The metadata we need to store is indeed per-file,
> and it is leveraged primarily during reads. Currently, we are using the
> extraMetadata field in the commit files, but this approach requires reading
> both the active and archive timelines to extract the information during
> reads.
> 
> We are exploring a solution where the metadata is stored in the
> MetadataTable for faster retrieval and improved performance. This would
> also help align with Hudi's internals, as the MetadataTable is primarily
> used for storing indexes and other metadata-related information.
> 
> In our solution, we would aim to extend the metadata table schema and
> include something like this:
> 
> |-- qbeastMetadata: struct (nullable = true)
>     |-- fileName: string (nullable = false)
>     |-- revision: integer (nullable = false)
>     |-- blocks: struct (nullable = false)
>         |-- id: integer (nullable = false)
>         |-- min: integer (nullable = false)
>         |-- max: integer (nullable = false)
>         |-- elementCount: integer (nullable = false)
> 
> Looking at the code, it seems that the default schema is defined in the
> HoodieMetadata.avsc file (
> https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieMetadata.avsc),
> and the classes that manage the table are generated automatically for this
> schema.
> 
> Our question is: what would be the proper way to extend the default schema
> to include the metadata we need and generate the classes to manage it?
> 
> Best regards,
> 
> -Josep
> 
