Hi Josep,

Thanks for providing more details. Extending the metadata table schema to include your custom metadata is a reasonable approach and aligns with Hudi’s design principles for storing metadata and indexes efficiently.

To extend the schema in the right way, we do need to consider a few things:

1. The structure you mentioned suggests a zonemap-like index, but at a finer granularity - per block within a file. Could you clarify if "blocks" refer to Parquet row groups, or are they specific to Qbeast?

2. Would a more fine-grained column stats index (i.e., extending the column stats payload) suffice, or should we consider introducing a dedicated block-level index, such as a `BlockStatsIndex`? Either way, Hudi’s metadata table supports schema evolution, so introducing new fields would not break existing users.

3. How do you envision updating the block-level stats? The `HoodieBackedTableMetadataWriter` class provides APIs for updating the metadata table, so the primary consideration would be designing the update algorithm for your index.

I would highly encourage you to open an RFC, especially if you believe this index could serve as a general-purpose block-level zonemap and be beneficial to the wider Hudi community. The community would benefit from your insights, particularly if this aligns with broader indexing enhancements in Hudi.

Regards,
Sagar

On 2025/01/20 16:24:30 Josep Sampé wrote:
> Hi Sivabalan,
>
> Thanks for your response. The metadata we need to store is indeed per-file,
> and it is leveraged primarily during reads. Currently, we are using the
> extraMetadata field in the commit files, but this approach requires reading
> both the active and archive timelines to extract the information during
> reads.
>
> We are exploring a solution where the metadata is stored in the
> MetadataTable for faster retrieval and improved performance. This would
> also help align with Hudi's internals, as the MetadataTable is primarily
> used for storing indexes and other metadata-related information.
>
> In our solution, we would aim to extend the metadata table schema and
> include something like this:
>
> |-- qbeastMetadata: struct (nullable = true)
> |    |-- fileName: string (nullable = false)
> |    |-- revision: integer (nullable = false)
> |    |-- blocks: struct (nullable = false)
> |    |    |-- id: integer (nullable = false)
> |    |    |-- min: integer (nullable = false)
> |    |    |-- max: integer (nullable = false)
> |    |    |-- elementCount: integer (nullable = false)
>
> Looking at the code, it seems that the default schema is defined in the
> HoodieMetadata.avsc file (
> https://github.com/apache/hudi/blob/master/hudi-common/src/main/avro/HoodieMetadata.avsc),
> and the classes that manage the table are generated automatically for this
> schema.
>
> Our question is: what would be the proper way to extend the default schema
> to include the metadata we need and generate the classes to manage it?
>
> Best regards,
>
> -Josep
>
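
For illustration only, here is a minimal sketch of how the struct quoted above might be written as an optional Avro field and appended to a record schema. It uses plain Python and JSON, with no Avro or Hudi libraries. Several things are assumptions: the record names `QbeastMetadata` and `QbeastBlock` are hypothetical, `base_schema` is a two-field stand-in rather than the real contents of HoodieMetadata.avsc, and the nesting (`blocks` inside `qbeastMetadata`) follows my reading of the struct in the message. The one point that is standard Avro is making the new field a union with "null" and a null default, so that records written before the change can still be read.

```python
import json


def nullable_record(name, fields):
    """Build an Avro union ["null", record] so the enclosing field is optional."""
    return ["null", {"type": "record", "name": name, "fields": fields}]


# Per-block entry, mirroring the `blocks` struct from the message.
# `blocks` is kept as a single struct, as written there; if a file holds
# several blocks, an Avro "array" of this record may be the better fit.
block_record = {
    "type": "record",
    "name": "QbeastBlock",  # hypothetical name
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "min", "type": "int"},
        {"name": "max", "type": "int"},
        {"name": "elementCount", "type": "int"},
    ],
}

# The new top-level field: nullable with default null, so older records
# that lack it remain readable under the extended schema.
qbeast_field = {
    "name": "qbeastMetadata",
    "doc": "Per-file Qbeast block metadata (illustrative only)",
    "type": nullable_record(
        "QbeastMetadata",  # hypothetical name
        [
            {"name": "fileName", "type": "string"},
            {"name": "revision", "type": "int"},
            {"name": "blocks", "type": block_record},
        ],
    ),
    "default": None,
}


def add_optional_field(schema, field):
    """Return a copy of a record schema with `field` appended at the end."""
    if any(f["name"] == field["name"] for f in schema["fields"]):
        raise ValueError("field already exists: " + field["name"])
    extended = dict(schema)
    extended["fields"] = list(schema["fields"]) + [field]
    return extended


# Stand-in for the existing metadata record; NOT the actual file contents.
base_schema = {
    "type": "record",
    "name": "HoodieMetadataRecord",
    "namespace": "org.apache.hudi.avro.model",
    "fields": [
        {"name": "key", "type": "string"},
        {"name": "type", "type": "int"},
    ],
}

extended = add_optional_field(base_schema, qbeast_field)
print(json.dumps(extended, indent=2))
```

In practice the equivalent JSON would be added by hand to the .avsc file and the classes regenerated by the build, as the message notes; the helper above only shows the shape of the change.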