GitHub user suryaprasanna edited a discussion: Support Selective Metafield Population: Enable Only _hoodie_commit_time for Incremental Reads
### Context

Hudi currently supports an all-or-nothing approach for metafield population. Users can either:

- Populate all metafields (**_hoodie_record_key, _hoodie_commit_time, _hoodie_partition_path, _hoodie_file_name**, etc.)
- Disable all metafields to save on storage costs

When metafields are disabled, Hudi operations still function correctly because certain fields like _hoodie_record_key are virtualized: they are generated on the fly during operations like upserts, so the field does not need to be physically stored.

However, incremental reads rely on the **_hoodie_commit_time** metafield to identify and read delta changes at the record level. This field is essential for incremental query patterns, enabling Hudi to provide efficient record-level change tracking.

### Problem Statement

Users face a trade-off between storage efficiency and incremental read capabilities:

- Scenario 1: Users who disable metafields save storage but lose the ability to perform incremental reads because **_hoodie_commit_time** is not available
- Scenario 2: Users who enable metafields can perform incremental reads but pay for storing all metafields, even though most of them (like **_hoodie_record_key, _hoodie_partition_path**) can be virtualized and aren't strictly necessary

The gap: There's no option to selectively enable only the metafields that cannot be virtualized or are critical for certain query patterns (like **_hoodie_commit_time** for incremental reads) while excluding others that can be generated on the fly. This results in unnecessary storage overhead for users who need incremental reads but don't require the other metafields to be physically stored.
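To make the dependency concrete, here is a minimal Python sketch of the trade-off described above. It is a toy model, not Hudi code: `write_batch`, `incremental_read`, and the in-memory list standing in for a table are all hypothetical names introduced for illustration. It also assumes that commit times are fixed-width timestamp strings that can be compared lexicographically, and that an incremental read returns records committed strictly after the given begin time.

```python
from typing import Dict, List, Optional

# Metafield names taken from the discussion; everything else is hypothetical.
COMMIT_TIME = "_hoodie_commit_time"
RECORD_KEY = "_hoodie_record_key"

Row = Dict[str, Optional[object]]


def write_batch(rows: List[Row], commit_time: str, populate_meta_fields: bool) -> List[Row]:
    """Simulate one commit: copy each row and either fill or null its metafields."""
    stored = []
    for row in rows:
        out = dict(row)
        if populate_meta_fields:
            out[COMMIT_TIME] = commit_time
            out[RECORD_KEY] = str(row["id"])
        else:
            # All-or-nothing: disabling metafields also drops the commit time.
            out[COMMIT_TIME] = None
            out[RECORD_KEY] = None
        stored.append(out)
    return stored


def incremental_read(table: List[Row], begin_time: str) -> List[Row]:
    """Return rows committed strictly after begin_time (assumed semantics).

    Rows without a stored commit time cannot be placed on the timeline,
    so they are never returned.
    """
    return [
        r for r in table
        if r[COMMIT_TIME] is not None and r[COMMIT_TIME] > begin_time
    ]


if __name__ == "__main__":
    t1, t2 = "20240101000000000", "20240102000000000"

    # Scenario 2: metafields populated, incremental read works.
    table = write_batch([{"id": 1}], t1, True) + write_batch([{"id": 2}], t2, True)
    print([r["id"] for r in incremental_read(table, t1)])  # [2]

    # Scenario 1: metafields disabled, nothing to filter on.
    table = write_batch([{"id": 1}], t1, False) + write_batch([{"id": 2}], t2, False)
    print([r["id"] for r in incremental_read(table, t1)])  # []
```

In this model the record key could be recomputed from `id` at read time, which mirrors the virtualization argument, whereas the commit time has no source other than the stored value.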
### Proposed Solution

Introduce selective metafield population that allows users to:

- Enable only **_hoodie_commit_time** for incremental read support
- Nullify/disable other metafields (**_hoodie_record_key, _hoodie_partition_path, _hoodie_file_name**) that can be virtualized
- Achieve storage savings while maintaining incremental query capabilities

This would provide a middle ground between populating all metafields and populating none, optimizing for both storage efficiency and functional requirements.

GitHub link: https://github.com/apache/hudi/discussions/17959
