Hi everyone, I’ve been following the recent discussions and design documents regarding the Adaptive Metadata Tree and Single-File Commits for the V4 Spec.
While moving to a Root Manifest structure solves the write amplification issue on S3/GCS, I am concerned about a potential regression in Partition Pruning efficiency for readers. Specifically, when Data Files are inlined into the Root Manifest, we lose the explicit partition summary bounds that existed in the V3 Manifest List. Without a standardized way to store lightweight partition stats for these inlined entries, query planners may be forced to scan significantly more metadata bytes to perform the same pruning we get for free today.

*Proposal*: I propose we explicitly standardize a "Compact Partition Summary" (possibly using Bloom Filters or compressed min/max tuples) within the Root Manifest entry schema. This would ensure that V4 maintains the "File Skipping" performance of V3 while gaining the write throughput of the new tree structure.

I am drafting a short design doc outlining the schema changes and backward compatibility implications for this. Before I circulate the doc, has there been any consensus on how to handle partition stats for inlined files in the combined Spitzer/Jahagirdar proposal?

Regards,
Viquar Khan
Sr. Data Architect
https://www.linkedin.com/in/vaquar-khan-b695577/
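
P.S. To make the idea concrete, here is a rough Python sketch of the min/max variant of the summary, assuming a single integer partition field with inclusive bounds. The names `CompactPartitionSummary`, `InlinedEntry`, `may_contain`, and `prune` are hypothetical and purely illustrative; they are not taken from any spec or existing proposal, and the real schema would be worked out in the design doc.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CompactPartitionSummary:
    """Hypothetical summary: inclusive min/max of one partition field."""
    lower: int
    upper: int


@dataclass(frozen=True)
class InlinedEntry:
    """Hypothetical Root Manifest entry for an inlined data file."""
    path: str
    summary: CompactPartitionSummary


def may_contain(summary: CompactPartitionSummary, lo: int, hi: int) -> bool:
    """True if the summary range overlaps the inclusive query range [lo, hi]."""
    return summary.lower <= hi and summary.upper >= lo


def prune(entries: List[InlinedEntry], lo: int, hi: int) -> List[str]:
    """Keep only the files whose summary overlaps the query range."""
    return [e.path for e in entries if may_contain(e.summary, lo, hi)]
```

With something like this stored per inlined entry, a planner could discard non-overlapping files from the summary alone, without reading any further metadata for them. A Bloom Filter variant would replace the range check with a membership test and would suit equality predicates better than range predicates.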
