Re: V4: Block-level Pruning for Inlined Metadata (Adaptive Metadata Tree)

Amogh Jahagirdar Tue, 30 Dec 2025 10:59:30 -0800

Meant to say *partition tuple on data files.

The other part I should clarify is why the stats-on-expressions approach
preserves pruning: the majority of transforms (all except bucketing) are
monotonic, so the lower/upper bounds on those stats effectively "cover" the
same range of values as the partition tuple would. For bucketing, we'd have
to store the result of the bucketing function in stats as well, so we can
also preserve the ability to do things like bucket joins.
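
To make the monotonicity argument concrete, here is a minimal Python sketch. It is illustrative only and is not the spec or any Iceberg API: the `FileStats` class, its field names, and the `may_contain` and `aggregate` helpers are hypothetical, and `days` is assumed to mean whole days since the Unix epoch in UTC.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def days(ts: datetime) -> int:
    """Whole days since the Unix epoch; non-decreasing in ts (monotonic)."""
    return (ts - EPOCH).days


@dataclass
class FileStats:
    # Hypothetical stat entry for the expression days(ts) on one data file.
    path: str
    lower_days: int
    upper_days: int


def may_contain(stats: FileStats, ts_from: datetime, ts_to: datetime) -> bool:
    """Return True if the file could hold rows with ts_from <= ts <= ts_to."""
    # Because days() is monotonic, the ts range projects onto the range
    # [days(ts_from), days(ts_to)], which is compared against the stored bounds.
    return stats.upper_days >= days(ts_from) and stats.lower_days <= days(ts_to)


def aggregate(files: list[FileStats]) -> tuple[int, int]:
    """Roll file-level bounds up into one manifest-level lower/upper pair."""
    return min(f.lower_days for f in files), max(f.upper_days for f in files)


def utc(year: int, month: int, day: int, hour: int = 0) -> datetime:
    return datetime(year, month, day, hour, tzinfo=timezone.utc)


# Two files, each holding a single day, so lower == upper on each file.
files = [
    FileStats("a.parquet", days(utc(2025, 12, 1)), days(utc(2025, 12, 1))),
    FileStats("b.parquet", days(utc(2025, 12, 30)), days(utc(2025, 12, 30))),
]

# A plain data filter on ts keeps only the file whose bounds overlap it.
kept = [f.path for f in files
        if may_contain(f, utc(2025, 12, 30), utc(2025, 12, 30, 23))]
```

A bucketing function gives no such guarantee, since nearby source values can land in unrelated buckets, which is why the bucket result itself would need to be stored in stats.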


On Tue, Dec 30, 2025 at 11:46 AM Amogh Jahagirdar <[email protected]> wrote:

> Hey Viquar,
>
> There shouldn't be a read regression here since the data files would have
> columnar stats which would cover the ability to prune based on partitions
> (since essentially all the partition transforms are derivations on a source
> data column). There have been discussions in the sync on whether we should
> keep the partition tuple for manifests, and there are nuances around writer
> requirements if we were to rely completely on column stats. Regardless of
> whether the partition tuple is kept, from a pruning perspective we certainly
> want to keep the same level of pruning as we had before; that's a critical
> property to preserve.
>
> If we model the partition transform as an expression with its own ID, we
> could then have stats on that expression. For example, if you have a column
> ts and partition by days(ts), there'd be an expression in metadata
> representing days(ts), and in stats for the data file there'd be a
> stat entry containing lower(days(ts)) and upper(days(ts)). For a
> partitioned file, the lower and upper bounds would have to be equal. For a
> leaf manifest in the root, we'd have the aggregated lower/upper stats which
> is effectively the same as the partition field summary that exists today.
> Then in short, a reader could just run data filters and get the same level
> of pruning as before. Notice that in this modeling we avoid having to tie a
> manifest to a given partition spec like what happens today.
>
> I do think the aspect we still need to reach a conclusion on is whether we
> should keep the partition tuple or rely completely on stats on expressions.
> For reference, here is a past v4 sync discussion on this topic (linked to
> the time the discussion starts):
> <https://drive.google.com/file/d/1gv8TrR5xzqqNxek7_sTZkpbwQx1M3dhK/view?usp=sharing&t=2327>
> Let me know if that makes sense!
>
> On Tue, Dec 30, 2025 at 10:47 AM vaquar khan <[email protected]>
> wrote:
>
>> Hi everyone,
>>
>> I’ve been following the recent discussions and design documents regarding
>> the Adaptive Metadata Tree and Single-File Commits for the V4 Spec.
>>
>> While moving to a Root Manifest structure solves the write amplification
>> issue on S3/GCS, I am concerned about a potential regression in Partition
>> Pruning efficiency for readers. Specifically, when Data Files are inlined
>> into the Root Manifest, we lose the explicit partition summary bounds that
>> existed in the V3 Manifest List.
>>
>> Without a standardized way to store lightweight partition stats for these
>> inlined entries, query planners may be forced to scan significantly more
>> metadata bytes to perform the same pruning we get for free today.
>>
>> *Proposal*: I propose we explicitly standardize a "Compact Partition
>> Summary" (possibly using Bloom Filters or compressed min/max tuples) within
>> the Root Manifest entry schema. This would ensure that V4 maintains the
>> "File Skipping" performance of V3 while gaining the write throughput of the
>> new tree structure.
>>
>> I am drafting a short design doc outlining the schema changes and
>> backward compatibility implications for this.
>>
>> Before I circulate the doc, has there been any consensus on how to handle
>> partition stats for inlined files in the combined Spitzer/Jahagirdar
>> proposal?
>>
>> Regards,
>> Viquar Khan
>> Sr. Data Architect
>> https://www.linkedin.com/in/vaquar-khan-b695577/
>>
>
