Hey Amogh,

Thanks for the clarification and the video links.

I agree that expression-based stats work well for monotonic transforms like
day(ts), so we are aligned there. My concern is specifically around the
non-monotonic cases (bucketing) and the impact on server-side planning. As
you noted, we'd need to store bucket function results in the stats to make
that work. Doing that at the file level introduces a lot of redundancy, and
more importantly, it forces the planner to deserialize every single inlined
file entry to check those stats. That linear scan is going to be a
bottleneck for REST catalogs handling heavy concurrent loads.

I’ve written up a proposal for a hybrid approach that solves this. It uses
a "Compact Partition Summary" (effectively a virtual manifest header) for
group-level pruning, while relying on your expression stats for the
file-level pruning.
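To make the shape concrete, here is a rough sketch of the two checks I have
in mind. All of the class/field names and the single-long stat
representation below are illustrative only, not the schema I'm proposing;
the point is just where the group-level summary sits relative to your
per-file expression stats. The design doc linked below has the actual
layout.

    // Illustrative only: single long-typed bound per expression ID.
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    class ExpressionStat {
      final int exprId;  // ID of the transform expression in table metadata,
                         // e.g. day(ts) or bucket(16, id)
      final long lower;  // lower bound of the transform result for this file
      final long upper;  // upper bound (== lower for a single-partition file)

      ExpressionStat(int exprId, long lower, long upper) {
        this.exprId = exprId;
        this.lower = lower;
        this.upper = upper;
      }

      boolean overlaps(long lo, long hi) {
        return lower <= hi && lo <= upper;
      }
    }

    class InlinedFileEntry {
      final String path;
      final Map<Integer, ExpressionStat> stats;  // file-level expression stats

      InlinedFileEntry(String path, Map<Integer, ExpressionStat> stats) {
        this.path = path;
        this.stats = stats;
      }
    }

    // "Compact Partition Summary": aggregated bounds for one group of
    // inlined entries, readable without deserializing the entries themselves.
    class CompactPartitionSummary {
      final Map<Integer, ExpressionStat> aggregated;

      CompactPartitionSummary(Map<Integer, ExpressionStat> aggregated) {
        this.aggregated = aggregated;
      }

      boolean mayContain(int exprId, long lo, long hi) {
        ExpressionStat agg = aggregated.get(exprId);
        return agg == null || agg.overlaps(lo, hi);
      }
    }

    class HybridPruner {
      // Group-level check first; only groups that survive it pay the cost
      // of deserializing their inlined entries for the file-level check.
      static List<InlinedFileEntry> prune(CompactPartitionSummary summary,
                                          List<InlinedFileEntry> group,
                                          int exprId, long lo, long hi) {
        if (!summary.mayContain(exprId, lo, hi)) {
          return List.of();  // whole group skipped, no entries touched
        }
        return group.stream()
            .filter(f -> {
              ExpressionStat s = f.stats.get(exprId);
              return s == null || s.overlaps(lo, hi);  // keep if unknown or overlapping
            })
            .collect(Collectors.toList());
      }
    }

For bucket(N, col), a file-level entry would just carry lower == upper ==
the bucket value, so an equality predicate on the computed bucket prunes
with the same overlap check and bucket joins keep working. It's the
group-level slot where a plain min/max degrades (over many files it tends
toward [0, N-1]), which is why I'd allow something richer there for bucket
expressions, such as the Bloom filter or compressed tuples mentioned in my
first mail below.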
Here is the design doc:
https://docs.google.com/document/d/1mYTEK5eA6IjOc6yxRCvEBIzdbJO-rjXbr3YHtgWKNdo/edit?tab=t.0#heading=h.xr9w02y1yuod

This way we get the V4 write throughput but keep the O(1) pruning for the
active tail. Let me know if this hybrid structure works for you.

Regards,
Viquar Khan
Sr Data Architect
https://www.linkedin.com/in/vaquar-khan-b695577/

On Wed, 31 Dec 2025 at 00:24, vaquar khan <[email protected]> wrote:

> Thanks, Amogh. I really appreciate your quick and detailed response. I’m
> going to watch the video now and will get back to you shortly with my
> thoughts.
>
> Regards,
> Viquar Khan
> Sr Data Architect
> https://www.linkedin.com/in/vaquar-khan-b695577/
>
> On Tue, 30 Dec 2025 at 12:59, Amogh Jahagirdar <[email protected]> wrote:
>
>> Meant to say *partition tuple on data files.
>>
>> The other part I should clarify is that the reason why the stats on
>> expressions approach preserves pruning is the majority of transforms
>> (except bucketing) are monotonic. So the lower/upper bounds on those stats
>> effectively "cover" the same amount. For bucketing, we'd have to store the
>> result of the bucketing function in stats as well, so we can also preserve
>> the ability to do things like bucket joins.
>>
>> On Tue, Dec 30, 2025 at 11:46 AM Amogh Jahagirdar <[email protected]>
>> wrote:
>>
>>> Hey Viquar,
>>>
>>> There shouldn't be a read regression here since the data files would
>>> have columnar stats which would cover the ability to prune based on
>>> partitions (since essentially all the partition transforms are derivations
>>> on a source data column). There's been discussions in the sync on if we
>>> should keep the partition tuple for manifests and there's nuances on writer
>>> requirements if we were to completely rely on column stats, but regardless
>>> of if the partition tuple is kept or not, from a pruning perspective we
>>> certainly want to keep the same level of pruning as we had before; that's a
>>> critical property to preserve.
>>>
>>> If we model the partition transform as an expression with its own ID, we
>>> could then have stats on that expression. e.g. if you have a column ts,
>>> and partitioning days(ts), there'd be an expression in metadata
>>> representing days(ts), and in stats for the data file there'd be a
>>> stat entry containing lower(days(ts)) and upper(days(ts)). For a
>>> partitioned file, the lower and upper bounds would have to be equal.
>>> For a leaf manifest in the root, we'd have the aggregated lower/upper
>>> stats, which is effectively the same as the partition field summary that
>>> exists today. Then in short, a reader could just run data filters and get
>>> the same level of pruning as before. Notice that in this modeling we avoid
>>> having to tie a manifest to a given partition spec like what happens today.
>>>
>>> I do think the aspect to get to more of a conclusion on is if we should
>>> keep the partition tuple or completely rely on stats on expressions. For
>>> reference, from a past v4 sync
>>> <https://drive.google.com/file/d/1gv8TrR5xzqqNxek7_sTZkpbwQx1M3dhK/view?usp=sharing&t=2327>
>>> discussion on this topic (linked to the time the discussion starts).
>>> Let me know if that makes sense!
>>>
>>> On Tue, Dec 30, 2025 at 10:47 AM vaquar khan <[email protected]>
>>> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I’ve been following the recent discussions and design documents
>>>> regarding the Adaptive Metadata Tree and Single-File Commits for the V4
>>>> Spec.
>>>>
>>>> While moving to a Root Manifest structure solves the write
>>>> amplification issue on S3/GCS, I am concerned about a potential regression
>>>> in Partition Pruning efficiency for readers. Specifically, when Data Files
>>>> are inlined into the Root Manifest, we lose the explicit partition summary
>>>> bounds that existed in the V3 Manifest List.
>>>>
>>>> Without a standardized way to store lightweight partition stats for
>>>> these inlined entries, query planners may be forced to scan significantly
>>>> more metadata bytes to perform the same pruning we get for free today.
>>>>
>>>> *Proposal*: I propose we explicitly standardize a "Compact Partition
>>>> Summary" (possibly using Bloom Filters or compressed min/max tuples) within
>>>> the Root Manifest entry schema. This would ensure that V4 maintains the
>>>> "File Skipping" performance of V3 while gaining the write throughput of the
>>>> new tree structure.
>>>>
>>>> I am drafting a short design doc outlining the schema changes and
>>>> backward compatibility implications for this.
>>>>
>>>> Before I circulate the doc, has there been any consensus on how to
>>>> handle partition stats for inlined files in the combined Spitzer/Jahagirdar
>>>> proposal?
>>>>
>>>> Regards,
>>>> Viquar Khan
>>>> Sr. Data Architect
>>>> https://www.linkedin.com/in/vaquar-khan-b695577/
>>>>
>>>
>
> --
> Regards,
> Vaquar Khan
>
>

--
Regards,
Vaquar Khan
