> If a reader queries for bucket(user_id) = 5, the Parquet footer stats for
> *every single file* will report a range of [0, 15]. Min/max pruning eliminates
> nothing. To determine if bucket 5 actually exists in those files, the REST
> server or engine must now project the column chunk and decode the
> dictionary/data pages for all 100 entries.
> 1. In V3, we could skip a whole manifest group using a single summary
> tuple (O(1) skipping). In this V4 scenario, we move to O(N) internal
> scanning. Does the spec intend to accept this linear scan cost as the
> trade-off for write throughput, or is there a "pre-index" mechanism I’m
> missing that avoids decoding data pages for every sub-second query?
> 2. For a high-concurrency REST Catalog, the CPU and memory overhead of
> performing "partial" Parquet decodes on hundreds of inlined entries per
> request seems non-trivial. How do we ensure the catalog remains performant
> if it has to become a "mini-query engine" just to perform basic partition
> pruning?
The provided example doesn't make sense to me since it contradicts
bucketing as a partition transform. If user_id is bucketed on, that
*necessarily* means all values in a given file share the same bucket value
of user_id (a file written under a partition spec must have a single
partition tuple). So File 1 could *not* contain a spread of buckets
{0, 5, 14, 15} (and the same principle applies to the rest of the files).
Of course there can be many user_ids in a file, but the *bucket* they map
to for a given file must be the same, otherwise we can't say that file is
bucketed. A rough sketch of what I mean is below.
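Here's a minimal Python sketch of a bucket-style transform to make that
concrete (it uses the mmh3 package, and the hashing details are my
paraphrase of the spec's bucket transform for longs, so treat the exact
constants as an assumption). The point is only that a file written under
bucket(user_id) = 5 carries a single partition value; planning prunes on
that tuple, not on per-row min/max stats.

```python
# Minimal sketch of a bucket-style partition transform (paraphrasing the
# spec's bucket transform for longs; exact hashing details are an assumption).
import struct
import mmh3

NUM_BUCKETS = 16

def bucket_of(user_id: int) -> int:
    # Hash the little-endian 8-byte representation with seed 0, then take the
    # non-negative part modulo the bucket count.
    h = mmh3.hash(struct.pack("<q", user_id), 0)
    return (h & 0x7FFFFFFF) % NUM_BUCKETS

# A file under the partition tuple bucket(user_id) = 5 can only hold rows
# whose user_id maps to bucket 5, so the whole file shares one partition value.
rows_for_bucket_5 = [uid for uid in range(10_000) if bucket_of(uid) == 5]
assert all(bucket_of(uid) == 5 for uid in rows_for_bucket_5)
```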
Let's say it wasn't bucketing, and it was some arbitrary clustering
function represented through expressions. We still can't call it a
regression, because today this kind of pruning based on column or derived
stats is not even possible at the root level of the metadata tree: manifest
lists only carry upper/lower bounds on the partition values present in a
given manifest (roughly the shape sketched below). So the new version
should be a net improvement for planning costs.
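For context on what the root of the tree has to work with today, this is
roughly the shape of a manifest-list entry's partition summary (field names
paraphrase the spec's field_summary struct; treat this as a sketch, not the
normative schema):

```python
# Rough sketch of what a manifest-list entry exposes for pruning today: only
# per-partition-field summaries over the whole manifest, no column or derived
# stats. Field names paraphrase the spec's field_summary struct.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FieldSummary:
    contains_null: bool
    contains_nan: Optional[bool]
    lower_bound: Optional[bytes]  # serialized lower bound of the partition value
    upper_bound: Optional[bytes]  # serialized upper bound of the partition value

@dataclass
class ManifestListEntry:
    manifest_path: str
    partitions: List[FieldSummary]  # one summary per partition field
```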
*In fact, even if this were possible today*, in this scenario we still
couldn't call it a planning regression, because we'd be comparing the cost
of I/O plus decoding N manifest entries in the Avro manifest list (with
today's fast append, in the worst case there's a single manifest per write,
putting aside manifest rewrites) against the cost of I/O plus decoding X
data file entries in the root buffer plus M manifest references (the fanout
from the root of the tree) in the Parquet root manifest. So if we're
analyzing the cost of reading the root of the metadata tree, the core of it
is really the cost of reading the V4 Parquet root manifest vs the cost of
reading the Avro manifest list. This comes down to numbers (more on this
later), but assuming comparable sizes and logical contents in the root,
from a theoretical perspective planning cost is an improvement in v4. A
back-of-the-envelope way to frame that comparison is sketched below.
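To frame the comparison, here's an illustrative-only estimator; every
parameter below (latencies, decode costs, projected column count) is a
made-up assumption just to show the shape of the trade-off, not a
measurement.

```python
# Illustrative-only cost model for reading the root of the metadata tree.
# All parameters are hypothetical assumptions, not measurements.

def v3_root_cost(n_manifest_entries: int,
                 object_read_ms: float = 20.0,
                 avro_entry_decode_us: float = 2.0) -> float:
    """Read the Avro manifest list and decode all N manifest entries."""
    return object_read_ms + n_manifest_entries * avro_entry_decode_us / 1000.0

def v4_root_cost(x_inlined_data_files: int,
                 m_manifest_refs: int,
                 object_read_ms: float = 20.0,
                 parquet_value_decode_us: float = 1.0,
                 projected_columns: int = 3) -> float:
    """Read the Parquet root manifest and decode only the projected columns
    for the inlined data-file entries plus the leaf-manifest references."""
    values = (x_inlined_data_files + m_manifest_refs) * projected_columns
    return object_read_ms + values * parquet_value_decode_us / 1000.0

# Example with made-up numbers: 100 single-file fast appends, no leaf manifests yet.
# Not modeled: in v3 the manifest-list entries only point at manifests, so planning
# still needs further reads to reach data-file entries; the v4 root inlines them.
print(v3_root_cost(n_manifest_entries=100))
print(v4_root_cost(x_inlined_data_files=100, m_manifest_refs=0))
```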
> If the solution to the scan cost is to flush to leaf manifests more
> frequently, don't we risk re-introducing the file-commit frequency issues
> (and S3/GCS throttling) that Single-File Commits were specifically designed
> to solve?
Not necessarily: flushing to leaf manifests doesn't need to happen on the
ingest path; it can happen as background maintenance (the choice of
if/how/when to rebalance is part of the whole "adaptive" idea). But I do
think this is where we should provide numbers to better demonstrate it.
It's a matter of amortized analysis (root size, expected commit latency,
rebalancing cost given some frequency and clustering, etc.) for different
types of workloads at different scale factors, i.e. how large a buffer of
single-file commits in the root is desirable at any given point in time;
batch workloads generally won't care much about this. A strawman of that
accounting is sketched below.
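As a strawman (the parameters and the flush policy here are hypothetical,
not anything the spec prescribes): if the root buffer is flushed to leaf
manifests every K commits, readers pay a scan cost that grows with the
buffered entries, while the rebalance cost is paid once per K commits.

```python
# Strawman amortized cost per commit when the root buffer is flushed to leaf
# manifests every K commits. All parameters are hypothetical assumptions.

def amortized_cost_per_commit(k_commits_between_flushes: int,
                              entries_per_commit: int = 1,
                              per_entry_scan_us: float = 1.0,
                              rebalance_cost_ms: float = 500.0) -> float:
    # Average number of buffered entries a reader sees between flushes.
    avg_buffered = entries_per_commit * (k_commits_between_flushes + 1) / 2
    planning_ms = avg_buffered * per_entry_scan_us / 1000.0
    maintenance_ms = rebalance_cost_ms / k_commits_between_flushes
    return planning_ms + maintenance_ms

# Flushing rarely keeps the ingest path cheap but grows the buffer readers
# must scan; flushing often does the opposite.
for k in (10, 100, 1000):
    print(k, round(amortized_cost_per_commit(k), 3))
```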
I think it's important to emphasize here that while one of the goals is to
enable the table format to do single-file commits for small writes, this
doesn't mean that *every* write *always* has to be a single-file commit;
that imposes a different set of tradeoffs. The proposed adaptive structure
does allow for it if desired, by relying on background maintenance, or
writers can choose to incur that cost at a point of their choosing.
Additionally, one of the other tests run a while back (which I don't think
is in the doc, but I'll add those details to the appendix) was a simple S3
PUT latency test across different root sizes; I believe it covered root
sizes ranging from 3 KB to 4 MB, and while latency of course increases, it
did not increase *linearly* with respect to size, and the latency
differences are much smaller than one would expect. A rough sketch of that
kind of harness is below for anyone who wants to reproduce it.
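(Placeholder bucket name, keys, and size points; this is not the exact
script that was run, just the shape of the measurement.)

```python
# Rough sketch of an S3 PUT latency test over different object sizes.
# Bucket/key names and the size points are placeholders, not the original test.
import time
import boto3

s3 = boto3.client("s3")
BUCKET = "my-test-bucket"  # placeholder

for size in (3 * 1024, 64 * 1024, 512 * 1024, 4 * 1024 * 1024):
    body = b"\x00" * size
    start = time.perf_counter()
    s3.put_object(Bucket=BUCKET, Key=f"root-manifest-{size}", Body=body)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{size / 1024:.0f} KB -> {elapsed_ms:.1f} ms")
```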
> My proposal for the Compact Partition Summary (CPS) is essentially a
> "fast-lane" index in the Root Manifest header. It provides group-level
> exclusion for these "dirty" streaming buffers so we don't have to touch the
> Parquet data pages at all unless we know the data is there.
I'll take a look at this when I get a chance this week but I'm pretty
skeptical that this additional complexity at the root level really buys us
anything here. From a quick scan I did of the contents, parts of the
proposal look to be AI generated, which is fine, but the
assumptions/conclusions drawn by it aren't quite right imo.
I do think this discussion has brought up a good point around numbers. We
didn't get to it in the last sync, but one of the topics we wanted to
discuss was inline bitmap representations (which have metadata footprint
implications) and how they relate to metadata maintenance costs for
DML-heavy operations. I was planning on setting up another sync when more
people are back from holidays, and since this topic also relates to
manifest entry sizes and scaling dynamics, perhaps we could discuss it in
that sync as well? That also gives others more time to understand and
comment on the proposal.
Thanks,
Amogh Jahagirdar
On Thu, Jan 1, 2026 at 11:07 AM vaquar khan <[email protected]> wrote:
> Hi Amogh,
>
> Thanks for the detailed perspective. I’ve updated the document
> permissions—I’d love to get your specific thoughts on the schema sections.
>
> I want to dig deeper into the "Read Regression" point because I think we
> might be looking at different use cases. I completely agree that for
> batch-processed, well-sorted data, Parquet's columnar stats are a win.
> However, the "hard blocker" I’m seeing is in the streaming ingest tail.
>
> Imagine a buffer in the Root Manifest with 100 inlined files. Because
> streaming data arrives by time and not by partition, every file contains a
> random scatter of user_id buckets.
>
>    - *File 1:* contains buckets {0, 5, 14, 15} → Parquet Min/Max: [0, 15]
>    - *File 2:* contains buckets {1, 2, 8, 15} → Parquet Min/Max: [0, 15]
>    - ... and so on for all 100 files.
>
> If a reader queries for bucket(user_id) = 5, the Parquet footer stats for
> *every
> single file* will report a range of [0, 15]. Min/max pruning eliminates
> nothing. To determine if bucket 5 actually exists in those files, the REST
> server or engine must now project the column chunk and decode the
> dictionary/data pages for all 100 entries.
>
> This leads to a few questions I’m struggling with regarding the current V4
> vision:
>
>    1. In V3, we could skip a whole manifest group using a single summary
>    tuple (O(1) skipping). In this V4 scenario, we move to O(N) internal
>    scanning. Does the spec intend to accept this linear scan cost as the
>    trade-off for write throughput, or is there a "pre-index" mechanism I’m
>    missing that avoids decoding data pages for every sub-second query?
>    2. For a high-concurrency REST Catalog, the CPU and memory overhead of
>    performing "partial" Parquet decodes on hundreds of inlined entries per
>    request seems non-trivial. How do we ensure the catalog remains performant
>    if it has to become a "mini-query engine" just to perform basic partition
>    pruning?
>    3. If the solution to the scan cost is to flush to leaf manifests more
>    frequently, don't we risk re-introducing the file-commit frequency issues
>    (and S3/GCS throttling) that Single-File Commits were specifically designed
>    to solve?
>
> My proposal for the Compact Partition Summary (CPS) is essentially a
> "fast-lane" index in the Root Manifest header. It provides group-level
> exclusion for these "dirty" streaming buffers so we don't have to touch the
> Parquet data pages at all unless we know the data is there.
>
> Does this scenario resonate with the performance goals you have for your
> proposal, or do you see a different way to handle the "random scatter"
> metadata problem?
> Regards,
> Viquar Khan
> Sr. Data Architect
> https://www.linkedin.com/in/vaquar-khan-b695577/
>