One approach, which I think has served us well in the Rust ecosystem, has
been to keep the Parquet implementation in a separate library and to
carefully design APIs that enable downstream optimizations, rather than
maintaining multiple, more tightly integrated implementations in different
query engines.

Specifically, have you considered adding the appropriate APIs to the
parquet-java codebase (for example, an API to get the byte ranges needed to
prefetch given a set of filters)? It would take non-trivial care to design
these APIs correctly, but you could then plausibly use them to implement
the system-specific optimizations you describe. It may be hard to implement
Parquet optimizations purely at the stream level without the more detailed
information known to the decoder.
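
To make this concrete, here is a rough sketch of the shape such an API
could take. I've written it in Rust simply because that is the ecosystem I
work in; every name in it (PrefetchPlanner, ColumnPredicate,
plan_fetch_ranges) is invented for illustration and does not exist in
parquet-java or arrow-rs today:

```rust
use std::ops::Range;

/// Hypothetical placeholder for a predicate the engine has pushed down on a
/// single leaf column (the real representation would be engine-specific).
pub struct ColumnPredicate {
    pub leaf_column_index: usize,
}

/// Hypothetical API sketch: given the pushed-down predicates, report the
/// byte ranges (footer, column chunks, pages) that a subsequent scan is
/// expected to read, so the I/O layer can prefetch and coalesce them ahead
/// of decoding.
pub trait PrefetchPlanner {
    fn plan_fetch_ranges(&self, predicates: &[ColumnPredicate]) -> Vec<Range<u64>>;
}
```

The key design point is that the decoder owns the knowledge of which byte
ranges matter, and exposes it to the I/O layer rather than the other way
around.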

I realize it is more common to have the Parquet reader/writer live in the
actual engines (e.g. Spark and Trino), but doing so means that optimizing
and implementing best practices requires duplicated effort. Of course the
shared-library approach comes with its own tradeoffs, such as having to
manage requirements across multiple engines, coordinate release schedules,
etc.

Examples of some generic APIs in arrow-rs's Parquet reader are:
1. Filter evaluation API (note it is not part of a query engine; a usage
sketch follows below) [1]
2. PushDecoder to separate I/O from Parquet decoding [2]
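
For reference, a minimal usage sketch of the RowFilter API from [1] is
below (the file path and the trivial not-null predicate are placeholders; a
real engine would plug in its own expression evaluation, and exact imports
may differ slightly between arrow/parquet crate versions):

```rust
use std::fs::File;

use arrow::compute::is_not_null;
use arrow::record_batch::RecordBatch;
use parquet::arrow::arrow_reader::{
    ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter,
};
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder path; any local Parquet file works.
    let file = File::open("data.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // The predicate only needs leaf column 0, so only that column is decoded
    // in order to evaluate it; here it simply keeps non-null rows.
    let mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);
    let predicate =
        ArrowPredicateFn::new(mask, |batch: RecordBatch| is_not_null(batch.column(0)));

    // The reader evaluates the filter during the scan and skips rows that
    // fail it, without any query engine being involved.
    let reader = builder
        .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
        .build()?;

    for batch in reader {
        println!("read {} rows", batch?.num_rows());
    }
    Ok(())
}
```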

Andrew

[1]:
https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.RowFilter.html
[2]:
https://github.com/apache/arrow-rs/blob/fea605cb16f7524cb69a197bfa581a1d4f5fe5d0/parquet/src/arrow/push_decoder/mod.rs#L218-L233

On Wed, Nov 19, 2025 at 8:28 AM Ahmar Suhail <[email protected]> wrote:

> Hey everyone,
>
> I'm part of the S3 team at AWS, and a PMC member on the Hadoop project,
> contributing mainly to S3A. I would like to start a discussion on
> collaborating on a single Apache-level project, which will implement
> Parquet input-stream-level optimisations like readVectored() in a unified
> place, rather than having vendor-specific implementations.
>
> Last year, my team started working on an analytics accelerator for S3
> <https://github.com/awslabs/analytics-accelerator-s3> (AAL), with the goal
> of improving query performance for Spark workloads by implementing
> client-side best practices. You can find more details about the project
> in this
> doc
> <
> https://docs.google.com/document/d/13shy0RWotwfWC_qQksb95PXdi-vSUCKQyDzjoExQEN0/edit?tab=t.0#heading=h.3lc3p7s26rnw
> >,
> which was shared on the Iceberg mailing lists earlier this year, and the
> Iceberg issue to integrate this as the default stream here
> <https://github.com/apache/iceberg/issues/14350>.
>
> The team at Google has gcs-analytics-core
> <https://github.com/GoogleCloudPlatform/gcs-analytics-core>, which
> implements Parquet stream-level optimizations and was released in
> September of this year; the Iceberg issue is here
> <https://github.com/apache/iceberg/issues/14326>.
>
> Most Parquet reader optimisations are not vendor-specific, with the major
> required feature set being:
>
>    - Parquet footer prefetching and caching - Prefetch the last X
>    bytes (e.g. 32KB) to avoid the "Parquet footer dance" and cache them.
>    - Vectored reads - Lets the Parquet reader pass in a list of columns
>    that can be prefetched in parallel.
>    - Sequential prefetching - Useful for speeding up cases where the whole
>    Parquet object is going to be read (e.g. DistCp), and should help with
>    compaction as well.
>
>
> With this in mind, I would like to propose the following:
>
>    - A new ASF project (top-level, or a sub-project of the existing
>    Hadoop/Iceberg projects).
>    - The project has the goal of bringing stream-reading best practices
>    into one place, e.g. for Parquet, it implements footer prefetching and
>    caching, vectored reads, etc.
>    - It implements non-format-specific best practices/optimisations, e.g.
>    sequential prefetching and reading small objects in a single GET.
>    - It is integrated into upstream projects like Iceberg and Hadoop as a
>    replacement/alternative for the current input stream implementations.
>
> We can structure it similarly to how Hadoop and Iceberg are today:
>
>    - A shared logical layer (think of it as similar to hadoop-common),
>    where the common logic goes. Ideally, 80% of the code ends up here
>    (optimisations, memory management, thread pools, etc.).
>    - A light vendor-specific client layer (kind of like the
>    hadoop-aws/gcp/abfs modules), where any store-specific logic ends up. I
>    imagine different cloud stores will have different requirements on
>    things like optimal request sizes, concurrency, and certain features
>    that are not common.
>
> Note: These are all high-level ideas, influenced by the direction AAL has
> taken in the last year, and perhaps there is a different, more optimal way
> to do this altogether.
>
> From TPC-DS benchmarking my team has done, it looks like a 10% query read
> performance gain can be achieved through the optimisations listed above,
> and through collaboration we can likely drive this number up further. For
> example, it would be great to discuss how Spark and the Parquet reader can
> pass any additional information they have to the stream (similar to
> vectored reads), which can help read performance.
>
> In my opinion, there is a lot of opportunity here, and collaborating on a
> single, shared ASF project helps us achieve it faster, both in terms of
> adoption across upstream projects (e.g. Hadoop, Iceberg, Trino) and
> long-term maintenance of libraries like these. It also gives us an
> opportunity to combine our knowledge in this space and react to upcoming
> changes in the Parquet format.
>
> If this sounds good, as a next step I can schedule a sync post-Thanksgiving
> to brainstorm ideas and next steps.
>
> Thank you, and looking forward to hearing your thoughts.
>
> Ahmar
>
