Thanks Andrew,

I think you’re referring to adding the right APIs into the parquet-java
library. The readVectored() API was added to parquet-java a couple of years
ago (thanks to Mukund and Steve), PR here:
https://github.com/apache/parquet-java/pull/1139.
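
For anyone who hasn’t used it, this is roughly what the vectored read path
looks like at the Hadoop stream level, which is what that parquet-java change
builds on. The offsets/lengths are placeholders and the helper class is just
for illustration:

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileRange;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

public class VectoredReadExample {
  // Ask for two (placeholder) column-chunk ranges in a single vectored call;
  // the stream implementation decides how to coalesce and parallelise them.
  static void readRanges(FileSystem fs, Path file) throws IOException {
    try (FSDataInputStream in = fs.open(file)) {
      List<FileRange> ranges = Arrays.asList(
          FileRange.createFileRange(0, 65_536),
          FileRange.createFileRange(4_194_304, 131_072));

      in.readVectored(ranges, ByteBuffer::allocate);

      for (FileRange range : ranges) {
        ByteBuffer data = range.getData().join(); // each range completes independently
        // ... hand 'data' to the Parquet decoder
      }
    }
  }
}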

The issue then becomes that the underlying streams, eg: the S3AInputStream [1]
in S3A, or the S3InputStream [2] in S3FileIO, must provide implementations for
it. Currently we end up with a separate implementation from each cloud
provider, for each file system, eg: Google’s GCS implementation is
GoogleHadoopFSInputStream [3].

What I’m suggesting here is that we work to get rid of this duplication, and
have a common Apache project with a single implementation of an optimized
stream. In my mind, this brings the parquet-java library closer to the
underlying data stream it relies on. And if we can establish some common
ground here, in the future we can start looking at more changes we can make
to the parquet-java library itself.

As an example, if we wanted to change parquet-java to pass down the
boundaries of the current split, so that optimized input streams can fetch
all the relevant columns for all row groups in that split, we would have to
(a sketch follows the list below):

1/ Make changes to parquet-java to pass this info down when opening the file.
2/ Change each underlying input stream implementation to make use of this
info.
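
To make point 2 concrete, here is a minimal sketch of what the stream-side
contract could look like. To be clear, this interface and its method names
are invented for illustration; nothing like it exists in parquet-java or
Hadoop today:

// Hypothetical interface, invented for illustration only.
public interface SplitAwareInputStream {

  // (1) parquet-java would call this when opening a file for a split,
  //     passing the byte boundaries of the split it is about to read.
  void setSplitBoundaries(long splitStart, long splitEnd);

  // (2) Each stream implementation would then use those boundaries to plan
  //     its prefetches, e.g. fetch the column chunks of every row group
  //     that falls inside [splitStart, splitEnd).
}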

A common project focused on optimisations means we should only need to do this 
once and can share the work/maintenance.

Hopefully I understood what you were saying correctly! But please do let me 
know in case I’ve missed the point completely 😊

Thanks,
Ahmar

[1]: 
https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java
[2]: 
https://github.com/apache/iceberg/blob/main/aws/src/main/java/org/apache/iceberg/aws/s3/S3InputStream.java
[3]: 
https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-gcp/src/main/java/org/apache/hadoop/fs/gs/GoogleHadoopFSInputStream.java

From: Andrew Lamb <[email protected]>
Reply to: "[email protected]" <[email protected]>
Date: Thursday, 20 November 2025 at 11:10
To: "[email protected]" <[email protected]>
Cc: "[email protected]" <[email protected]>, 
"[email protected]" <[email protected]>, "[email protected]" 
<[email protected]>, "[email protected]" <[email protected]>, 
"[email protected]" <[email protected]>, "Ratnasingham, Kannan" 
<[email protected]>, "Summers, Carl" <[email protected]>, "Peace, 
Andrew" <[email protected]>, "[email protected]" <[email protected]>, 
"Basik, Fuat" <[email protected]>, "[email protected]" 
<[email protected]>, "[email protected]" <[email protected]>, 
"[email protected]" <[email protected]>, "[email protected]" 
<[email protected]>
Subject: RE: [EXTERNAL] [DISCUSS] Creating an Apache project for Parquet reader 
optimisations


One approach, which I think has served us well in the Rust ecosystem, has been 
to keep the Parquet implementation in a separate library, and carefully design 
APIs that enable downstream optimizations, rather than multiple more tightly 
integrated implementations in different query engines.

Specifically, have you considered adding the appropriate APIs to the 
parquet-java codebase (for example, to get the ranges needed to prefetch given 
a set of filters)? It would take non-trivial care to design these APIs 
correctly, but you could then plausibly use them to implement the system 
specific optimizations you describe. It may be hard to implement parquet 
optimizations as a stream without more detailed information known to the 
decoder.

I realize it is more common to have the Parquet reader/writer in the actual 
engines (e.g. Spark and Trino), but doing so means that optimizing / 
implementing best practices requires duplicated effort. Of course this comes 
with the tradeoffs of having to manage requirements across multiple engines, 
coordinate release schedules, etc.

Examples of some generic APIs in arrow-rs's Parquet reader are:
1. Filter evaluation API (note it is not part of a query engine) [1]
2. PushDecoder to separate IO from parquet decoding[2]

Andrew

[1]: 
https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.RowFilter.html
[2]: 
https://github.com/apache/arrow-rs/blob/fea605cb16f7524cb69a197bfa581a1d4f5fe5d0/parquet/src/arrow/push_decoder/mod.rs#L218-L233

On Wed, Nov 19, 2025 at 8:28 AM Ahmar Suhail <[email protected]> wrote:
Hey everyone,

I'm part of the S3 team at AWS, and a PMC member on the Hadoop project,
contributing mainly to S3A. I would like to start a discussion on
collaborating on a single Apache-level project, which would implement
Parquet input stream level optimisations like readVectored() in a unified
place, rather than having vendor-specific implementations.

Last year, my team started working on an analytics accelerator for S3
<https://github.com/awslabs/analytics-accelerator-s3> (AAL), with the goal
of improving query performance for Spark workloads by implementing client
side best practices. You can find more details about the project in this doc
<https://docs.google.com/document/d/13shy0RWotwfWC_qQksb95PXdi-vSUCKQyDzjoExQEN0/edit?tab=t.0#heading=h.3lc3p7s26rnw>,
which was shared on the Iceberg mailing lists earlier this year, and the
Iceberg issue to integrate this as the default stream here
<https://github.com/apache/iceberg/issues/14350>.

The team at Google has gcs-analytics-core
<https://github.com/GoogleCloudPlatform/gcs-analytics-core>, which
implements Parquet stream level optimisations and was released in
September of this year; the Iceberg issue is here
<https://github.com/apache/iceberg/issues/14326>.

Most Parquet reader optimisations are not vendor specific; the major features
required are (a sketch of the first one follows the list):

   - Parquet footer prefetching and caching - Prefetch the last X bytes
   (eg: 32KB) to avoid the "Parquet footer dance" and cache them.
   - Vectored reads - Lets the Parquet reader pass in a list of column-chunk
   ranges that can be fetched in parallel.
   - Sequential prefetching - Useful when the whole Parquet object is going
   to be read, eg: DistCp, and should help with compaction as well.
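
To illustrate the first item, footer prefetching is essentially just reading
the object tail once and keeping it around. A minimal sketch against the
Hadoop FileSystem API follows; the 32KB window and the class/method names are
placeholders, and a real implementation would cache the buffer (eg: keyed by
path and etag) rather than re-reading it:

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;

public class FooterPrefetchSketch {
  private static final int TAIL_BYTES = 32 * 1024; // prefetch window; tunable

  // Read the last TAIL_BYTES of the file in one request so that the later
  // footer-length + metadata reads are served from this buffer instead of
  // extra remote GETs (the "footer dance").
  static byte[] prefetchTail(FileSystem fs, Path file) throws IOException {
    FileStatus status = fs.getFileStatus(file);
    long len = status.getLen();
    int toRead = (int) Math.min(TAIL_BYTES, len);
    byte[] tail = new byte[toRead];
    try (FSDataInputStream in = fs.open(file)) {
      in.readFully(len - toRead, tail); // positioned read of the tail
    }
    return tail; // a real implementation would cache this
  }
}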


With this in mind, I would like to propose the following:

   - A new ASF project (top level or a sub project of the existing
   hadoop/iceberg projects).
   - Project has a goal of bringing stream reading best practices into one
   place. Eg: For parquet, it implements footer prefetching and caching,
   vectored reads etc.
   - Implements non-format-specific best practices/optimisations, eg:
   sequential prefetching and reading small objects in a single GET.
   - Is integrated into upstream projects like Iceberg and Hadoop as a
   replacement/alternative for the current input stream implementations.

We can structure it similar to how Hadoop and Iceberg are today:

   - A shared logical layer (think of it as similar to hadoop-common), where
   the common logic goes. Ideally, 80% of the code ends up here
   (optimisations, memory management, thread pools etc.).
   - A light vendor-specific client layer (kind of like the
   hadoop-aws/gcp/abfs modules), where any store-specific logic ends up (see
   the interface sketch after this list). I imagine different cloud stores
   will have different requirements on things like optimal request sizes,
   concurrency and certain features that are not common.
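
To make that split concrete, the boundary between the two layers could be as
small as a range-read interface like the one below. All of the names here are
invented for illustration, not a settled design:

import java.io.IOException;
import java.util.concurrent.CompletableFuture;

// Hypothetical store-agnostic interface, invented for illustration.
// The shared layer (optimisations, memory management, thread pools) would
// depend only on something this small; each vendor module (S3, GCS, ABFS, ...)
// implements it with its own SDK, request sizing and concurrency limits.
public interface ObjectClient {

  // Fetch a single byte range of an object; failures complete the future
  // exceptionally.
  CompletableFuture<byte[]> getRange(String key, long offset, int length);

  // Object size, needed for tail/footer prefetching.
  long contentLength(String key) throws IOException;
}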

Note: These are all high-level ideas, influenced by the direction AAL has
taken in the last year, and perhaps there is a different, more optimal way
to do this altogether.

From the TPC-DS benchmarking my team has done, there looks to be a 10% query
read performance gain achievable through the optimisations listed above, and
through collaboration we can likely drive this number up further. For
example, it would be great to discuss how Spark and the Parquet reader can
pass any additional information they have to the stream (similar to vectored
reads), which can help read performance.

In my opinion, there is a lot of opportunity here, and collaborating on a
single, shared ASF project helps us achieve it faster, both in terms of
adoption across upstream projects (eg: Hadoop, Iceberg, Trino), and long
term maintenance of libraries like these. It also gives us an opportunity
to combine our knowledge in this space, and react to upcoming changes in
the Parquet format.

If this sounds good, as a next step I can schedule a sync post Thanksgiving
to brainstorm ideas and next steps.

Thank you, and looking forward to hearing your thoughts.

Ahmar
