Hey everyone,

I'm part of the S3 team at AWS, and a member of the Hadoop PMC,
contributing mainly to S3A. I would like to start a discussion on
collaborating on a single Apache-level project that implements Parquet
input stream optimisations, like readVectored(), in one unified place,
rather than having vendor-specific implementations.

Last year, my team started working on an analytics accelerator for S3
<https://github.com/awslabs/analytics-accelerator-s3> (AAL), with the goal
of improving query performance for Spark workloads by implementing
client-side best practices. You can find more details about the project in
this doc
<https://docs.google.com/document/d/13shy0RWotwfWC_qQksb95PXdi-vSUCKQyDzjoExQEN0/edit?tab=t.0#heading=h.3lc3p7s26rnw>,
which was shared on the Iceberg mailing lists earlier this year, and the
Iceberg issue to integrate this as the default stream here
<https://github.com/apache/iceberg/issues/14350>.

The team at Google has gcs-analytics-core
<https://github.com/GoogleCloudPlatform/gcs-analytics-core>, which
implements Parquet stream-level optimisations and was released in
September of this year; the Iceberg issue is here
<https://github.com/apache/iceberg/issues/14326>.

Most Parquet reader optimisations are not vendor specific. The major
features required are:

   - Parquet footer prefetching and caching - Prefetch the last X
   bytes (eg: 32KB) to avoid the "Parquet footer dance" and cache them.
   - Vectored reads - Lets the Parquet reader pass in a list of column
   ranges that can be prefetched in parallel.
   - Sequential prefetching - Useful for speeding up workloads where the
   whole Parquet object is going to be read (eg: DistCp), and should help
   with compaction as well.
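To make the vectored read idea concrete, here is a rough, store-agnostic
sketch of the API shape. All names here are stand-ins, loosely modelled on
Hadoop's readVectored(); this is not AAL or S3A code, just an illustration
of the reader handing the stream all of its ranges up front:

```java
import java.nio.ByteBuffer;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.IntFunction;

// Illustrative only: FileRange stands in for a type like Hadoop's
// org.apache.hadoop.fs.FileRange (an offset, a length, and a future result).
class FileRange {
    final long offset;
    final int length;
    final CompletableFuture<ByteBuffer> data = new CompletableFuture<>();

    FileRange(long offset, int length) {
        this.offset = offset;
        this.length = length;
    }
}

class VectoredReadSketch {
    private final byte[] object; // stand-in for the remote Parquet object

    VectoredReadSketch(byte[] object) {
        this.object = object;
    }

    // Shape of readVectored(): the reader passes every column-chunk range it
    // needs at once, and the stream fetches them in parallel instead of
    // issuing one blocking ranged GET at a time.
    void readVectored(List<FileRange> ranges, IntFunction<ByteBuffer> allocate) {
        for (FileRange r : ranges) {
            CompletableFuture.runAsync(() -> {
                ByteBuffer buf = allocate.apply(r.length);
                buf.put(object, (int) r.offset, r.length);
                buf.flip();
                r.data.complete(buf);
            });
        }
    }
}
```

In the real API the parallel fetches would be ranged GETs against the
store, with coalescing of nearby ranges; that logic is exactly the kind of
thing that belongs in the shared layer.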


With this in mind, I would like to propose the following:

   - A new ASF project (top level, or a subproject of the existing
   Hadoop/Iceberg projects).
   - Has the goal of bringing stream-reading best practices into one
   place. Eg: for Parquet, it implements footer prefetching and caching,
   vectored reads, etc.
   - Implements non-format-specific best practices/optimisations, eg:
   sequential prefetching and reading small objects in a single GET.
   - Is integrated into upstream projects like Iceberg and Hadoop as a
   replacement/alternative for the current input stream implementations.

We can structure it similarly to how Hadoop and Iceberg are today:

   - A shared logic layer (think of it as similar to hadoop-common), where
   the common logic goes. Ideally, ~80% of the code ends up here
   (optimisations, memory management, thread pools, etc.)
   - A light vendor-specific client layer (kind of like the
   hadoop-aws/gcp/abfs modules), where any store-specific logic ends up. I
   imagine different cloud stores will have different requirements on things
   like optimal request sizes, concurrency, and certain features that are
   not common.
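As a sketch of how that split could look (all names here are hypothetical,
not real AAL APIs): the shared layer owns the footer prefetch-and-cache
logic, and the only vendor-specific piece is a thin ranged-GET client:

```java
// Hypothetical names throughout -- a sketch of the proposed layering only.

// Thin vendor-specific layer: one implementation per store (S3, GCS, ABFS),
// just ranged GETs and metadata.
interface ObjectClient {
    byte[] getRange(String key, long offset, int length);
    long contentLength(String key);
}

// Shared/common layer: store-agnostic footer prefetching and caching.
class FooterPrefetchingStream {
    private static final int FOOTER_PREFETCH_BYTES = 32 * 1024; // tunable per store

    private final ObjectClient client;
    private final String key;
    private byte[] cachedFooter; // cached so repeated footer reads skip the store
    private long footerStart;

    FooterPrefetchingStream(ObjectClient client, String key) {
        this.client = client;
        this.key = key;
    }

    // On the first tail read, fetch the last FOOTER_PREFETCH_BYTES in one
    // request; subsequent footer reads (the "Parquet footer dance") are
    // served from the cache without touching the store.
    byte[] readTail(long offset, int length) {
        if (cachedFooter == null) {
            long len = client.contentLength(key);
            footerStart = Math.max(0, len - FOOTER_PREFETCH_BYTES);
            cachedFooter = client.getRange(key, footerStart, (int) (len - footerStart));
        }
        byte[] out = new byte[length];
        System.arraycopy(cachedFooter, (int) (offset - footerStart), out, 0, length);
        return out;
    }
}
```

The point of the split is that the caching policy above is written once,
while per-store differences (request sizing, concurrency limits) stay
behind the ObjectClient interface.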

Note: these are all high-level ideas, influenced by the direction AAL has
taken in the last year, and perhaps there is a different, more optimal way
to do this altogether.

From the TPC-DS benchmarking my team has done, there looks to be a 10%
query read performance gain achievable through the optimisations listed
above, and through collaboration we can likely drive this number up
further. For example, it would be great to discuss how Spark and the
Parquet reader can pass any additional information they have to the stream
(similar to vectored reads) to help read performance.
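As a strawman for that discussion (purely hypothetical names, not an
existing API in any of these projects), the reader could pass down an
access-pattern hint and the stream could size its prefetching from it:

```java
// Hypothetical sketch: hints a reader/engine could pass to the stream, and
// one way a stream might pick a prefetch strategy from them.
enum ReadPattern { SEQUENTIAL, RANDOM, WHOLE_FILE }

// splitStart/splitEnd are illustrative: eg, a Spark task's split boundaries,
// so the stream never prefetches past the data this task will read.
record ReadHints(ReadPattern pattern, long splitStart, long splitEnd) {}

class PrefetchPlanner {
    // Choose a prefetch-ahead size based on the declared access pattern.
    static int prefetchBytes(ReadHints hints) {
        return switch (hints.pattern()) {
            case WHOLE_FILE -> 8 * 1024 * 1024; // aggressive, eg: DistCp-style full reads
            case SEQUENTIAL -> 1024 * 1024;     // moderate read-ahead
            case RANDOM     -> 0;               // don't waste bytes on random access
        };
    }
}
```

Whether hints flow through the stream constructor, an options map, or the
read call itself is exactly the kind of thing a shared project could settle
once for all engines.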

In my opinion, there is a lot of opportunity here, and collaborating on a
single, shared ASF project helps us achieve it faster, both in terms of
adoption across upstream projects (eg: Hadoop, Iceberg, Trino) and the
long-term maintenance of libraries like these. It also gives us an
opportunity to combine our knowledge in this space and react to upcoming
changes in the Parquet format.

If this sounds good, as a next step I can schedule a sync post-Thanksgiving
to brainstorm ideas and next steps.

Thank you, and looking forward to hearing your thoughts.

Ahmar
