Hey everyone, I'm part of the S3 team at AWS and a PMC member on the Hadoop project, contributing mainly to S3A. I would like to start a discussion on collaborating on a single Apache-level project which will implement Parquet input-stream-level optimisations like readVectored() in a unified place, rather than having vendor-specific implementations.
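For anyone not familiar with it, vectored reads already exist in recent Hadoop releases as readVectored() on PositionedReadable/FSDataInputStream; S3A, for example, fetches the requested ranges in parallel and coalesces nearby ones. Roughly, the calling pattern from a reader looks like the sketch below (the bucket, path, offsets and lengths are made up purely for illustration):

import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileRange;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class VectoredReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Made-up bucket and object, purely for illustration.
    Path path = new Path("s3a://example-bucket/warehouse/part-00000.parquet");
    FileSystem fs = path.getFileSystem(conf);

    // Two made-up column-chunk ranges that a Parquet reader might ask for.
    List<FileRange> ranges = Arrays.asList(
        FileRange.createFileRange(4L * 1024 * 1024, 1024 * 1024),
        FileRange.createFileRange(64L * 1024 * 1024, 2 * 1024 * 1024));

    try (FSDataInputStream in = fs.open(path)) {
      // The stream is free to fetch these ranges in parallel (and coalesce
      // nearby ones) instead of the reader issuing one blocking read per range.
      in.readVectored(ranges, ByteBuffer::allocate);

      for (FileRange range : ranges) {
        ByteBuffer data = range.getData().get(); // completes once that range has been read
        // ... hand `data` to the Parquet column reader ...
      }
    }
  }
}

The nice part is that the reader only describes what it needs; how the ranges are actually fetched (parallelism, coalescing, request sizing) is left to the stream, which is exactly the kind of logic I'd like us to share.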
Last year, my team started working on an analytics accelerator for S3 <https://github.com/awslabs/analytics-accelerator-s3> (AAL), with the goal of improving query performance for Spark workloads by implementing client-side best practices. You can find more details about the project in this doc <https://docs.google.com/document/d/13shy0RWotwfWC_qQksb95PXdi-vSUCKQyDzjoExQEN0/edit?tab=t.0#heading=h.3lc3p7s26rnw>, which was shared on the Iceberg mailing lists earlier this year; the Iceberg issue to integrate it as the default stream is here <https://github.com/apache/iceberg/issues/14350>. The team at Google has gcs-analytics-core <https://github.com/GoogleCloudPlatform/gcs-analytics-core>, which implements Parquet stream-level optimizations and was released in September of this year; the Iceberg issue is here <https://github.com/apache/iceberg/issues/14326>.

Most Parquet reader optimisations are not vendor specific, with the major feature set required being:

- Parquet footer prefetching and caching - Prefetch the last X bytes (e.g. 32KB) to avoid the "Parquet footer dance", and cache them.
- Vectored reads - Let the Parquet reader pass in a list of column ranges that can be prefetched in parallel.
- Sequential prefetching - Useful for speeding up cases where the whole Parquet object is going to be read (e.g. DistCp), and should help with compaction as well.

With this in mind, I would like to propose the following:

- A new ASF project (top level, or a sub-project of the existing Hadoop/Iceberg projects).
- The project has the goal of bringing stream-reading best practices into one place. E.g. for Parquet, it implements footer prefetching and caching, vectored reads, etc.
- It implements non-format-specific best practices/optimisations, e.g. sequential prefetching and reading small objects in a single GET.
- It is integrated into upstream projects like Iceberg and Hadoop as a replacement for, or alternative to, the current input stream implementations.

We can structure it similarly to how Hadoop and Iceberg are today:

- A shared logical layer (think of it as similar to hadoop-common), where the common logic goes. Ideally, 80% of the code ends up here (optimisations, memory management, thread pools, etc.).
- A light vendor-specific client layer (kind of like the hadoop-aws/gcp/abfs modules), where any store-specific logic ends up. I imagine different cloud stores will have different requirements on things like optimal request sizes, concurrency, and certain features that are not common. There's a rough sketch of what this split could look like at the end of this mail.

Note: These are all high-level ideas, influenced by the direction AAL has taken in the last year, and perhaps there is a different, more optimal way to do this altogether.

From the TPC-DS benchmarking my team has done, there looks to be a 10% query read performance gain that can be achieved through the optimisations listed above, and through collaboration we can likely drive this number up further. For example, it would be great to discuss how Spark and the Parquet reader can pass any additional information they have to the stream (similar to vectored reads), which can help read performance.

In my opinion, there is a lot of opportunity here, and collaborating on a single, shared ASF project helps us achieve it faster, both in terms of adoption across upstream projects (e.g. Hadoop, Iceberg, Trino) and long-term maintenance of libraries like these. It also gives us an opportunity to combine our knowledge in this space, and to react to upcoming changes in the Parquet format.
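To make the layering a bit more concrete, here is a very rough, purely hypothetical sketch of the split (none of these interfaces or names exist today; they're only meant to show where the logic would live):

import java.nio.ByteBuffer;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

/** Hypothetical vendor-specific client layer: all a store provides is ranged GETs plus tuning hints. */
interface ObjectClient {
  /** Fetch one byte range of an object, e.g. a single GET with a Range header. */
  CompletableFuture<ByteBuffer> getRange(String key, long offset, int length);

  /** Store-specific tuning (optimal request size, max concurrency, ...). */
  default int preferredRequestSize() {
    return 8 * 1024 * 1024;
  }
}

/**
 * Hypothetical shared logical layer: the format-aware optimisations (footer
 * prefetch + cache, vectored/parallel column reads, sequential prefetching,
 * memory management) are written once against ObjectClient, so S3/GCS/ABFS
 * only differ in the thin client underneath.
 */
class ParquetLogicalStream {
  private final ObjectClient client;
  private final String key;
  private final long fileLength;
  private volatile ByteBuffer cachedFooter; // footer fetched once, then reused

  ParquetLogicalStream(ObjectClient client, String key, long fileLength) {
    this.client = client;
    this.key = key;
    this.fileLength = fileLength;
  }

  /** Prefetch the last tailBytes of the file so footer + metadata come back in one request. */
  CompletableFuture<ByteBuffer> prefetchFooter(int tailBytes) {
    long offset = Math.max(0, fileLength - tailBytes);
    return client.getRange(key, offset, (int) (fileLength - offset))
        .thenApply(buffer -> cachedFooter = buffer);
  }

  /** Vectored read: fire all requested {offset, length} ranges in parallel through the client. */
  List<CompletableFuture<ByteBuffer>> readRanges(List<long[]> offsetsAndLengths) {
    return offsetsAndLengths.stream()
        .map(r -> client.getRange(key, r[0], (int) r[1]))
        .collect(Collectors.toList());
  }
}

The intent is that everything interesting (footer caching, parallel range reads, prefetch policies, memory management) is written once against a small store interface, and each cloud store only ships the thin client plus its tuning hints.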
If this sounds good, then as a next step I can schedule a sync post-Thanksgiving to brainstorm ideas and next steps. Thank you, and looking forward to hearing your thoughts.

Ahmar
