Re: [DISCUSS] new Parquet footer experiments

Neelaksh Singh Thu, 15 Aug 2024 07:52:33 -0700

Hello everyone. My name is Neelaksh. I am a Software Engineer Intern
collaborating with G-Research on benchmarking Apache Parquet-CPP
performance, focusing on metadata handling, data read operations, schema
processing, and compression algorithms. We analyzed how the number of
columns, chunking, paging, and statistics levels affect metadata decode
time, file size, and read performance. The results show how Parquet
behaves on wide tables, which helps identify suitable configurations for
different use cases and areas where Parquet file handling and processing
could be optimized.


As a next step I will be working on FlatBuffers benchmarking. So far I
have implemented a benchmarking suite that compares Parquet metadata
handling with Thrift and FlatBuffers encodings. It generates test Parquet
files, converts metadata between the Thrift and FlatBuffers formats, and
measures parsing times for both encodings. The suite includes benchmarks
for parsing Thrift metadata, for encoding and parsing FlatBuffers
metadata, and for a combined approach that appends FlatBuffers data to the
Thrift metadata. The benchmarks run with two column counts (2000 and
3000) to evaluate performance across metadata sizes. The goal is to
assess the potential benefits of FlatBuffers as an alternative or
extension to the current Thrift-based metadata in Parquet files.
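
To make the measurement idea concrete, here is a minimal sketch of the kind
of timing loop such a comparison needs. It is not the code from the repo:
Buffer, MakeSyntheticFooter, ChecksumParse and MedianParseNanos are
hypothetical names, and ChecksumParse is only a stand-in for a real Thrift
or FlatBuffers decode. It assumes the footer is available as raw bytes and
reports the median time of one parse call.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for a serialized footer: just raw bytes.
using Buffer = std::vector<std::uint8_t>;

// Builds a synthetic buffer whose size grows with the column count.
// (Assumption: real footers also grow roughly linearly with columns.)
inline Buffer MakeSyntheticFooter(std::size_t num_columns,
                                  std::size_t bytes_per_column) {
  Buffer footer(num_columns * bytes_per_column);
  for (std::size_t i = 0; i < footer.size(); ++i) {
    footer[i] = static_cast<std::uint8_t>(i % 251);
  }
  return footer;
}

// Placeholder "parser": touches every byte and returns their sum.
// A real benchmark would call the Thrift or FlatBuffers decoder here.
inline std::size_t ChecksumParse(const Buffer& footer) {
  std::size_t sum = 0;
  for (std::uint8_t byte : footer) sum += byte;
  return sum;
}

// Calls `parse` on `footer` `iterations` times and returns the median
// wall-clock duration of a single call in nanoseconds (0 if no iterations).
inline std::int64_t MedianParseNanos(
    const std::function<std::size_t(const Buffer&)>& parse,
    const Buffer& footer, int iterations) {
  assert(iterations >= 0);
  if (iterations == 0) return 0;

  std::vector<std::int64_t> samples;
  samples.reserve(static_cast<std::size_t>(iterations));
  volatile std::size_t sink = 0;  // keeps the calls from being optimized away
  for (int i = 0; i < iterations; ++i) {
    const auto start = std::chrono::steady_clock::now();
    sink = sink + parse(footer);
    const auto stop = std::chrono::steady_clock::now();
    samples.push_back(
        std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
            .count());
  }
  // The median is less sensitive to scheduler noise than the mean.
  auto mid = samples.begin() + static_cast<std::ptrdiff_t>(samples.size() / 2);
  std::nth_element(samples.begin(), mid, samples.end());
  return *mid;
}
```

For example, MedianParseNanos(ChecksumParse, MakeSyntheticFooter(3000, 64),
100) times the placeholder on a buffer sized for 3000 columns. Swapping in
the two real decoders for ChecksumParse would give directly comparable
numbers.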

 Links:

 1. Medium Post -
https://neelaksh-singh.medium.com/benchmarking-apache-parquet-my-mid-program-journey-as-an-mlh-fellow-bc0b8332c3b1
 2. Reproducible Repo -
https://github.com/Neelaksh-Singh/gresearch_parquet_benchmarking
 3. Documentation (Ongoing) -
https://github.com/Neelaksh-Singh/gresearch_parquet_benchmarking/blob/main/Parquet-CPP-Benchmarking.ipynb
 4. FlatBuffers Code -
https://github.com/Neelaksh-Singh/gresearch_parquet_benchmarking/blob/main/src/pq_fb_ns_data_generator.cc


This is still a work in progress. I am open to collaboration and look
forward to your suggestions.

> On Thu, Aug 15, 2024, 5:02 AM Julien Le Dem <jul...@apache.org> wrote:
>
>> This came up in the sync today.
>>
>> There are a few concurrent experiments with flatbuffers for a future
>> Parquet footer replacement. In itself it is fine and just wanted to
>> reconnect the threads here so that folks are aware of each other and can
>> share findings.
>>
>> - Neelaksh benchmarking and experiments:
>>
>> https://medium.com/@neelaksh-singh/benchmarking-apache-parquet-my-mid-program-journey-as-an-mlh-fellow-bc0b8332c3b1
>> https://github.com/Neelaksh-Singh/gresearch_parquet_benchmarking
>>
>> - Alkis has also been experimenting and led the proposal for enabling
>> extending the existing footer.
>>
>> https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit#heading=h.15ohoov5qqm6
>>
>> - Xuwei also shared that he is looking into this.
>>
>> I would suggest that you all reply to this thread sharing your current
>> progress or ideas and a link to your respective repos for experimenting.
>>
>> Best
>> Julien
>>
>
