Hi,
One thing that occurs to me from reading a number of the proposals /
investigations in this space is how similar many of them end up looking
to Apache Arrow Feather / IPC. Parquet does have a narrower type system
and broader encoding and statistics support, but as far as the
underlying file structure is concerned, feather looks a lot like many of
the parquet v3 proposals.
Despite this, feather has not seen especially broad adoption as far as I
am aware, and I therefore wonder if there might be relevant learnings
here. I also wonder whether, if feather added better encoding
support, we would get something that met all the requirements without
needing to effectively create a third file format?
To be clear, I have no particular affection for feather; in fact I find
the way it handles dictionaries especially distasteful. But I
wonder if there might be mileage in building off an existing format
instead of defining another flatbuffer-based tabular encoding...
Kind Regards,
Raphael
On 15/08/2024 13:41, Jan Finis wrote:
I guess most closed-source implementations have done these optimizations
already; it has just not been done in the open source versions. E.g., we
switched to a custom-built thrift runtime using pool allocators and string
views instead of copied strings a few years ago, seeing comparable
speed-ups. The C++ thrift library is just horribly inefficient.
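(A minimal, purely illustrative sketch of the idea Jan describes, not the
Databricks runtime itself: the names and layout below are my own. Parsing a
length-prefixed string field can either copy the bytes into a freshly
allocated owned string, which is what the stock C++ thrift library
effectively does per field, or hand back a borrowed view into the original
footer buffer, which skips the allocation entirely.)

```rust
// Illustrative only: owned-copy vs string-view deserialization of a
// length-prefixed string field from a metadata buffer.

// Owned variant: allocates and copies for every string field parsed.
fn read_string_owned(buf: &[u8], len: usize) -> String {
    String::from_utf8(buf[..len].to_vec()).unwrap()
}

// View variant: zero-copy. The returned &str borrows from `buf`, so the
// footer bytes must outlive the parsed metadata that references them --
// which is why this pairs naturally with a pool/arena holding the buffer.
fn read_string_view(buf: &[u8], len: usize) -> &str {
    std::str::from_utf8(&buf[..len]).unwrap()
}

fn main() {
    let footer = b"column_a rest-of-footer...";
    let owned = read_string_owned(footer, 8);
    let view = read_string_view(footer, 8);
    assert_eq!(owned, "column_a");
    assert_eq!(view, "column_a");
    println!("ok");
}
```

The trade-off is lifetime management: views tie the parsed metadata to the
buffer, which is cheap when the whole footer is kept alive anyway.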
I agree with Alkis, though, that while there are some gains that can be
achieved by optimizing, the format has inherent drawbacks. Flatbuffers is
indeed more efficient to parse, but at the cost of increased size.
Alkis, can you elaborate how you brought the size of Flatbuffers down?
Cheers,
Jan
Am Do., 15. Aug. 2024 um 13:50 Uhr schrieb Andrew Lamb <
andrewlam...@gmail.com>:
I don't disagree that flatbuffers would be faster than thrift decoding.
I am trying to say that with software engineering alone (no change to the
format) it is likely possible to increase parquet thrift metadata parsing
speed by 4x.
This is not 25x of course, but 4x is non-trivial.
The fact that no one has yet bothered to invest the time to get that 4x
in open source implementations of parquet suggests to me that the parsing
time may not be as critical an issue as we think.
Andrew
On Thu, Aug 15, 2024 at 6:50 AM Alkis Evlogimenos
<alkis.evlogime...@databricks.com.invalid> wrote:
The difference in parsing speed between thrift and flatbuffers is >25x.
Thrift has some fundamental design decisions that make decoding slow:
1. The thrift compact protocol is very data dependent: ULEB encoding for
integers, and field ids that are deltas from the previous field. These
data dependencies prevent pipelining on modern CPUs.
2. The object model does not have a way to use arenas to avoid many
allocations of objects.
If we keep thrift, we can potentially get 2 fixed, but fixing 1 requires
changes to the thrift serialization protocol. Such a change is not
different from switching serialization formats.
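(To make point 1 concrete, here is a minimal ULEB128/varint decoder of the
kind the thrift compact protocol relies on; this is a generic sketch, not
code from any thrift implementation. The loop cannot know where the value
ends until it has inspected each byte's continuation bit, so every
iteration depends on the previous byte read, which is exactly the
loop-carried data dependency that defeats CPU pipelining.)

```rust
// Generic ULEB128 decode: low 7 bits of each byte carry payload, the high
// bit says "another byte follows". Returns (value, bytes consumed).
fn decode_uleb128(buf: &[u8]) -> Option<(u64, usize)> {
    let mut result: u64 = 0;
    let mut shift = 0;
    for (i, &byte) in buf.iter().enumerate() {
        result |= ((byte & 0x7f) as u64) << shift;
        if byte & 0x80 == 0 {
            // Only here do we learn the value is complete -- the branch
            // depends on data loaded this iteration.
            return Some((result, i + 1));
        }
        shift += 7;
        if shift >= 64 {
            return None; // malformed: too many continuation bytes
        }
    }
    None // buffer ended mid-varint
}

fn main() {
    // 300 = 0b1_0010_1100 encodes as [0xAC, 0x02].
    assert_eq!(decode_uleb128(&[0xAC, 0x02]), Some((300, 2)));
    // A single byte below 0x80 encodes itself.
    assert_eq!(decode_uleb128(&[0x05]), Some((5, 1)));
    println!("ok");
}
```

A fixed-width encoding (as flatbuffers uses for offsets) lets the CPU load
and decode several fields independently, which is where much of the
parsing-speed gap comes from.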
On Thu, Aug 15, 2024 at 12:30 PM Andrew Lamb <andrewlam...@gmail.com>
wrote:
I wanted to share some work Xiangpeng Hao did at InfluxData this summer
on the current (thrift) metadata format [1].
We found that with careful software engineering, we could likely improve
the speed of reading the existing parquet footer format by a factor of 4
or more ([2] contains some specific ideas). While we analyzed the
Rust implementation, I believe a similar conclusion applies to C/C++.
I realize that there are certain features that switching to an entirely
new footer format would achieve, but the cost of adopting a new format
across the ecosystem is immense (e.g. Parquet "version 2.0", etc.).
It is my opinion that investing the same effort in software optimization
that would be required for a new footer format would have a much bigger
impact.
Andrew
[1]: https://www.influxdata.com/blog/how-good-parquet-wide-tables/
[2]: https://github.com/apache/arrow-rs/issues/5853
On Thu, Aug 15, 2024 at 4:26 AM Alkis Evlogimenos
<alkis.evlogime...@databricks.com.invalid> wrote:
Hi Julien.
Thank you for reconnecting the threads.
I have broken down my experiments in a narrative, commit by commit, on
how we can go from flatbuffers being ~2x larger than thrift to being
smaller than (and at times even half) the size of thrift. This is still
on an internal branch; I will resume work towards the end of this month
to port it to arrow so that folks can look at the progress and share
ideas.
On the benchmarking front, I need to build and share a binary for third
parties to donate their footers for analysis.
The PR for parquet extensions has gotten a few rounds of reviews:
https://github.com/apache/parquet-format/pull/254. I hope it will be
merged soon.
I missed the sync yesterday - for some reason I didn't receive an
invitation. Julien, could you add me again to the invite list?
On Thu, Aug 15, 2024 at 1:32 AM Julien Le Dem <jul...@apache.org>
wrote:
This came up in the sync today.
There are a few concurrent experiments with flatbuffers for a future
Parquet footer replacement. In itself this is fine; I just wanted to
reconnect the threads here so that folks are aware of each other and can
share findings.
- Neelaksh benchmarking and experiments:
https://medium.com/@neelaksh-singh/benchmarking-apache-parquet-my-mid-program-journey-as-an-mlh-fellow-bc0b8332c3b1
https://github.com/Neelaksh-Singh/gresearch_parquet_benchmarking
- Alkis has also been experimenting and led the proposal for enabling
extension of the existing footer:
https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit#heading=h.15ohoov5qqm6
- Xuwei also shared that he is looking into this.
I would suggest that you all reply to this thread sharing your current
progress or ideas and a link to your respective repos for experimenting.
Best
Julien