(I don't want to come across as not liking Arrow; quite the opposite is the case: it's amazing for IPC and we also use it for that. And I do think we can and should learn from Arrow; there are a lot of things that would make sense to adopt, or at least to get closer to, e.g., its type system.)
On Thu, Aug 15, 2024 at 3:46 PM Jan Finis <jpfi...@gmail.com> wrote:

>> I wonder if there might be mileage in building off an existing format
>> instead of defining another flatbuffer-based tabular encoding
>
> We *are* building off an existing format. Parquet will stay as it is, its logical structure and everything. The encoding used for this logical structure's metadata is just a detail. In contrast, Arrow (Feather) is missing so many things that Parquet has, so you would need a lot more work to get to where Parquet is if you started off with Arrow.
>
>> I also wonder whether, if feather for example added better encoding
>> support, we would get something that met all the requirements, without
>> needing to effectively create a third file format?
>
> I would argue that feather is far, far farther away from being a good storage format than Parquet is. Yes, you could add a lot of stuff to feather to make it a great storage format, but it would be way more than you would need to add to Parquet to make it as good. And you would also, de facto, be creating a third format if you changed that much. That's also why Feather has little adoption: it's not a good storage format as of now, as it's based on Arrow, and Arrow isn't designed for that.
>
>> One thing that occurs to me from reading a number of the proposals /
>> investigations in this space is how similar many of them end up looking
>> to Apache Arrow Feather / IPC.
>
> I would argue that Parquet and Arrow are actually not that similar. They are similar, but only "trivially so" (i.e., in ways in which every storage format would end up at a roughly similar design).
>
> Obvious similarities:
> * Most modern formats are chunked columnar, as this is just obviously a good idea for many scenarios: you can choose to skip columns, and chunking lets you write without having to buffer the whole file first. In that regard Parquet and Arrow are similar, but so are ORC and others.
> * Formats have some kind of metadata structure describing the file. In that regard Parquet, Arrow, and all others are similar, but every format needs this, so the similarity is trivial.
> * Both have encodings. Again, any format needs these. The overlap in the encodings themselves is just because some encodings are an obviously good idea (e.g., dictionary encoding).
> * Both have a footer. Any format that wants to be writable in a streaming fashion needs one: you don't know the metadata of the file yet when you start writing, so you cannot use a header, and putting the metadata somewhere in the middle would make it hard to discover. (A sketch of this follows below.)
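To make the footer point concrete, here is a minimal sketch in Rust of a single-pass write. The ChunkMeta struct and the footer payload are invented for illustration; only the trailing footer-length-plus-magic convention mirrors what Parquet actually does.

use std::io::{self, Write};

const MAGIC: &[u8; 4] = b"PAR1"; // Parquet-style magic bytes

// Hypothetical, simplified stand-in for per-chunk metadata.
struct ChunkMeta {
    offset: u64, // where the chunk starts in the file
    len: u64,    // its length in bytes
}

fn write_file<W: Write>(mut out: W, chunks: &[&[u8]]) -> io::Result<()> {
    out.write_all(MAGIC)?;
    let mut pos = MAGIC.len() as u64;
    let mut metas = Vec::new();

    // Stream the data out without buffering the whole file; only the
    // small per-chunk metadata has to be remembered.
    for chunk in chunks {
        out.write_all(chunk)?;
        metas.push(ChunkMeta { offset: pos, len: chunk.len() as u64 });
        pos += chunk.len() as u64;
    }

    // Only now is everything known that the footer must describe. A
    // header would have forced buffering or seeking back; a trailing
    // footer needs neither.
    let mut footer = Vec::new();
    for m in &metas {
        footer.extend_from_slice(&m.offset.to_le_bytes());
        footer.extend_from_slice(&m.len.to_le_bytes());
    }
    out.write_all(&footer)?;
    out.write_all(&(footer.len() as u32).to_le_bytes())?; // footer length
    out.write_all(MAGIC)?; // readers locate the footer from the file end
    Ok(())
}

fn main() -> io::Result<()> {
    let chunks: [&[u8]; 2] = [b"column chunk 1", b"column chunk 2"];
    write_file(io::sink(), &chunks)
}

A reader does the inverse: read the last 8 bytes, check the magic, take the footer length, and fetch exactly the footer, which is also why footer decode speed matters so much for wide tables.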
> Now, the metadata structure needs to be encoded somehow. For this, modern formats use encoding standards like thrift or flatbuffers. But I would argue that the actual encoding of the metadata is *not* what defines a format at all; it is just a means to encode the logical structure, and while this structure has some resemblance between Arrow and Parquet, there are actually a lot of differences. Every format has to do this somehow, and just because two formats decide to use the same encoding (e.g., flatbuffers), that doesn't make them very similar.
>
> But here is the greatest dissimilarity: Arrow was made to be an IPC streaming format. As such, it needs to be very fast to read and write. Then the Python guys came and invented feather to make the whole thing also work somewhat as a storage format, but all they did was basically add a footer; the format itself (i.e., the encodings) is not at all optimized for a storage format. A storage format may, and should, use more heavyweight encodings, and it is okay if it is more costly to write: you expect it to be read multiple times, while an IPC stream is usually only read once.
>
> Could Arrow become more of a storage format by adding more heavyweight encodings, statistics, and indexes? It sure could! And maybe it will, but that's not what it was designed for, so I doubt such things will be a high priority for Arrow. But I wouldn't - at least as of now - want to base a storage format on Arrow. There is just too much missing: a lot of things that Parquet has, Arrow hasn't (all the statistics, bloom filters, indexes, etc.). It may look like Parquet is becoming Arrow if it adopts flatbuffers as a metadata format, but I would argue that is not at all the case. The metadata encoding is just a tiny detail, and a lot of other things are what makes Parquet Parquet.
>
> Cheers,
> Jan
>
> On Thu, Aug 15, 2024 at 3:20 PM Raphael Taylor-Davies <r.taylordav...@googlemail.com.invalid> wrote:
>
>> Hi,
>>
>> One thing that occurs to me from reading a number of the proposals / investigations in this space is how similar many of them end up looking to Apache Arrow Feather / IPC. Parquet does have a narrower type system and broader encoding and statistics support, but as far as the underlying file structure is concerned, feather looks a lot like many of the parquet v3 proposals.
>>
>> Despite this, feather has not seen especially broad adoption as far as I am aware, and I therefore wonder if there might be relevant learnings here. I also wonder whether, if feather for example added better encoding support, we would get something that met all the requirements, without needing to effectively create a third file format?
>>
>> To be clear, I have no particular affection for feather - in fact I find the way it handles dictionaries especially distasteful - but I wonder if there might be mileage in building off an existing format instead of defining another flatbuffer-based tabular encoding...
>>
>> Kind Regards,
>>
>> Raphael
>>
>> On 15/08/2024 13:41, Jan Finis wrote:
>>> I guess most closed-source implementations have done these optimizations already; they have just not been done in the open-source versions. E.g., we switched to a custom-built thrift runtime using pool allocators and string views instead of copied strings a few years ago, seeing comparable speed-ups. The C++ thrift library is just horribly inefficient.
>>>
>>> I agree with Alkis, though, that while some gains can be achieved by optimizing, the format has inherent drawbacks. Flatbuffers is indeed more efficient, but at the cost of increased size.
>>> Alkis, can you elaborate on how you brought the size of Flatbuffers down?
>>>
>>> Cheers,
>>> Jan
>>>
>>> On Thu, Aug 15, 2024 at 1:50 PM Andrew Lamb <andrewlam...@gmail.com> wrote:
>>>
>>>> I don't disagree that flatbuffers would be faster to decode than thrift.
>>>>
>>>> I am trying to say that with software engineering alone (no change to the format) it is likely possible to increase parquet thrift metadata parsing speed by 4x.
>>>> This is not 25x, of course, but 4x is non-trivial.
>>>>
>>>> The fact that no one has yet bothered to invest the time to get that 4x in open-source implementations of parquet suggests to me that the parsing time may not be as critical an issue as we think.
>>>>
>>>> Andrew
>>>>
>>>> On Thu, Aug 15, 2024 at 6:50 AM Alkis Evlogimenos <alkis.evlogime...@databricks.com.invalid> wrote:
>>>>
>>>>> The difference in parsing speed between thrift and flatbuffer is >25x. Thrift has some fundamental design decisions that make decoding slow:
>>>>> 1. The thrift compact protocol is very data-dependent: uleb encoding for integers, and field ids that are deltas from the previous field id. The data dependencies prevent pipelining on modern CPUs (see the sketch further down this thread).
>>>>> 2. The object model has no way to use arenas to avoid many small object allocations (sketched at the end of the thread).
>>>>> If we keep thrift, we can potentially get 2 fixed, but fixing 1 requires changes to the thrift serialization protocol. Such a change is no different from switching the serialization format.
>>>>>
>>>>> On Thu, Aug 15, 2024 at 12:30 PM Andrew Lamb <andrewlam...@gmail.com> wrote:
>>>>>
>>>>>> I wanted to share some work Xiangpeng Hao did at InfluxData this summer on the current (thrift) metadata format[1].
>>>>>>
>>>>>> We found that with careful software engineering, we could likely improve the speed of reading the existing parquet footer format by a factor of 4 or more ([2] contains some specific ideas). While we analyzed the Rust implementation, I believe a similar conclusion applies to C/C++.
>>>>>>
>>>>>> I realize that there are certain features that switching to an entirely new footer format would achieve, but the cost of adopting a new format across the ecosystem is immense (e.g., Parquet "version 2.0", etc.).
>>>>>>
>>>>>> It is my opinion that investing in software optimization the same effort that a new footer format would require would have a much bigger impact.
>>>>>>
>>>>>> Andrew
>>>>>>
>>>>>> [1]: https://www.influxdata.com/blog/how-good-parquet-wide-tables/
>>>>>> [2]: https://github.com/apache/arrow-rs/issues/5853
>>>>>>
>>>>>> On Thu, Aug 15, 2024 at 4:26 AM Alkis Evlogimenos <alkis.evlogime...@databricks.com.invalid> wrote:
>>>>>>
>>>>>>> Hi Julien,
>>>>>>>
>>>>>>> Thank you for reconnecting the threads.
>>>>>>>
>>>>>>> I have broken down my experiments into a narrative, commit by commit, on how we can go from flatbuffers being ~2x larger than thrift to being smaller than (and at times even half) the size of thrift. This is still on an internal branch; I will resume work towards the end of this month to port it to arrow so that folks can look at the progress and share ideas.
>>>>>>>
>>>>>>> On the benchmarking front, I need to build and share a binary so that third parties can donate their footers for analysis.
>>>>>>>
>>>>>>> The PR for parquet extensions has gotten a few rounds of reviews: https://github.com/apache/parquet-format/pull/254. I hope it will be merged soon.
>>>>>>>
>>>>>>> I missed the sync yesterday - for some reason I didn't receive an invitation. Julien, could you add me to the invite list again?
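The data-dependency problem from Alkis's point 1 above is easy to see in code. Here is a minimal uleb128 decoder - a sketch, not thrift's actual implementation - showing the loop-carried dependency: where value N+1 starts is only known after value N has been fully decoded, so the CPU cannot overlap the two, whereas flatbuffers reads fields at fixed offsets.

fn decode_uleb128(buf: &[u8], pos: &mut usize) -> u64 {
    let mut value = 0u64;
    let mut shift = 0;
    loop {
        let byte = buf[*pos];
        *pos += 1; // the next read position depends on this byte's high bit
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return value; // high bit clear: this was the last byte
        }
        shift += 7;
    }
}

fn main() {
    // 300 encodes as [0xAC, 0x02]; 1 encodes as [0x01].
    let buf: [u8; 3] = [0xAC, 0x02, 0x01];
    let mut pos = 0;
    // The second decode cannot even start until the first one has
    // established where it begins.
    assert_eq!(decode_uleb128(&buf, &mut pos), 300);
    assert_eq!(decode_uleb128(&buf, &mut pos), 1);
}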
>>>>>>> On Thu, Aug 15, 2024 at 1:32 AM Julien Le Dem <jul...@apache.org> wrote:
>>>>>>>
>>>>>>>> This came up in the sync today.
>>>>>>>>
>>>>>>>> There are a few concurrent experiments with flatbuffers for a future Parquet footer replacement. That is fine in itself; I just wanted to reconnect the threads here so that folks are aware of each other and can share findings.
>>>>>>>>
>>>>>>>> - Neelaksh's benchmarking and experiments:
>>>>>>>> https://medium.com/@neelaksh-singh/benchmarking-apache-parquet-my-mid-program-journey-as-an-mlh-fellow-bc0b8332c3b1
>>>>>>>> https://github.com/Neelaksh-Singh/gresearch_parquet_benchmarking
>>>>>>>>
>>>>>>>> - Alkis has also been experimenting and led the proposal for enabling extensions to the existing footer:
>>>>>>>> https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit#heading=h.15ohoov5qqm6
>>>>>>>>
>>>>>>>> - Xuwei also shared that he is looking into this.
>>>>>>>>
>>>>>>>> I would suggest that you all reply to this thread sharing your current progress or ideas and a link to your respective repos for experimenting.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Julien
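Finally, the arena / string-view idea behind Jan's custom thrift runtime and Alkis's point 2 can be sketched as follows. The types and the comma-separated "wire format" here are made up purely to show the borrowing pattern: the decoded structs hold slices into the single buffer the footer was read into, rather than an owned, separately allocated String per field, which is roughly what the stock thrift-generated object models give you.

// Hypothetical metadata struct; fields are views into the footer buffer.
struct ColumnMeta<'a> {
    path: &'a str,  // not a copy: borrows from the footer bytes
    codec: &'a str,
}

// Stand-in for real deserialization: pretend fields are comma-separated
// and records are newline-separated.
fn decode(footer: &[u8]) -> Vec<ColumnMeta<'_>> {
    let text = std::str::from_utf8(footer).expect("utf8 footer");
    text.lines()
        .map(|line| {
            let (path, codec) = line.split_once(',').expect("two fields");
            ColumnMeta { path, codec } // zero string copies here
        })
        .collect()
}

fn main() {
    let footer = b"a.b.c,SNAPPY\nx.y,ZSTD".to_vec(); // the one buffer
    let metas = decode(&footer);
    assert_eq!(metas[1].path, "x.y");
    assert_eq!(metas[1].codec, "ZSTD");
}

Whether thrift's generated object model could ever expose such a borrowing scheme is exactly the open question in Alkis's point 2.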