As an Apache Arrow developer myself, I find the analogy with Feather to be flawed.

Feather (or, rather, Arrow IPC) is a vastly different format from Parquet. It is useful primarily for transport (e.g. over the network, using Arrow Flight RPC for example) or short-term storage (a spillover cache of temporary data, perhaps). For more persistent storage, Parquet has a lot of qualities that are unmatched by Arrow IPC, primarily in the size optimization department. Arrow IPC is really an off-RAM materialization of the Arrow in-memory format, with a little additional metadata to describe the schema.

But unlike Parquet vs. Arrow IPC, I don't think Thrift has many qualities that would make it better than Flatbuffers for Parquet metadata. The main concern that may be useful to watch for is the binary size of the encoded metadata (and whether that can be alleviated using a simple and fast compression algorithm, e.g. LZ4).

Regards

Antoine.

On Thu, 15 Aug 2024 15:27:13 +0100 Raphael Taylor-Davies <r.taylordav...@googlemail.com.INVALID> wrote:

> Right, I'm not disputing that parquet has a lot of additional functionality that people could leverage compared to feather. However, many of the use cases that have been articulated to justify switching parquet to flatbuffers don't benefit from many of these features. For example, ML workloads often exhibit some combination of random access, fixed-length lists, no statistics or indexes, and sourcing data from fast NVMe storage that is bottlenecked by more expensive encodings. Similarly, many others are making use of indexes maintained outside the files, obviating the need for indexes and statistics within the files themselves.
>
> As I stated, feather is not without flaws, but it seems wise to assess which is closer to the desired feature set, rather than dismissing one out of hand because its total feature set is smaller. This does of course entail articulating what this feature set actually is, which may actually be the core of my question...
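[Editor's aside: Antoine's suggestion above, that metadata size growth could be offset with a fast compression codec such as LZ4, is easy to sanity-check. A minimal sketch follows; it uses `zlib` from the Python standard library at its fastest level as a stand-in for LZ4 (an assumption, since LZ4 needs a third-party package), and the metadata bytes are fabricated to mimic the repetitive column-chunk entries of a real footer.]

```python
import zlib

# Fabricated stand-in for an encoded footer: column chunk entries tend to
# repeat column paths, encoding names, and codec names across row groups.
fake_metadata = b"".join(
    b"col_%d/chunk_%d:encoding=PLAIN;codec=ZSTD;" % (c, rg)
    for rg in range(100)
    for c in range(20)
)

# Level 1 = fastest setting, approximating the role an LZ4-class codec
# would play on the serialized metadata blob.
compressed = zlib.compress(fake_metadata, level=1)
ratio = len(compressed) / len(fake_metadata)
print(f"{len(fake_metadata)} -> {len(compressed)} bytes ({ratio:.1%})")
```

Because the same paths and enum values recur per row group, such blobs compress very well; the point of a fast codec here is that decompression cost stays small next to the parse itself.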
> > On 15 August 2024 14:50:08 BST, Jan Finis <jpfi...@gmail.com> wrote:
> >
> > (I don't want to come across as not liking Arrow; quite the opposite is the case: it's amazing for IPC and we also use it for that. And I do think we can and should learn from Arrow, and there are a lot of things that would make sense to adopt or at least to get closer to, e.g., its type system.)
> >
> > On Thu, 15 Aug 2024 at 15:46, Jan Finis <jpfi...@gmail.com> wrote:
> >
> >>> I wonder if there might be mileage in building off an existing format instead of defining another flatbuffer-based tabular encoding
> >>
> >> We *are* building off an existing format. Parquet will stay as it is, its logical structure and everything. The encoding used for this logical structure's metadata is just a detail. In contrast, Arrow (Feather) is missing so many things that Parquet has, so you would need a lot more work to get to where Parquet is, if you started off with Arrow.
> >>
> >>> I also wonder if, for example, feather added better encoding support, we would get something that met all the requirements, without needing to effectively create a third file format?
> >>
> >> I would argue that feather is faaaaar farther away from being a good storage format than Parquet is. Yes, you could add a lot of stuff to feather to make it a great storage format, but it would be way more than you would need to add to Parquet to make it as good. And you would also be creating de facto a third format if you changed that much. That's also why Feather has little adoption: it's not a good storage format as of now, as it's based on Arrow and Arrow isn't designed for that.
> >>
> >>> One thing that occurs to me from reading a number of the proposals / investigations in this space is how similar many of them end up looking to Apache Arrow Feather / IPC.
> >>
> >> I would argue that Parquet and Arrow are actually not that similar. They are similar, but only "trivially so" (i.e., in ways in which any storage format would end up at a roughly similar design):
> >>
> >> Obvious similarities:
> >> * Most modern formats are chunked columnar, as this is just obviously a good idea for many scenarios: you can choose to skip columns, and chunking lets you write without having to buffer the whole file first. In that regard Parquet and Arrow are similar, but so are ORC and others.
> >> * Formats have some kind of metadata structure describing the file. In that regard, Parquet and Arrow and all others are similar, but every format needs this, so the similarity is trivial.
> >> * Both have encodings. Again, any format needs this. The fact that there is overlap in the encodings is just because some encodings are an obviously good idea (e.g., dictionary).
> >> * Both have a footer. Any format that wants to be writable in a streaming fashion needs this: you don't yet know the metadata of the file when you start writing, so you cannot use a header, and putting the metadata somewhere in the middle would make it hard to discover.
> >>
> >> Now, the metadata structure needs to be encoded somehow. For this, modern formats use encoding standards like Thrift or Flatbuffers. But I would argue that the actual encoding of the metadata is *not* what defines a format, at all; it is just a means to encode the logical structure, and while this structure has some resemblance between Arrow and Parquet, there are actually a lot of differences. Each format somehow has to do it, and just because two formats may decide to use the same encoding (e.g., flatbuffers), that doesn't make them very similar.
> >>
> >> But here is the greatest dissimilarity:
> >> Arrow was made to be an IPC streaming format.
> >> As such, it needs to be very fast to read and write. Then the python folks came and invented feather to make the whole thing also work somewhat as a storage format, but all they did was basically add a footer; the format itself (i.e., the encodings) is not at all optimized for storage. A storage format may (and should) use more heavyweight encodings, and it is okay if it is more costly to write, as you expect it to be read multiple times, while an IPC format is usually only read once.
> >>
> >> Could Arrow become more of a storage format by adding more heavyweight encodings, statistics, and indexes? It sure could! And maybe it will, but that's not what it was designed for, so I doubt that such things will be a high priority for Arrow. But I wouldn't - at least as of now - want to base a storage format on Arrow. There is just too much missing: a lot of things that Parquet has and Arrow hasn't (all the statistics, bloom filters, indexes, etc.). It may look like Parquet is becoming Arrow if it adopts flatbuffers as a metadata format, but I would argue that is not at all the case. The metadata encoding is just a tiny detail, and a lot of other things are what makes Parquet Parquet.
> >>
> >> Cheers,
> >> Jan
> >>
> >> On Thu, 15 Aug 2024 at 15:20, Raphael Taylor-Davies <r.taylordav...@googlemail.com.invalid> wrote:
> >>
> >>> Hi,
> >>>
> >>> One thing that occurs to me from reading a number of the proposals / investigations in this space is how similar many of them end up looking to Apache Arrow Feather / IPC. Parquet does have a narrower type system and broader encoding and statistics support, but as far as the underlying file structure is concerned, feather looks a lot like many of the parquet v3 proposals.
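[Editor's aside: Jan's write-once/read-many argument above is the classic encoding trade-off, and dictionary encoding (named earlier in the thread as one encoding both formats share) is the simplest illustration: extra work and indirection at write time in exchange for a much smaller stored representation. A toy sketch, not Parquet's or Arrow's actual wire format:]

```python
def dict_encode(values):
    """Toy dictionary encoding: map each distinct value to a small integer.

    Write-time cost goes up (hash lookups, building the dictionary), but a
    repetitive column shrinks to a short dictionary plus integer indices.
    """
    dictionary = {}
    indices = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        indices.append(dictionary[v])
    return list(dictionary), indices  # insertion order == index order

def dict_decode(dictionary, indices):
    """Read-time decode is a cheap table lookup per value."""
    return [dictionary[i] for i in indices]

column = ["frankfurt", "berlin", "frankfurt", "frankfurt", "berlin"] * 1000
dictionary, indices = dict_encode(column)
assert dict_decode(dictionary, indices) == column
print(len(dictionary), len(indices))  # 2 distinct values, 5000 indices
```

An IPC format tuned for one-shot reads can reasonably skip such steps to keep writes cheap; a storage format read many times usually cannot afford to.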
> >>>
> >>> Despite this, feather has not seen especially broad adoption as far as I am aware, and I therefore wonder if there might be relevant learnings here? I also wonder whether, if for example feather added better encoding support, we would get something that met all the requirements, without needing to effectively create a third file format?
> >>>
> >>> To be clear, I have no particular affection for feather; in fact I find the way it handles dictionaries to be especially distasteful. But I wonder if there might be mileage in building off an existing format instead of defining another flatbuffer-based tabular encoding...
> >>>
> >>> Kind Regards,
> >>>
> >>> Raphael
> >>>
> >>> On 15/08/2024 13:41, Jan Finis wrote:
> >>> > I guess most closed-source implementations have done these optimizations already; it has just not been done in the open source versions. E.g., we switched to a custom-built thrift runtime using pool allocators and string views instead of copied strings a few years ago, seeing comparable speed-ups. The C++ thrift library is just horribly inefficient.
> >>> >
> >>> > I agree with Alkis, though: there are some gains that can be achieved by optimizing, but the format has inherent drawbacks. Flatbuffers is indeed more efficient, but at the cost of increased size. Alkis, can you elaborate on how you brought the size of Flatbuffers down?
> >>> >
> >>> > Cheers,
> >>> > Jan
> >>> >
> >>> > On Thu, 15 Aug 2024 at 13:50, Andrew Lamb <andrewlam...@gmail.com> wrote:
> >>> >
> >>> >> I don't disagree that flatbuffers would be faster than thrift decoding.
> >>> >>
> >>> >> I am trying to say that with software engineering only (no change to the format) it is likely possible to increase parquet thrift metadata parsing speed by 4x.
> >>> >>
> >>> >> This is not 25x of course, but 4x is non-trivial.
> >>> >>
> >>> >> The fact that no one has yet bothered to invest the time to get the 4x in open source implementations of parquet suggests to me that the parsing time may not be as critical an issue as we think.
> >>> >>
> >>> >> Andrew
> >>> >>
> >>> >> On Thu, Aug 15, 2024 at 6:50 AM Alkis Evlogimenos <alkis.evlogime...@databricks.com.invalid> wrote:
> >>> >>
> >>> >>> The difference in parsing speed between thrift and flatbuffer is >25x. Thrift has some fundamental design decisions that make decoding slow:
> >>> >>> 1. The thrift compact protocol is very data dependent: uleb encoding for integers, and field ids that are deltas from the previous field. The data dependencies disallow pipelining on modern CPUs.
> >>> >>> 2. The object model does not have a way to use arenas to avoid many allocations of objects.
> >>> >>> If we keep thrift, we can potentially get 2 fixed, but fixing 1 requires changes to the thrift serialization protocol. Such a change is no different from switching serialization format.
> >>> >>>
> >>> >>> On Thu, Aug 15, 2024 at 12:30 PM Andrew Lamb <andrewlam...@gmail.com> wrote:
> >>> >>>
> >>> >>>> I wanted to share some work Xiangpeng Hao did at InfluxData this summer on the current (thrift) metadata format [1].
> >>> >>>>
> >>> >>>> We found that with careful software engineering, we could likely improve the speed of reading the existing parquet footer format by a factor of 4 or more ([2] contains some specific ideas). While we analyzed the Rust implementation, I believe a similar conclusion applies to C/C++.
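[Editor's aside: Alkis's first point, that Thrift compact protocol decoding is serially data dependent, comes from its varint (ULEB128) integers: the decoder cannot know where one value ends and the next begins until it has inspected each byte's continuation bit in turn. A minimal decoder makes the dependency chain visible:]

```python
def decode_uleb128(buf, pos):
    """Decode one unsigned LEB128 integer starting at buf[pos].

    Each byte carries 7 payload bits; the high bit says whether another
    byte follows. The start offset of the *next* value depends on every
    byte examined so far, which is what defeats CPU pipelining across
    fields: there are no fixed offsets to jump to.
    """
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):  # continuation bit clear: value complete
            return result, pos
        shift += 7

buf = bytes([0xE5, 0x8E, 0x26, 0x2A])  # 624485 (3 bytes), then 42 (1 byte)
value, pos = decode_uleb128(buf, 0)
print(value, pos)  # 624485 3
value, pos = decode_uleb128(buf, pos)
print(value, pos)  # 42 4
```

A Flatbuffers-style layout avoids this by storing fixed-width offsets, so the position of a field never depends on decoding the fields before it.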
> >>> >>>>
> >>> >>>> I realize that there are certain features that switching to an entirely new footer format would achieve, but the cost of adopting a new format across the ecosystem is immense (e.g. Parquet "version 2.0", etc.).
> >>> >>>>
> >>> >>>> It is my opinion that investing in software optimization the same effort that would be required for a new footer format would have a much bigger impact.
> >>> >>>>
> >>> >>>> Andrew
> >>> >>>>
> >>> >>>> [1]: https://www.influxdata.com/blog/how-good-parquet-wide-tables/
> >>> >>>> [2]: https://github.com/apache/arrow-rs/issues/5853
> >>> >>>>
> >>> >>>> On Thu, Aug 15, 2024 at 4:26 AM Alkis Evlogimenos <alkis.evlogime...@databricks.com.invalid> wrote:
> >>> >>>>
> >>> >>>>> Hi Julien,
> >>> >>>>>
> >>> >>>>> Thank you for reconnecting the threads.
> >>> >>>>>
> >>> >>>>> I have broken down my experiments into a narrative, commit by commit, on how we can go from flatbuffers being ~2x larger than thrift to being smaller than (and at times even half) the size of thrift. This is still on an internal branch; I will resume work towards the end of this month to port it to arrow so that folks can look at the progress and share ideas.
> >>> >>>>>
> >>> >>>>> On the benchmarking front, I need to build and share a binary for third parties to donate their footers for analysis.
> >>> >>>>>
> >>> >>>>> The PR for parquet extensions has gotten a few rounds of reviews: https://github.com/apache/parquet-format/pull/254. I hope it will be merged soon.
> >>> >>>>>
> >>> >>>>> I missed the sync yesterday - for some reason I didn't receive an invitation. Julien, could you add me again to the invite list?
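[Editor's aside: for context on what any replacement footer has to coexist with, a Parquet file ends with the Thrift-serialized FileMetaData, followed by a 4-byte little-endian length of that blob, followed by the magic bytes `PAR1` (which also open the file). Locating the metadata is a small fixed-offset parse; in this sketch the metadata bytes themselves are a placeholder, not real Thrift:]

```python
import struct

MAGIC = b"PAR1"

def locate_footer_metadata(file_bytes):
    """Return the serialized FileMetaData bytes from a Parquet file.

    File tail layout: [metadata][4-byte LE metadata length][b"PAR1"].
    """
    if file_bytes[-4:] != MAGIC:
        raise ValueError("not a Parquet file")
    (meta_len,) = struct.unpack("<I", file_bytes[-8:-4])
    return file_bytes[-8 - meta_len : -8]

# Assemble a fake file around placeholder metadata bytes.
serialized_metadata = b"\x15\x02\x19\x3c"  # placeholder, not real Thrift
tail = (
    MAGIC                                      # leading magic
    + b"...column chunk data..."               # stand-in for the data pages
    + serialized_metadata
    + struct.pack("<I", len(serialized_metadata))
    + MAGIC                                    # trailing magic
)
assert locate_footer_metadata(tail) == serialized_metadata
```

Everything after this point, decoding the blob those bytes contain, is where the Thrift-vs-Flatbuffers question in this thread actually lives.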
> >>> >>>>>
> >>> >>>>> On Thu, Aug 15, 2024 at 1:32 AM Julien Le Dem <jul...@apache.org> wrote:
> >>> >>>>>
> >>> >>>>>> This came up in the sync today.
> >>> >>>>>>
> >>> >>>>>> There are a few concurrent experiments with flatbuffers for a future Parquet footer replacement. In itself that is fine; I just wanted to reconnect the threads here so that folks are aware of each other and can share findings.
> >>> >>>>>>
> >>> >>>>>> - Neelaksh's benchmarking and experiments:
> >>> >>>>>> https://medium.com/@neelaksh-singh/benchmarking-apache-parquet-my-mid-program-journey-as-an-mlh-fellow-bc0b8332c3b1
> >>> >>>>>> https://github.com/Neelaksh-Singh/gresearch_parquet_benchmarking
> >>> >>>>>>
> >>> >>>>>> - Alkis has also been experimenting and led the proposal for enabling extension of the existing footer:
> >>> >>>>>> https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit#heading=h.15ohoov5qqm6
> >>> >>>>>>
> >>> >>>>>> - Xuwei also shared that he is looking into this.
> >>> >>>>>>
> >>> >>>>>> I would suggest that you all reply to this thread sharing your current progress or ideas and a link to your respective repos for experimenting.
> >>> >>>>>>
> >>> >>>>>> Best,
> >>> >>>>>> Julien
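[Editor's aside: the recurring contrast in this thread, Thrift's sequential decode versus Flatbuffers' offset-based access, can be shown without either library. The two layouts below are toys of my own construction, not either project's real wire format: one stores fixed-width fields at known offsets (so any field is one jump away), the other stores length-prefixed variable records (so reaching the nth record requires walking all earlier ones, as with varint-delimited Thrift fields):]

```python
import struct

# Toy "flatbuffer-like" layout: 8 fixed-width u32 fields at known offsets.
fixed = struct.pack("<8I", *range(8))

def read_fixed(buf, n):
    """Random access: one computed offset, no scanning."""
    return struct.unpack_from("<I", buf, 4 * n)[0]

# Toy "thrift-like" layout: length-prefixed variable-size records.
records = [b"x" * n for n in (3, 1, 4, 1, 5)]
prefixed = b"".join(struct.pack("<I", len(r)) + r for r in records)

def read_prefixed(buf, n):
    """Sequential access: every earlier record must be decoded first,
    because each record's start depends on all preceding lengths."""
    pos = 0
    for _ in range(n):
        (length,) = struct.unpack_from("<I", buf, pos)
        pos += 4 + length
    (length,) = struct.unpack_from("<I", buf, pos)
    return buf[pos + 4 : pos + 4 + length]

print(read_fixed(fixed, 5))        # 5
print(read_prefixed(prefixed, 2))  # b'xxxx'
```

The trade-off Jan and Alkis describe falls out of this: offset tables buy O(1) field access and pipeline-friendly decoding at the cost of extra bytes, which is exactly the size concern Alkis's experiments and Antoine's compression suggestion are addressing.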