(I don't want to come across as not liking Arrow; quite the opposite is the case: it's amazing for IPC and we also use it for that. And I do think we can and should learn from Arrow; there are a lot of things that would make sense to adopt, or at least to get closer to, e.g., its type system.)
On Thu, Aug 15, 2024 at 3:46 PM Jan Finis <jpfi...@gmail.com> wrote:

>> I wonder if there might be mileage in building off an existing format
>> instead of defining another flatbuffer-based tabular encoding
>
> We *are* building off an existing format. Parquet will stay as it is, its logical structure and everything. The encoding used for this logical structure's metadata is just a detail. In contrast, Arrow (Feather) is missing so many things that Parquet has, so you would need a lot more work to get to where Parquet is if you started off with Arrow.
>
>> I also wonder whether, if feather for example added better encoding
>> support, we would get something that met all the requirements, without
>> needing to effectively create a third file format?
>
> I would argue that feather is far, far farther away from being a good storage format than Parquet is. Yes, you could add a lot of stuff to feather to make it a great storage format, but it would be way more than you would need to add to Parquet to make it as good. And you would also, de facto, be creating a third format if you changed that much. That's also why Feather has little adoption: it's not a good storage format as of now, as it's based on Arrow, and Arrow isn't designed for that.
>
>> One thing that occurs to me from reading a number of the proposals /
>> investigations in this space is how similar many of them end up looking
>> to Apache Arrow Feather / IPC.
>
> I would argue that Parquet and Arrow are actually not that similar. They are similar, but only "trivially so" (i.e., in ways in which every storage format would end up at a roughly similar design).
>
> Obvious similarities:
> * Most modern formats are chunked columnar, as this is just obviously a good idea for many scenarios: you can choose to skip columns, and chunking lets you write without having to buffer the whole file first. In that regard Parquet and Arrow are similar, but so are ORC and others.
> * Formats have some kind of metadata structure describing the file. In that regard Parquet, Arrow, and all others are similar, but every format needs this, so the similarity is trivial.
> * Both have encodings. Again, any format needs these. The overlap in the encodings themselves is just because some encodings are an obviously good idea (e.g., dictionary encoding).
> * Both have a footer. Any format that wants to be writable in a streaming fashion needs one: you don't know the metadata of the file yet when you start writing, so you cannot use a header, and putting the metadata somewhere in the middle would make it hard to discover. (A sketch of this follows below.)
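To make the footer point concrete, here is a minimal sketch in Rust of a single-pass write. The ChunkMeta struct and the footer payload are invented for illustration; only the trailing footer-length-plus-magic convention mirrors what Parquet actually does.

use std::io::{self, Write};

const MAGIC: &[u8; 4] = b"PAR1"; // Parquet-style magic bytes

// Hypothetical, simplified stand-in for per-chunk metadata.
struct ChunkMeta {
    offset: u64, // where the chunk starts in the file
    len: u64,    // its length in bytes
}

fn write_file<W: Write>(mut out: W, chunks: &[&[u8]]) -> io::Result<()> {
    out.write_all(MAGIC)?;
    let mut pos = MAGIC.len() as u64;
    let mut metas = Vec::new();

    // Stream the data out without buffering the whole file; only the
    // small per-chunk metadata has to be remembered.
    for chunk in chunks {
        out.write_all(chunk)?;
        metas.push(ChunkMeta { offset: pos, len: chunk.len() as u64 });
        pos += chunk.len() as u64;
    }

    // Only now is everything known that the footer must describe. A
    // header would have forced buffering or seeking back; a trailing
    // footer needs neither.
    let mut footer = Vec::new();
    for m in &metas {
        footer.extend_from_slice(&m.offset.to_le_bytes());
        footer.extend_from_slice(&m.len.to_le_bytes());
    }
    out.write_all(&footer)?;
    out.write_all(&(footer.len() as u32).to_le_bytes())?; // footer length
    out.write_all(MAGIC)?; // readers locate the footer from the file end
    Ok(())
}

fn main() -> io::Result<()> {
    let chunks: [&[u8]; 2] = [b"column chunk 1", b"column chunk 2"];
    write_file(io::sink(), &chunks)
}

A reader does the inverse: read the last 8 bytes, check the magic, take the footer length, and fetch exactly the footer, which is also why footer decode speed matters so much for wide tables.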
> Now, the metadata structure needs to be encoded somehow. For this, modern formats use encoding standards like thrift or flatbuffers. But I would argue that the actual encoding of the metadata is *not* what defines a format at all; it is just a means to encode the logical structure, and while this structure has some resemblance between Arrow and Parquet, there are actually a lot of differences. Every format has to do this somehow, and just because two formats decide to use the same encoding (e.g., flatbuffers), that doesn't make them very similar.
>
> But here is the greatest dissimilarity: Arrow was made to be an IPC streaming format. As such, it needs to be very fast to read and write. Then the Python guys came and invented feather to make the whole thing also work somewhat as a storage format, but all they did was basically add a footer; the format itself (i.e., the encodings) is not at all optimized for a storage format. A storage format may, and should, use more heavyweight encodings, and it is okay if it is more costly to write: you expect it to be read multiple times, while an IPC stream is usually only read once.
>
> Could Arrow become more of a storage format by adding more heavyweight encodings, statistics, and indexes? It sure could! And maybe it will, but that's not what it was designed for, so I doubt such things will be a high priority for Arrow. But I wouldn't - at least as of now - want to base a storage format on Arrow. There is just too much missing: a lot of things that Parquet has, Arrow hasn't (all the statistics, bloom filters, indexes, etc.). It may look like Parquet is becoming Arrow if it adopts flatbuffers as a metadata format, but I would argue that is not at all the case. The metadata encoding is just a tiny detail, and a lot of other things are what makes Parquet Parquet.
>
> Cheers,
> Jan
>
> On Thu, Aug 15, 2024 at 3:20 PM Raphael Taylor-Davies <r.taylordav...@googlemail.com.invalid> wrote:
>
>> Hi,
>>
>> One thing that occurs to me from reading a number of the proposals / investigations in this space is how similar many of them end up looking to Apache Arrow Feather / IPC. Parquet does have a narrower type system and broader encoding and statistics support, but as far as the underlying file structure is concerned, feather looks a lot like many of the parquet v3 proposals.
>>
>> Despite this, feather has not seen especially broad adoption as far as I am aware, and I therefore wonder if there might be relevant learnings here. I also wonder whether, if feather for example added better encoding support, we would get something that met all the requirements, without needing to effectively create a third file format?
>>
>> To be clear, I have no particular affection for feather - in fact I find the way it handles dictionaries especially distasteful - but I wonder if there might be mileage in building off an existing format instead of defining another flatbuffer-based tabular encoding...
>>
>> Kind Regards,
>>
>> Raphael
>>
>> On 15/08/2024 13:41, Jan Finis wrote:
>>> I guess most closed-source implementations have done these optimizations already; they have just not been done in the open-source versions. E.g., we switched to a custom-built thrift runtime using pool allocators and string views instead of copied strings a few years ago, seeing comparable speed-ups. The C++ thrift library is just horribly inefficient.
>>>
>>> I agree with Alkis, though, that while some gains can be achieved by optimizing, the format has inherent drawbacks. Flatbuffers is indeed more efficient, but at the cost of increased size.
>>> Alkis, can you elaborate on how you brought the size of Flatbuffers down?
>>>
>>> Cheers,
>>> Jan
>>>
>>> On Thu, Aug 15, 2024 at 1:50 PM Andrew Lamb <andrewlam...@gmail.com> wrote:
>>>
>>>> I don't disagree that flatbuffers would be faster to decode than thrift.
>>>>
>>>> I am trying to say that with software engineering alone (no change to the format) it is likely possible to increase parquet thrift metadata parsing speed by 4x.
>>>> This is not 25x, of course, but 4x is non-trivial.
>>>>
>>>> The fact that no one has yet bothered to invest the time to get that 4x in open-source implementations of parquet suggests to me that the parsing time may not be as critical an issue as we think.
>>>>
>>>> Andrew
>>>>
>>>> On Thu, Aug 15, 2024 at 6:50 AM Alkis Evlogimenos <alkis.evlogime...@databricks.com.invalid> wrote:
>>>>
>>>>> The difference in parsing speed between thrift and flatbuffer is >25x. Thrift has some fundamental design decisions that make decoding slow:
>>>>> 1. The thrift compact protocol is very data-dependent: uleb encoding for integers, and field ids that are deltas from the previous field id. The data dependencies prevent pipelining on modern CPUs (see the sketch further down this thread).
>>>>> 2. The object model has no way to use arenas to avoid many small object allocations (sketched at the end of the thread).
>>>>> If we keep thrift, we can potentially get 2 fixed, but fixing 1 requires changes to the thrift serialization protocol. Such a change is no different from switching the serialization format.
>>>>>
>>>>> On Thu, Aug 15, 2024 at 12:30 PM Andrew Lamb <andrewlam...@gmail.com> wrote:
>>>>>
>>>>>> I wanted to share some work Xiangpeng Hao did at InfluxData this summer on the current (thrift) metadata format[1].
>>>>>>
>>>>>> We found that with careful software engineering, we could likely improve the speed of reading the existing parquet footer format by a factor of 4 or more ([2] contains some specific ideas). While we analyzed the Rust implementation, I believe a similar conclusion applies to C/C++.
>>>>>>
>>>>>> I realize that there are certain features that switching to an entirely new footer format would achieve, but the cost of adopting a new format across the ecosystem is immense (e.g., Parquet "version 2.0", etc.).
>>>>>>
>>>>>> It is my opinion that investing in software optimization the same effort that a new footer format would require would have a much bigger impact.
>>>>>>
>>>>>> Andrew
>>>>>>
>>>>>> [1]: https://www.influxdata.com/blog/how-good-parquet-wide-tables/
>>>>>> [2]: https://github.com/apache/arrow-rs/issues/5853
>>>>>>
>>>>>> On Thu, Aug 15, 2024 at 4:26 AM Alkis Evlogimenos <alkis.evlogime...@databricks.com.invalid> wrote:
>>>>>>
>>>>>>> Hi Julien,
>>>>>>>
>>>>>>> Thank you for reconnecting the threads.
>>>>>>>
>>>>>>> I have broken down my experiments into a narrative, commit by commit, on how we can go from flatbuffers being ~2x larger than thrift to being smaller than (and at times even half) the size of thrift. This is still on an internal branch; I will resume work towards the end of this month to port it to arrow so that folks can look at the progress and share ideas.
>>>>>>>
>>>>>>> On the benchmarking front, I need to build and share a binary so that third parties can donate their footers for analysis.
>>>>>>>
>>>>>>> The PR for parquet extensions has gotten a few rounds of reviews: https://github.com/apache/parquet-format/pull/254. I hope it will be merged soon.
>>>>>>>
>>>>>>> I missed the sync yesterday - for some reason I didn't receive an invitation. Julien, could you add me to the invite list again?
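The data-dependency problem from Alkis's point 1 above is easy to see in code. Here is a minimal uleb128 decoder - a sketch, not thrift's actual implementation - showing the loop-carried dependency: where value N+1 starts is only known after value N has been fully decoded, so the CPU cannot overlap the two, whereas flatbuffers reads fields at fixed offsets.

fn decode_uleb128(buf: &[u8], pos: &mut usize) -> u64 {
    let mut value = 0u64;
    let mut shift = 0;
    loop {
        let byte = buf[*pos];
        *pos += 1; // the next read position depends on this byte's high bit
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return value; // high bit clear: this was the last byte
        }
        shift += 7;
    }
}

fn main() {
    // 300 encodes as [0xAC, 0x02]; 1 encodes as [0x01].
    let buf: [u8; 3] = [0xAC, 0x02, 0x01];
    let mut pos = 0;
    // The second decode cannot even start until the first one has
    // established where it begins.
    assert_eq!(decode_uleb128(&buf, &mut pos), 300);
    assert_eq!(decode_uleb128(&buf, &mut pos), 1);
}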
>>>>>>> On Thu, Aug 15, 2024 at 1:32 AM Julien Le Dem <jul...@apache.org> wrote:
>>>>>>>
>>>>>>>> This came up in the sync today.
>>>>>>>>
>>>>>>>> There are a few concurrent experiments with flatbuffers for a future Parquet footer replacement. That is fine in itself; I just wanted to reconnect the threads here so that folks are aware of each other and can share findings.
>>>>>>>>
>>>>>>>> - Neelaksh's benchmarking and experiments:
>>>>>>>> https://medium.com/@neelaksh-singh/benchmarking-apache-parquet-my-mid-program-journey-as-an-mlh-fellow-bc0b8332c3b1
>>>>>>>> https://github.com/Neelaksh-Singh/gresearch_parquet_benchmarking
>>>>>>>>
>>>>>>>> - Alkis has also been experimenting and led the proposal for enabling extensions to the existing footer:
>>>>>>>> https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit#heading=h.15ohoov5qqm6
>>>>>>>>
>>>>>>>> - Xuwei also shared that he is looking into this.
>>>>>>>>
>>>>>>>> I would suggest that you all reply to this thread sharing your current progress or ideas and a link to your respective repos for experimenting.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Julien
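Finally, the arena / string-view idea behind Jan's custom thrift runtime and Alkis's point 2 can be sketched as follows. The types and the comma-separated "wire format" here are made up purely to show the borrowing pattern: the decoded structs hold slices into the single buffer the footer was read into, rather than an owned, separately allocated String per field, which is roughly what the stock thrift-generated object models give you.

// Hypothetical metadata struct; fields are views into the footer buffer.
struct ColumnMeta<'a> {
    path: &'a str,  // not a copy: borrows from the footer bytes
    codec: &'a str,
}

// Stand-in for real deserialization: pretend fields are comma-separated
// and records are newline-separated.
fn decode(footer: &[u8]) -> Vec<ColumnMeta<'_>> {
    let text = std::str::from_utf8(footer).expect("utf8 footer");
    text.lines()
        .map(|line| {
            let (path, codec) = line.split_once(',').expect("two fields");
            ColumnMeta { path, codec } // zero string copies here
        })
        .collect()
}

fn main() {
    let footer = b"a.b.c,SNAPPY\nx.y,ZSTD".to_vec(); // the one buffer
    let metas = decode(&footer);
    assert_eq!(metas[1].path, "x.y");
    assert_eq!(metas[1].codec, "ZSTD");
}

Whether thrift's generated object model could ever expose such a borrowing scheme is exactly the open question in Alkis's point 2.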