As an update, Alkis wrote up a nice summary of his thoughts [1][2]. I updated my PR <https://github.com/apache/parquet-format/pull/250> [3] to be more complete. At a high level (for those that have already reviewed):

1. I converted more fields to use page encoding (or added a binary field holding the Thrift-serialized encoding when the fields are expected to be small). This might be overdone (happy to debate this point); a rough sketch of the idea follows this list.
2. I removed the concept of an external data page for the sake of reducing design options (we should still benchmark this). I also think it eases the implementation burden (more on this below).
3. I removed the new encoding.
4. I think this is still missing some of the exact changes from other PRs; some of those omissions might be in error (please highlight them) and some are because I hope the individual PRs (e.g. the statistics change that Alkis proposed) can get merged before any broader proposal.
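To make item 1 a bit more concrete, below is a minimal sketch (plain Python, no Parquet library, with an illustrative field and layout of my own choosing rather than anything the PR actually specifies) of the property I care about: a list-valued footer field stored as a PLAIN-encoded buffer can be decoded lazily with random access, instead of being fully materialized the way a Thrift list is.

import struct

# Illustrative only: pretend this footer field holds per-row-group byte
# offsets, written as a PLAIN-encoded run of int64 values instead of a
# Thrift list<i64>.
def encode_plain_int64(values):
    return struct.pack('<%dq' % len(values), *values)

def read_one_lazily(buf, index):
    # Random access: decode only the requested element rather than
    # materializing the whole list up front.
    return struct.unpack_from('<q', buf, index * 8)[0]

offsets = [4, 1024, 2048, 65536]
page = encode_plain_int64(offsets)
assert read_one_lazily(page, 2) == 2048

The same bytes could just as well live inside a Thrift binary field; the lazy-decoding property is what matters more than where the buffer sits (more on that trade-off in the quoted discussion below).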
Regarding PAR3 embedding, Alkis's doc [1] highlights another option for doing this that might be more robust but slightly more complicated. In terms of items already discussed, the main question is whether to try to reuse existing structures or introduce new ones (IIUC, Alkis is proposing going straight to flatbuffers in this regard after some more tactical changes). Another point raised is that new structures require implementations that do not encapsulate Thrift well (e.g. DuckDB) to make potentially much larger structural changes. The way I tried to approach it in my PR is that it should be O(days) of work to take a PAR3 footer and convert it back to PAR1, which will hopefully allow other Parquet parsers in the ecosystem to at least get incorporated sooner even if no performance benefits are seen.

Quoting from a separate thread that Alkis started:

> 3 is important if we strongly believe that we can get the best design
> through testing prototypes on real data and measuring the effects vs
> designing changes in PRs. Along the same lines, I am requesting that you
> ask through your contacts/customers (I will do the same) for scrubbed
> footers of particular interest (wide, deep, etc) so that we can build a set
> of real footers on which we can run benchmarks and drive design decisions.

I agree with this sentiment. I think some others who have volunteered to work on this have such data, and I will see what I can do on my end. I think we should hold off on more drastic changes/improvements until we can get better metrics. But I also don't think we should let the "best" be the enemy of the "good". I hope we can ship a PAR3 footer that gets us a large improvement over the status quo and have it adopted fairly widely, rather than waiting for an optimal design. I also agree that leaving room for experimentation is a good idea (I think this can probably be done by combining the methods for embedding that have already been discussed to allow potentially two embedded footers).
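To make the embedding discussion concrete, here is a minimal sketch (plain Python, standard library only) of how a PAR3-aware reader could probe for an appended footer while a legacy reader keeps seeing only the PAR1 trailer. The PAR3 trailer layout assumed here (a 4-byte length plus a 'PAR3' magic placed at the end of the region that legacy readers treat as Thrift padding) is purely illustrative and not the exact layout in the PR.

import struct

PAR1_MAGIC = b'PAR1'
PAR3_MAGIC = b'PAR3'  # assumed marker for illustration only

def read_footers(path):
    # Returns (par1_thrift_bytes, par3_bytes_or_None).
    with open(path, 'rb') as f:
        f.seek(0, 2)
        file_size = f.tell()
        # Legacy PAR1 trailer: 4-byte little-endian footer length + 'PAR1'.
        f.seek(file_size - 8)
        tail = f.read(8)
        if tail[4:] != PAR1_MAGIC:
            raise ValueError('not a Parquet file')
        footer_len = struct.unpack('<I', tail[:4])[0]
        f.seek(file_size - 8 - footer_len)
        footer = f.read(footer_len)
    # A legacy reader hands `footer` to its Thrift parser and (hopefully)
    # ignores any trailing bytes after the Thrift struct ends. A PAR3-aware
    # reader instead checks for the assumed [len:4]['PAR3'] trailer at the
    # end of that same region.
    if len(footer) >= 8 and footer[-4:] == PAR3_MAGIC:
        par3_len = struct.unpack('<I', footer[-8:-4])[0]
        if par3_len <= len(footer) - 8:
            par3 = footer[len(footer) - 8 - par3_len:len(footer) - 8]
            return footer[:len(footer) - 8 - par3_len], par3
    return footer, None

Whether legacy Thrift parsers really tolerate those extra trailing bytes is exactly the compatibility question Antoine raises in the quoted discussion below.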
I think another question that Alkis's proposals raised is what the policy should be on deprecation of fields (especially ones that are currently required in PAR1). I think this is probably a better topic for another thread; I'll try to write a PR formalizing a proposal on feature evolution.

[1] https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit
[2] https://lists.apache.org/thread/zdpswrd4yxrj845rmoopqozhk0vrm6vo
[3] https://github.com/apache/parquet-format/pull/250

On Tue, May 28, 2024 at 10:56 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Hi Antoine,
> Thanks for the great points. Responses inline.
>
>> I like your attempt to put the "new" file metadata after the legacy
>> one in https://github.com/apache/parquet-format/pull/250, and I hope it
>> can actually be made to work (it requires current Parquet readers to
>> allow/ignore arbitrary padding at the end of the v1 Thrift metadata).
>
> Thanks (I hope so too). I think the idea is originally from Alkis. If it
> doesn't work then there is always the option of a slightly more involved
> process of making the footer look like an unknown binary field (an
> approach I know you have objections to).
>
>> I'm biased, but I find it much cleaner to define new Thrift
>> structures (FileMetadataV3, etc.), rather than painstakingly document
>> which fields are to be omitted in V3. That would achieve three goals:
>> 1) make the spec easier to read (even though it would be physically
>> longer); 2) make it easier to produce a conformant implementation
>> (special rules increase the risks of misunderstandings and
>> disagreements); 3) allow a later cleanup of the spec once we agree to
>> get rid of V1 structs.
>
> There are trade-offs here. I agree with the benefits you listed.
> The benefits of reusing existing structs are:
> 1. It lowers the amount of boilerplate code mapping from one to the other
> (i.e. a simpler initial implementation), since I expect it will be a while
> before we have standalone PAR3 files.
> 2. It allows for a lower maintenance burden if there is useful new
> metadata that we would like to see added to both the original and "V3"
> structures.
>
>> - The new encoding in that PR seems like it should be moved to a
>> separate PR and be discussed in the encodings thread?
>
> I'll cross-post on that thread. The main reason I included it in my
> proposal is that I think it provides random access for members out of the
> box (as compared to the existing encodings). I think this mostly goes to
> your third point, so I'll discuss it below.
>
>> - I'm a bit skeptical about moving Thrift lists into data pages, rather
>> than, say, just embed the corresponding Thrift serialization as
>> binary fields for lazy deserialization.
>
> I think this falls into 2 different concerns:
> 1. The format of how we serialize metadata.
> 2. Where the serialized metadata lives.
>
> For concern #1, I think we should consider treating these lists as
> actual Parquet data pages. This allows users to tune this to their needs
> for size vs decoding speed, and to make use of any improvements to
> encoding that happen in the future without a spec change. I think this is
> likely fairly valuable given the number of systems that cache this data.
> The reason I introduced the new encoding was to provide an option that
> could be as efficient as possible from a compute perspective.
>
> For concern #2, there is no reason encoding a page as a Thrift binary
> field would not work. The main reason I raised putting them outside of
> Thrift is for greater control over deserialization (the main benefit being
> avoiding copies) for implementations that have a Thrift parser that
> doesn't allow these optimizations. In terms of a path forward here, I
> think we need to understand the performance and memory characteristics of
> each approach. I agree that if there aren't substantial savings from
> having them live outside of Thrift, then it just adds complexity.
>
> Thanks,
> Micah
>
> On Tue, May 28, 2024 at 7:06 AM Antoine Pitrou <anto...@python.org> wrote:
>
>> Hello Micah,
>>
>> First, kudos for doing this!
>>
>> I like your attempt to put the "new" file metadata after the legacy
>> one in https://github.com/apache/parquet-format/pull/250, and I hope it
>> can actually be made to work (it requires current Parquet readers to
>> allow/ignore arbitrary padding at the end of the v1 Thrift metadata).
>>
>> Some assorted comments on other changes that PR is doing:
>>
>> - I'm biased, but I find it much cleaner to define new Thrift
>> structures (FileMetadataV3, etc.), rather than painstakingly document
>> which fields are to be omitted in V3.
>> That would achieve three goals:
>> 1) make the spec easier to read (even though it would be physically
>> longer); 2) make it easier to produce a conformant implementation
>> (special rules increase the risks of misunderstandings and
>> disagreements); 3) allow a later cleanup of the spec once we agree to
>> get rid of V1 structs.
>>
>> - The new encoding in that PR seems like it should be moved to a
>> separate PR and be discussed in the encodings thread?
>>
>> - I'm a bit skeptical about moving Thrift lists into data pages, rather
>> than, say, just embed the corresponding Thrift serialization as
>> binary fields for lazy deserialization.
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On Mon, 27 May 2024 23:06:37 -0700
>> Micah Kornfield <emkornfi...@gmail.com> wrote:
>> > As a follow-up to the "V3" discussions [1][2] I wanted to start a
>> > thread on improvements to the footer metadata.
>> >
>> > Based on the conversation so far, there have been a few proposals
>> > [3][4][5] to help better support files with wide schemas and many row
>> > groups. I think there are a lot of interesting ideas in each. It would
>> > be good to get further feedback on these to make sure we aren't missing
>> > anything and to define a minimal first iteration for doing experimental
>> > benchmarking to prove out an approach.
>> >
>> > I think the next steps would ideally be:
>> > 1. Come to a consensus on the overall approach.
>> > 2. Build prototypes to benchmark/test and validate the approaches
>> > defined (if we can't come to a consensus in item #1, this might help
>> > choose a direction).
>> > 3. Divide up any final approach into as fine-grained features as
>> > possible.
>> > 4. Implement across parquet-java, parquet-cpp, parquet-rs (and any
>> > other implementations that we can get volunteers for). Additionally, if
>> > new APIs are needed to make use of the new structure, it would be good
>> > to try to prototype against consumers of Parquet.
>> >
>> > Knowing that we have enough people interested in doing #3 is critical
>> > to success, so if you have time to devote, it would be helpful to chime
>> > in here (I know some people already noted they could help in the
>> > original thread).
>> >
>> > I think it is likely that either an in-person sync or another, more
>> > focused design document could help. I am happy to try to facilitate
>> > this (once we have a better sense of who wants to be involved and what
>> > time zones they are in, I can schedule a sync if necessary).
>> >
>> > Thanks,
>> > Micah
>> >
>> > [1] https://lists.apache.org/thread/5jyhzkwyrjk9z52g0b49g31ygnz73gxo
>> > [2] https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit
>> > [3] https://github.com/apache/parquet-format/pull/242
>> > [4] https://github.com/apache/parquet-format/pull/248
>> > [5] https://github.com/apache/parquet-format/pull/250