Thank you for summarizing, Micah, and thanks to everyone commenting on the proposal and PRs.
After processing the comments, I think we might want to discuss the extension point (https://github.com/apache/parquet-format/pull/254) separately. The extension point will allow vendors to experiment with different metadata (be it FileMetaData, ColumnMetaData, etc.), and once a design is ready and validated at large scale, it can be discussed for inclusion in the official specification.

On Thu, May 30, 2024 at 9:37 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

> As an update Alkis wrote up a nice summary of his thoughts [1][2].
>
> I updated my PR <https://github.com/apache/parquet-format/pull/250>
> [3] to be more complete. At a high-level (for those that have already
> reviewed):
> 1. I converted more fields to use page-encoding (or added a binary field
> for thrift serialized encoding when they are expected to be small).
> This might be overdone (happy for this feedback to debate).
> 2. I removed the concept of an external data page for the sake of trying
> to remove design options (we should still benchmark this). It also I think
> eases implementation burden (more on this below).
> 3. Removed the new encoding.
> 4. I think this is still missing some of the exact changes from other
> PRs, some of those might be in error (please highlight them) and some are
> because I hope the individual PRs (i.e. the statistics change that Alkis
> proposed) can get merged before any proposal.
>
> Regarding PAR3 embedding, Alkis's doc [1] highlights another
> option for doing this that might be more robust but slightly more
> complicated.
>
> I think in terms of items already discussed, whether to try to reuse
> existing structures or use new structures (Alkis is proposing going
> straight to flatbuffers in this regard IIUC after some more tactical
> changes). I think another point raised is the problem with new structures
> is they require implementations (e.g. DuckDB) that do not encapsulate
> Thrift well to make potentially much larger structural changes.
> The way I tried to approach it in my PR is it should be O(days) work to
> take a PAR3 footer and convert it back to PAR1, which will hopefully allow
> other Parquet parsers in the ecosystems to at least get incorporated sooner
> even if no performance benefits are seen.
>
> Quoting from a separate thread that Alkis Started:
>
>> 3 is important if we strongly believe that we can get the best design
>> through testing prototypes on real data and measuring the effects vs
>> designing changes in PRs. Along the same lines, I am requesting that you
>> ask through your contacts/customers (I will do the same) for scrubbed
>> footers of particular interest (wide, deep, etc) so that we can build a
>> set of real footers on which we can run benchmarks and drive design
>> decisions.
>
> I agree with this sentiment. I think some others who have volunteered to
> work on this have such data and I will see what I can do on my end. I
> think we should hold off more drastic changes/improvements until we can get
> better metrics. But I also don't think we should let the "best" be the
> enemy of the "good". I hope we can ship a PAR3 footer sooner that gets us
> a large improvement over the status quo and have it adopted fairly widely
> sooner rather than waiting for an optimal design. I also agree leaving
> room for experimentation is a good idea (I think this can probably be done
> by combining the methods for embedding that have already been discussed to
> allow potentially 2 embedded footers).
>
> I think another question that Alkis's proposals raised is how policies on
> deprecation of fields (especially ones that are currently required in
> PAR1). I think this is probably a better topic for another thread, I'll
> try to write a PR formalizing a proposal on feature evolution.
>
> [1] https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit
> [2] https://lists.apache.org/thread/zdpswrd4yxrj845rmoopqozhk0vrm6vo
> [3] https://github.com/apache/parquet-format/pull/250
>
> On Tue, May 28, 2024 at 10:56 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Hi Antoine,
>> Thanks for the great points. Responses inline.
>>
>>> I like your attempt to put the "new" file metadata after the legacy
>>> one in https://github.com/apache/parquet-format/pull/250, and I hope it
>>> can actually be made to work (it requires current Parquet readers to
>>> allow/ignore arbitrary padding at the end of the v1 Thrift metadata).
>>
>> Thanks (I hope so too). I think the idea is originally from Alkis. If
>> it doesn't work then there is always an option of doing a little more
>> involved process of making the footer look like an unknown binary field (an
>> approach I know you have objections to).
>>
>>> I'm biased, but I find it much cleaner to define new Thrift
>>> structures (FileMetadataV3, etc.), rather than painstakingly document
>>> which fields are to be omitted in V3. That would achieve three goals:
>>> 1) make the spec easier to read (even though it would be physically
>>> longer); 2) make it easier to produce a conformant implementation
>>> (special rules increase the risks of misunderstandings and
>>> disagreements); 3) allow a later cleanup of the spec once we agree to
>>> get rid of V1 structs.
>>
>> There are trade-offs here. I agree with the benefits you listed here.
>> The benefits of reusing existing structs are:
>> 1. Lowers the amount of boilerplate code mapping from one to the other
>> (i.e. simpler initial implementation), since I expect it will be a while
>> before we have standalone PAR3 files.
>> 2. Allows for lower maintenance burden if there is useful new metadata
>> that we would like to see added to both structures original and "V3"
>> structures.
>>
>>> - The new encoding in that PR seems like it should be moved to a
>>> separate PR and be discussed in the encodings thread?
>>
>> I'll cross post on that thread. The main reason I included it in my
>> proposal is I think it provides random access for members out of the box
>> (as compared to the existing encodings). I think this mostly goes to your
>> third-point so I'll discuss below.
>>
>>> - I'm a bit skeptical about moving Thrift lists into data pages, rather
>>> than, say, just embed the corresponding Thrift serialization as
>>> binary fields for lazy deserialization.
>>
>> I think this falls into 2 different concerns:
>> 1. The format of how we serialize metadata.
>> 2. Where the serialized metadata lives.
>>
>> For concern #1, I think we should be considering treating these lists as
>> actual parquet data pages. This allows users to tune this to their needs
>> for size vs decoding speed, and make use of any improvements to encoding
>> that happen in the future without a spec change. I think this is likely
>> fairly valuable given the number of systems that cache this data. The
>> reason I introduced the new encoding was to provide an option that could be
>> as efficient as possible from a compute perspective.
>>
>> For concern #2, there is no reason encoding a page as a thrift Binary
>> field would not work. The main reason I raised putting them outside of
>> thrift is for greater control on deserialization (the main benefit being
>> avoiding copies) for implementations that have a Thrift parser that doesn't
>> allow these optimizations. In terms of a path forward here, I think
>> understanding the performance and memory characteristics of each approach.
>> I agree, if there isn't substantial savings from having them be outside the
>> page, then it just adds complexity.
>>
>> Thanks,
>> Micah
>>
>>
>> On Tue, May 28, 2024 at 7:06 AM Antoine Pitrou <anto...@python.org>
>> wrote:
>>
>>>
>>> Hello Micah,
>>>
>>> First, kudos for doing this!
>>>
>>> I like your attempt to put the "new" file metadata after the legacy
>>> one in https://github.com/apache/parquet-format/pull/250, and I hope it
>>> can actually be made to work (it requires current Parquet readers to
>>> allow/ignore arbitrary padding at the end of the v1 Thrift metadata).
>>>
>>> Some assorted comments on other changes that PR is doing:
>>>
>>> - I'm biased, but I find it much cleaner to define new Thrift
>>> structures (FileMetadataV3, etc.), rather than painstakingly document
>>> which fields are to be omitted in V3. That would achieve three goals:
>>> 1) make the spec easier to read (even though it would be physically
>>> longer); 2) make it easier to produce a conformant implementation
>>> (special rules increase the risks of misunderstandings and
>>> disagreements); 3) allow a later cleanup of the spec once we agree to
>>> get rid of V1 structs.
>>>
>>> - The new encoding in that PR seems like it should be moved to a
>>> separate PR and be discussed in the encodings thread?
>>>
>>> - I'm a bit skeptical about moving Thrift lists into data pages, rather
>>> than, say, just embed the corresponding Thrift serialization as
>>> binary fields for lazy deserialization.
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>>
>>> On Mon, 27 May 2024 23:06:37 -0700
>>> Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>> > As a follow-up to the "V3" Discussions [1][2] I wanted to start a thread on
>>> > improvements to the footer metadata.
>>> >
>>> > Based on conversation so far, there have been a few proposals [3][4][5] to
>>> > help better support files with wide schemas and many row-groups. I think
>>> > there are a lot of interesting ideas in each. It would be good to get
>>> > further feedback on these to make sure we aren't missing anything and
>>> > define a minimal first iteration for doing experimental benchmarking to
>>> > prove out an approach.
>>> >
>>> > I think the next steps would ideally be:
>>> > 1. Come to a consensus on the overall approach.
>>> > 2. Prototypes to Benchmark/test to validate the approaches defined (if we
>>> > can't come to consensus in item #1, this might help choose a direction).
>>> > 3. Divide up any final approach into as fine-grained features as possible.
>>> > 4. Implement across parquet-java, parquet-cpp, parquet-rs (and any other
>>> > implementations that we can get volunteers for). Additionally, if new APIs
>>> > are needed to make use of the new structure, it would be good to try to
>>> > prototype against consumers of Parquet.
>>> >
>>> > Knowing that we have enough people interested in doing #3 is critical to
>>> > success, so if you have time to devote, it would be helpful to chime in
>>> > here (I know some people already noted they could help in the original
>>> > thread).
>>> >
>>> > I think it is likely we will need either an in person sync or another more
>>> > focused design document could help. I am happy to try to facilitate this
>>> > (once we have a better sense of who wants to be involved and what time
>>> > zones they are in I can schedule a sync if necessary).
>>> >
>>> > Thanks,
>>> > Micah
>>> >
>>> > [1] https://lists.apache.org/thread/5jyhzkwyrjk9z52g0b49g31ygnz73gxo
>>> > [2] https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit
>>> > [3] https://github.com/apache/parquet-format/pull/242
>>> > [4] https://github.com/apache/parquet-format/pull/248
>>> > [5] https://github.com/apache/parquet-format/pull/250