Thank you for summarizing, Micah, and thanks to everyone commenting on the
proposal and PRs.

After processing the comments, I think we might want to discuss the
extension point https://github.com/apache/parquet-format/pull/254
separately.

The extension point will allow vendors to experiment with different metadata
(be it FileMetaData, ColumnMetaData, etc.), and once a design is ready and
validated at scale, it can be discussed for inclusion in the official
specification.
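
To make this concrete, here is a rough sketch of what such an extension point
could look like. The names and shape are mine (purely illustrative, not what
PR 254 actually specifies): vendors attach opaque, namespaced payloads that
conforming readers are free to ignore.

from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical shape only -- not the actual PR 254 design.
@dataclass
class MetadataExtension:
    namespace: str   # e.g. "com.example.engine", owned by the experimenting vendor
    payload: bytes   # opaque bytes, meaningful only to that vendor

@dataclass
class FileMetadataSketch:
    num_rows: int
    extensions: List[MetadataExtension] = field(default_factory=list)

def find_extension(md: FileMetadataSketch, namespace: str) -> Optional[bytes]:
    # Readers that don't recognize a namespace simply never look it up,
    # so an experiment cannot break interoperability.
    for ext in md.extensions:
        if ext.namespace == namespace:
            return ext.payload
    return None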

On Thu, May 30, 2024 at 9:37 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> As an update, Alkis wrote up a nice summary of his thoughts [1][2].
>
> I updated my PR <https://github.com/apache/parquet-format/pull/250>
> [3] to be more complete.  At a high level (for those that have already
> reviewed):
> 1. I converted more fields to use page encoding (or added a binary field
> holding the Thrift-serialized form when they are expected to be small).
> This might be overdone (happy to debate this point).
> 2.  I removed the concept of an external data page for the sake of reducing
> the number of design options (we should still benchmark this). I think it
> also eases the implementation burden (more on this below).
> 3.  Removed the new encoding.
> 4.  I think this is still missing some of the exact changes from other
> PRs. Some of those omissions might be in error (please highlight them), and
> some are intentional because I hope the individual PRs (e.g. the statistics
> change that Alkis proposed) can be merged before any larger proposal.
>
> Regarding PAR3 embedding, Alkis's doc [1] highlights another option for
> doing this that might be more robust but slightly more complicated.
>
> In terms of items already discussed, one open question is whether to reuse
> existing structures or define new ones (IIUC, Alkis is proposing going
> straight to flatbuffers in this regard after some more tactical changes).
> Another point raised is that new structures force implementations that do
> not encapsulate Thrift well (e.g. DuckDB) to make potentially much larger
> structural changes.  The way I tried to approach this in my PR is that it
> should be O(days) of work to take a PAR3 footer and convert it back to
> PAR1, which will hopefully let other Parquet parsers in the ecosystem adopt
> it sooner even if they see no performance benefits.
>
> Quoting from a separate thread that Alkis started:
>
>> 3 is important if we strongly believe that we can get the best design
>> through testing prototypes on real data and measuring the effects vs
>> designing changes in PRs. Along the same lines, I am requesting that you
>> ask through your contacts/customers (I will do the same) for scrubbed
>> footers of particular interest (wide, deep, etc) so that we can build a
>> set of real footers on which we can run benchmarks and drive design
>> decisions.
>
>
> I agree with this sentiment. I think some others who have volunteered to
> work on this have such data, and I will see what I can do on my end.  I
> think we should hold off on more drastic changes/improvements until we have
> better metrics.  But I also don't think we should let the "best" be the
> enemy of the "good".  I hope we can ship a PAR3 footer that gets us a large
> improvement over the status quo and have it adopted fairly widely, rather
> than waiting for an optimal design.  I also agree that leaving room for
> experimentation is a good idea (this can probably be done by combining the
> embedding methods that have already been discussed to allow potentially two
> embedded footers).
>
> Another question that Alkis's proposals raised is what our policy should be
> on deprecation of fields (especially ones that are currently required in
> PAR1).  This is probably a better topic for another thread; I'll try to
> write a PR formalizing a proposal on feature evolution.
>
>
>
> [1]
> https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit
> [2] https://lists.apache.org/thread/zdpswrd4yxrj845rmoopqozhk0vrm6vo
> [3] https://github.com/apache/parquet-format/pull/250
>
> On Tue, May 28, 2024 at 10:56 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Hi Antoine,
>> Thanks for the great points.  Responses inline.
>>
>>
>>> I like your attempt to put the "new" file metadata after the legacy
>>> one in https://github.com/apache/parquet-format/pull/250, and I hope it
>>> can actually be made to work (it requires current Parquet readers to
>>> allow/ignore arbitrary padding at the end of the v1 Thrift metadata).
>>
>>
>> Thanks (I hope so too).  I think the idea is originally from Alkis.  If
>> it doesn't work, there is always the option of a slightly more involved
>> approach: making the new footer look like an unknown binary field (an
>> approach I know you have objections to).
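>>
>> To make the layout idea concrete, here is a rough sketch in Python (the
>> exact byte layout in the PR may differ; the PAR3 marker and the extra
>> length word are placeholders of mine). Legacy readers see one footer whose
>> length covers both parts and must tolerate the trailing bytes after the v1
>> Thrift struct; new readers look for the marker just before the v1 footer
>> length:
>>
>> import struct
>>
>> PAR1 = b"PAR1"
>> PAR3 = b"PAR3"  # placeholder marker for the new footer
>>
>> def write_tail(v1_thrift: bytes, par3: bytes) -> bytes:
>>     # [v1 thrift][PAR3 footer][PAR3 len][PAR3 marker][v1 footer len][PAR1]
>>     footer = v1_thrift + par3 + struct.pack("<I", len(par3)) + PAR3
>>     return footer + struct.pack("<I", len(footer)) + PAR1
>>
>> def read_par3(tail: bytes):
>>     # Returns the new footer bytes if present, or None for a legacy file.
>>     assert tail[-4:] == PAR1
>>     footer_len = struct.unpack("<I", tail[-8:-4])[0]
>>     footer = tail[-8 - footer_len:-8]
>>     if footer[-4:] != PAR3:
>>         return None
>>     par3_len = struct.unpack("<I", footer[-8:-4])[0]
>>     return footer[-8 - par3_len:-8]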
>>
>>> I'm biased, but I find it much cleaner to define new Thrift
>>>   structures (FileMetadataV3, etc.), rather than painstakingly document
>>>   which fields are to be omitted in V3. That would achieve three goals:
>>>   1) make the spec easier to read (even though it would be physically
>>>   longer); 2) make it easier to produce a conformant implementation
>>>   (special rules increase the risks of misunderstandings and
>>>   disagreements); 3) allow a later cleanup of the spec once we agree to
>>>   get rid of V1 structs.
>>
>> There are trade-offs here.  I agree with the benefits you listed.
>> The benefits of reusing existing structs are:
>> 1. It lowers the amount of boilerplate code mapping from one to the other
>> (i.e. a simpler initial implementation), since I expect it will be a while
>> before we have standalone PAR3 files.
>> 2. It lowers the maintenance burden if there is useful new metadata
>> that we would like to see added to both the original and "V3"
>> structures.
>>
>>> - The new encoding in that PR seems like it should be moved to a
>>>   separate PR and be discussed in the encodings thread?
>>
>>
>> I'll cross-post on that thread.  The main reason I included it in my
>> proposal is that I think it provides random access to members out of the
>> box (as compared to the existing encodings).  This mostly goes to your
>> third point, so I'll discuss it below.
>>
>>> - I'm a bit skeptical about moving Thrift lists into data pages, rather
>>>   than, say, just embed the corresponding Thrift serialization as
>>>   binary fields for lazy deserialization.
>>
>> I think this falls into 2 different concerns:
>> 1.  The format in which we serialize the metadata.
>> 2.  Where the serialized metadata lives.
>>
>> For concern #1, I think we should consider treating these lists as actual
>> Parquet data pages.  This lets users tune for size vs decoding speed and
>> take advantage of any future encoding improvements without a spec change
>> (see the sketch below). I think this is likely fairly valuable given the
>> number of systems that cache this data.  The reason I introduced the new
>> encoding was to provide an option that is as efficient as possible from a
>> compute perspective.
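>>
>> As a toy illustration of that size/speed knob (these are simplified
>> stand-ins, not the actual Parquet PLAIN or DELTA_BINARY_PACKED encodings),
>> here are two ways a list of row-group offsets could be encoded as a page:
>>
>> import struct
>>
>> def encode_plain(values):
>>     # Fixed-width little-endian int64s: largest, but decodable almost for free.
>>     return struct.pack(f"<{len(values)}q", *values)
>>
>> def encode_delta_varint(values):
>>     # Deltas as LEB128 varints: much smaller for sorted offsets, but slower
>>     # to decode.  Assumes values are non-decreasing.
>>     out, prev = bytearray(), 0
>>     for v in values:
>>         d, prev = v - prev, v
>>         while True:
>>             byte = d & 0x7F
>>             d >>= 7
>>             out.append(byte | 0x80 if d else byte)
>>             if not d:
>>                 break
>>     return bytes(out)
>>
>> offsets = [4, 1_048_580, 2_097_156, 3_145_732]  # made-up row-group offsets
>> print(len(encode_plain(offsets)), len(encode_delta_varint(offsets)))  # 32 vs 10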
>>
>> For concern #2, there is no reason encoding a page as a Thrift binary
>> field would not work. The main reason I raised putting the pages outside of
>> Thrift is to allow greater control over deserialization (the main benefit
>> being avoiding copies) for implementations whose Thrift parser doesn't
>> allow these optimizations.  In terms of a path forward here, I think we
>> need to understand the performance and memory characteristics of each
>> approach.  I agree that if there aren't substantial savings from keeping
>> the pages outside of Thrift, then it just adds complexity.
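>>
>> For what it's worth, the lazy-deserialization variant can be as simple as
>> the sketch below (JSON is a stand-in serializer purely so the snippet runs;
>> the real field would hold Thrift- or page-encoded bytes):
>>
>> import json
>>
>> class LazyColumnList:
>>     # Keep the serialized bytes around and decode only on first access, so
>>     # readers that never touch per-column metadata pay nothing for it.
>>     def __init__(self, raw: bytes):
>>         self._raw = raw
>>         self._decoded = None
>>
>>     @property
>>     def columns(self):
>>         if self._decoded is None:
>>             self._decoded = json.loads(self._raw)  # imagine a Thrift/page decode here
>>         return self._decoded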
>>
>> Thanks,
>> Micah
>>
>>
>>
>>
>>
>> On Tue, May 28, 2024 at 7:06 AM Antoine Pitrou <anto...@python.org>
>> wrote:
>>
>>>
>>> Hello Micah,
>>>
>>> First, kudos for doing this!
>>>
>>> I like your attempt to put the "new" file metadata after the legacy
>>> one in https://github.com/apache/parquet-format/pull/250, and I hope it
>>> can actually be made to work (it requires current Parquet readers to
>>> allow/ignore arbitrary padding at the end of the v1 Thrift metadata).
>>>
>>> Some assorted comments on other changes that PR makes:
>>>
>>> - I'm biased, but I find it much cleaner to define new Thrift
>>>   structures (FileMetadataV3, etc.), rather than painstakingly document
>>>   which fields are to be omitted in V3. That would achieve three goals:
>>>   1) make the spec easier to read (even though it would be physically
>>>   longer); 2) make it easier to produce a conformant implementation
>>>   (special rules increase the risks of misunderstandings and
>>>   disagreements); 3) allow a later cleanup of the spec once we agree to
>>>   get rid of V1 structs.
>>>
>>> - The new encoding in that PR seems like it should be moved to a
>>>   separate PR and be discussed in the encodings thread?
>>>
>>> - I'm a bit skeptical about moving Thrift lists into data pages, rather
>>>   than, say, just embed the corresponding Thrift serialization as
>>>   binary fields for lazy deserialization.
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>>
>>>
>>> On Mon, 27 May 2024 23:06:37 -0700
>>> Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>> > As a follow-up to the "V3" Discussions [1][2] I wanted to start a
>>> > thread on improvements to the footer metadata.
>>> >
>>> > Based on the conversation so far, there have been a few proposals
>>> > [3][4][5] to help better support files with wide schemas and many
>>> > row-groups.  I think there are a lot of interesting ideas in each. It
>>> > would be good to get further feedback on these to make sure we aren't
>>> > missing anything and define a minimal first iteration for doing
>>> > experimental benchmarking to prove out an approach.
>>> >
>>> > I think the next steps would ideally be:
>>> > 1.  Come to a consensus on the overall approach.
>>> > 2.  Prototype and benchmark/test to validate the approaches defined (if
>>> > we can't come to consensus in item #1, this might help choose a
>>> > direction).
>>> > 3.  Divide up any final approach into as fine-grained features as
>>> > possible.
>>> > 4.  Implement across parquet-java, parquet-cpp, parquet-rs (and any
>>> > other implementations that we can get volunteers for).  Additionally, if
>>> > new APIs are needed to make use of the new structure, it would be good
>>> > to try to prototype against consumers of Parquet.
>>> >
>>> > Knowing that we have enough people interested in doing #3 is critical
>>> > to success, so if you have time to devote, it would be helpful to chime
>>> > in here (I know some people already noted they could help in the
>>> > original thread).
>>> >
>>> > I think it is likely that either an in-person sync or another, more
>>> > focused design document will be needed. I am happy to try to facilitate
>>> > this (once we have a better sense of who wants to be involved and what
>>> > time zones they are in, I can schedule a sync if necessary).
>>> >
>>> > Thanks,
>>> > Micah
>>> >
>>> > [1] https://lists.apache.org/thread/5jyhzkwyrjk9z52g0b49g31ygnz73gxo
>>> > [2]
>>> > https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit
>>> > [3] https://github.com/apache/parquet-format/pull/242
>>> > [4] https://github.com/apache/parquet-format/pull/248
>>> > [5] https://github.com/apache/parquet-format/pull/250
>>> >
>>>
>>>
>>>
>>>
