> How would a thrift extension help if we'll be moving to FlatBuffers for
> metadata?
I'm not sure I understand the question, but I think the only question raised
so far is whether we should be trying to have smaller incremental steps in
optimizations (e.g. thrift modification) or whether we should make a larger
jump to something like flatbuffers. Another option, which again might be too
complicated but relies a little bit less on flatbuffers, is to "shred"
metadata into Arrow/Parquet itself. The main advantage I see to this
technique is that it possibly yields better compression results, which I
think are a win if the metadata is straddling the line between 1 and 2 reads
of the footer. As already discussed, these approaches have a major downside
in overall complexity.

> How do the two things work together? How are we planning to
> extend FlatBuffers?

I don't think we've seen any formal proposal here. Alkis was working on a
prototype but I don't think he has published anything for comments yet.

Thanks,
Micah

On Wed, Jun 5, 2024 at 10:07 AM Jan Finis <jpfi...@gmail.com> wrote:

> How would a thrift extension help if we'll be moving to FlatBuffers for
> metadata? How do the two things work together? How are we planning to
> extend FlatBuffers?
>
> Am Mi., 5. Juni 2024 um 18:48 Uhr schrieb Micah Kornfield <
> emkornfi...@gmail.com>:
>
> > > 1. ratify https://github.com/apache/parquet-format/pull/254 as the
> > > extension mechanism for parquet. With this we can experiment on new
> > > footers without having to specify anything else.
> >
> > I think we have probably reached a lazy consensus that is reasonable. I
> > think I misspoke earlier but we should at least have parquet-java and a
> > second implementation showing that we can write out the arbitrary bytes
> > without too much issue (and also read a file that is written in this
> > format). Alkis would you be able to do this?
> >
> > > 3. collaborate on a couple of prototypes, test them in production and
> > > come up with a report advocating for their inclusion to parquet
> > > proper. With (1) in place these experiments/prototypes can be done in
> > > parallel and tested by different organizations without coupling
> >
> > I think it makes sense to timebox. Did you have any thoughts on the
> > duration of experimentation?
> >
> > > 4. decide which candidate is made official and announce the migration
> > > path (deprecate old footers and give timeline for stopping the
> > > emission of dual footers)
> >
> > I hope the strawman proposal on feature releases [1] can be refined and
> > applied to this case.
> >
> > Thanks,
> > Micah
> >
> > [1] https://github.com/apache/parquet-format/pull/258
> >
> > On Wed, Jun 5, 2024 at 2:47 AM Antoine Pitrou <anto...@python.org>
> > wrote:
> >
> > > Google docs tend to get lost very quickly. My experience with the
> > > Python PEP process leads me to a preference for a .md file in the
> > > repo, that can be collectively owned and rely on regular GH-based
> > > review tools.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > > On Tue, 4 Jun 2024 18:52:42 -0700
> > > Julien Le Dem <jul...@apache.org> wrote:
> > > > I agree that flatbuffer is a good option if we are happy with the
> > > > perf and it lets us access column metadata in O(1) without reading
> > > > other columns.
> > > > If we're going to make an incompatible metadata change, let's make
> > > > it once with a transition path to easily move from PAR1 to PAR3
> > > > letting them coexist in a backward compatible phase for a while.
> > > > > > > > I think that before voting on this, we should summarize in a doc the > > > whole > > > > PAR3 footer metadata discussion: > > > > 1) Goals: (O(1) random access, extensibility, ...) > > > > 2) preferred option > > > > 3) migration path. > > > > 4) mention other options we considered and why we didn't pick them > > (this > > > > doesn't have to be extensive) > > > > > > > > That will make it easier for people who are impacted but haven't > > actively > > > > contributed to this discussion so far to review and chime in. > > > > This is a big change with large potential impact here. > > > > > > > > Do people prefer google doc or a PR with a .md for this? I personally > > > like > > > > google docs (we can copy it in the repo after approval) > > > > > > > > Julien > > > > > > > > > > > > On Tue, Jun 4, 2024 at 1:53 AM Alkis Evlogimenos > > > > <alkis.evlogime...@databricks.com.invalid> wrote: > > > > > > > > > > > > > > > > Finally, one point I wanted to highlight here (I also mentioned > it > > > in the > > > > > > PR): If we want random access, we have to abolish the concept > that > > > the > > > > > data > > > > > > in the columns array is in a different order than in the schema. > > > Your PR > > > > > > [1] even added a new field schema_index for matching between > > > > > > ColumnMetaData and schema position, but this kills random access. > > If > > > I > > > > > want > > > > > > to read the third column in the schema, then do a O(1) random > > access > > > into > > > > > > the third column chunk only to notice that it's schema index is > > > totally > > > > > > different and therefore I need a full exhaustive search to find > the > > > > > column > > > > > > that actually belongs to the third column in the schema, then all > > our > > > > > > random access efforts are in vain. > > > > > > > > > > > > > > > `schema_index` is useful to implement > > > > > https://issues.apache.org/jira/browse/PARQUET-183 which is more > and > > > more > > > > > prevalent as schemata become wider. > > > > > > > > > > On Mon, Jun 3, 2024 at 5:54 PM Micah Kornfield < > > emkornfi...@gmail.com> > > > > > wrote: > > > > > > > > > > > Thanks everyone for chiming in. Some responses inline: > > > > > > > > > > > > > > > > > > > > The thrift > > > > > > > decoder just has to be invoked recursively whenever such a lazy > > > field > > > > > is > > > > > > > required. This is nice, but since it doesn't give us random > > access > > > into > > > > > > > lists, it's also only partially helpful. > > > > > > > > > > > > This point is moot if we move to flatbuffers but I think part of > > the > > > > > > proposals are either using list<binary> or providing arrow-like > > > offsets > > > > > > into the serialized binary to support random access of elements. > > > > > > > > > > > > > > > > > > > I don't fully understand this point, can you elaborate on it. > It > > > feels > > > > > > like > > > > > > > a non-issue or a super edge case to me. Is this just a DuckDB > > > issue? If > > > > > > so, > > > > > > > I am very sure they're happy to change this, as they're quite > > > active > > > > > and > > > > > > > also strive for simplicity and I would argue that exposing > thrift > > > > > > directly > > > > > > > isn't that. > > > > > > > > > > > > IIUC, I don't think Thrift is public from an end-user > perspective. > > > It is > > > > > > however public in the fact that internally DuckDB exposes the > > Thrift > > > > > > structs directly to consuming code. 
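For illustration only, here is a minimal sketch (in Java) of the
"list<binary> plus arrow-like offsets" idea mentioned above; the class and
field names are invented for the example and come from none of the PRs:

```java
import java.nio.ByteBuffer;

// Hypothetical layout: N serialized metadata entries written back to back, plus
// an Arrow-style offsets array with N+1 entries, so entry i occupies the byte
// range [offsets[i], offsets[i+1]) and can be sliced without decoding the rest.
final class PackedMetadataList {
    private final ByteBuffer blobs; // concatenated serialized entries
    private final int[] offsets;    // offsets.length == entryCount + 1

    PackedMetadataList(ByteBuffer blobs, int[] offsets) {
        this.blobs = blobs;
        this.offsets = offsets;
    }

    int entryCount() {
        return offsets.length - 1;
    }

    /** O(1): return the raw bytes of entry i; the caller decodes only this slice. */
    ByteBuffer entry(int i) {
        ByteBuffer slice = blobs.duplicate();
        slice.position(offsets[i]);
        slice.limit(offsets[i + 1]);
        return slice.slice();
    }
}
```

Whether the slices hold Thrift, FlatBuffers, or page-encoded bytes is
orthogonal; the offsets array is what turns "decode the whole list" into
"decode only the entries you need".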
> > > > > > > > > > > > * I don't think there is value in providing a 1-to-1 mapping from > > > the > > > > > > > old footer encoding to the new encoding. On the contrary, > this > > > is the > > > > > > > opportunity to clean up and correct some of the oddities that > > > have > > > > > > > accumulated in the past. > > > > > > > > > > > > I think I should clarify this, as I see a few distinct cases > here: > > > > > > > > > > > > 1. Removing duplication/redundancy that accumulated over the > years > > > for > > > > > > backwards compatibility. > > > > > > 2. Removing fields that were never used in practice. > > > > > > 3. Changing the layout of fields (e.g. moving from array of > > structs > > > to > > > > > > struct of arrays) for performance considerations. > > > > > > 4. Writing potentially less metadata (e.g. summarization of > > metadata > > > > > > today). > > > > > > > > > > > > IMO, I think we should be doing 1,2, and 3. I don't think we > > should > > > be > > > > > > doing 4 (e.g. as a concrete example, see the discussion on > > > > > > PageEncodingStats [1]). > > > > > > > > > > > > If we want random access, we have to abolish the concept that the > > > data > > > > > > > in the columns array is in a different order than in the > schema. > > > Your > > > > > PR > > > > > > > [1] even added a new field schema_index for matching between > > > > > > ColumnMetaData > > > > > > > and schema position, but this kills random access. > > > > > > > > > > > > > > > > > > I think this is a larger discussion that should be split off, as > I > > > don't > > > > > > think it should block the core work here. This was adapted from > > > another > > > > > > proposal, that I think had different ideas on how possible rework > > > column > > > > > > selection (it seems this would be on a per RowGroup basis). > > > > > > > > > > > > [1] > > > https://github.com/apache/parquet-format/pull/250/files#r1620984136 > > > > > > > > > > > > > > > > > > On Mon, Jun 3, 2024 at 8:20 AM Antoine Pitrou < > anto...@python.org> > > > > > wrote: > > > > > > > > > > > > > > > > > > > > Everything Jan said below aligns closely with my opinion. > > > > > > > > > > > > > > * +1 for going directly to Flatbuffers for the new footer > format > > > *if* > > > > > > > there is a general agreement that moving to Flatbuffers at > some > > > point > > > > > > > is desirable (including from a software ecosystem point of > > view). > > > > > > > > > > > > > > * I don't think there is value in providing a 1-to-1 mapping > from > > > the > > > > > > > old footer encoding to the new encoding. On the contrary, > this > > > is the > > > > > > > opportunity to clean up and correct some of the oddities that > > > have > > > > > > > accumulated in the past. > > > > > > > > > > > > > > Regards > > > > > > > > > > > > > > Antoine. > > > > > > > > > > > > > > > > > > > > > On Mon, 3 Jun 2024 15:58:40 +0200 > > > > > > > Jan Finis <jpfi...@gmail.com> wrote: > > > > > > > > Interesting discussion so far, thanks for driving this > Micah! A > > > few > > > > > > > points > > > > > > > > from my side: > > > > > > > > > > > > > > > > When considering flatbuffers vs. lazy "binary" nested thrift, > > > vs. own > > > > > > > > MetaDataPage format, let's also keep architectural simplicity > > > in > > > > > mind. > > > > > > > > > > > > > > > > For example, introducing flatbuffers might sound like a big > > > change at > > > > > > > > first, but at least it is then *one format* for everything. 
> In > > > > > > contrast, > > > > > > > > thrift + custom MetaDataPage is two formats. My gut feeling > > > estimate > > > > > > > > would be that it is probably easier to just introduce a > > > flatbuffers > > > > > > > reader > > > > > > > > instead of special casing some thrift to instead need a > custom > > > > > > > MetaDataPage > > > > > > > > reader. > > > > > > > > > > > > > > > > The lazy thrift "hack" is something in between the two. It is > > > > > probably > > > > > > > the > > > > > > > > easiest to adopt, as no new reading logic needs to be > written. > > > The > > > > > > thrift > > > > > > > > decoder just has to be invoked recursively whenever such a > lazy > > > field > > > > > > is > > > > > > > > required. This is nice, but since it doesn't give us random > > > access > > > > > into > > > > > > > > lists, it's also only partially helpful. > > > > > > > > > > > > > > > > Given all this, from the implementation / architectural > > > cleanliness > > > > > > > side, I > > > > > > > > guess I would prefer just using flatbuffers, unless we find > big > > > > > > > > disadvantages with this. This also brings us closer to Arrow, > > > > > although > > > > > > > > that's not too important here. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1. I think for an initial revision of metadata we should > > make > > > it > > > > > > > possible > > > > > > > > > to have a 1:1 mapping between PAR1 footers and whatever is > > > included > > > > > > in > > > > > > > the > > > > > > > > > new footer. The rationale for this is to let > implementations > > > that > > > > > > > haven't > > > > > > > > > abstracted out thrift structures an easy path to > > incorporating > > > the > > > > > > new > > > > > > > > > footer (i.e. just do translation at the boundaries). > > > > > > > > > > > > > > > > > > > > > > > > > I don't fully understand this point, can you elaborate on it. > > > It > > > > > feels > > > > > > > like > > > > > > > > a non-issue or a super edge case to me. Is this just a DuckDB > > > issue? > > > > > If > > > > > > > so, > > > > > > > > I am very sure they're happy to change this, as they're quite > > > active > > > > > > and > > > > > > > > also strive for simplicity and I would argue that exposing > > > thrift > > > > > > > directly > > > > > > > > isn't that. Our database also allows metadata access in SQL, > > but > > > we > > > > > > > > transcode the thrift into JSON. Given that JSON is pretty > > > standard in > > > > > > > > databases while thrift isn't, I'm sure DuckDB devs will see > it > > > the > > > > > > same. > > > > > > > > > > > > > > > > > > > > > > > > Finally, one point I wanted to highlight here (I also > mentioned > > > it in > > > > > > the > > > > > > > > PR): If we want random access, we have to abolish the concept > > > that > > > > > the > > > > > > > data > > > > > > > > in the columns array is in a different order than in the > > schema. > > > Your > > > > > > PR > > > > > > > > [1] even added a new field schema_index for matching between > > > > > > > ColumnMetaData > > > > > > > > and schema position, but this kills random access. 
If I want > to > > > read > > > > > > the > > > > > > > > third column in the schema, then do a O(1) random access into > > > the > > > > > third > > > > > > > > column chunk only to notice that it's schema index is totally > > > > > different > > > > > > > and > > > > > > > > therefore I need a full exhaustive search to find the column > > > that > > > > > > > actually > > > > > > > > belongs to the third column in the schema, then all our > random > > > access > > > > > > > > efforts are in vain. > > > > > > > > > > > > > > > > Therefore, the only possible way to make random access useful > > is > > > to > > > > > > > mandate > > > > > > > > that ColumnMetaData in the columns list has to be in exactly > > the > > > same > > > > > > > order > > > > > > > > in which the columns appear in the schema. > > > > > > > > > > > > > > > > Cheers, > > > > > > > > Jan > > > > > > > > > > > > > > > > [1] https://github.com/apache/parquet-format/pull/250 > > > > > > > > > > > > > > > > > > > > > > > > Am Sa., 1. Juni 2024 um 10:38 Uhr schrieb Micah Kornfield < > > > > > > > > emkornfi...@gmail.com>: > > > > > > > > > > > > > > > > > As an update here/some responses. Alkis [3] is making > > > considerable > > > > > > > > > progress on a Flatbuffer alternative that shows good > > > performance > > > > > > > benchmarks > > > > > > > > > on some real sample footers (and hopefully soon some > > synthetic > > > data > > > > > > > from > > > > > > > > > Rok). > > > > > > > > > > > > > > > > > > The approaches that currently have public PRs [1][2] IIUC > > > mostly > > > > > save > > > > > > > time > > > > > > > > > by lazily decompressing thrift metadata (some of the > details > > > differ > > > > > > > but it > > > > > > > > > is effectively the same mechanism). This helps for cases > > > when > > > > > only a > > > > > > > few > > > > > > > > > row groups/columns are needed but in the limit has the same > > > > > > theoretical > > > > > > > > > performance penalties for full table reads. > > > > > > > > > > > > > > > > > > I would like to get people's take on two points: > > > > > > > > > 1. I think for an initial revision of metadata we should > > make > > > it > > > > > > > possible > > > > > > > > > to have a 1:1 mapping between PAR1 footers and whatever is > > > included > > > > > > in > > > > > > > the > > > > > > > > > new footer. The rationale for this is to let > implementations > > > that > > > > > > > haven't > > > > > > > > > abstracted out thrift structures an easy path to > > incorporating > > > the > > > > > > new > > > > > > > > > footer (i.e. just do translation at the boundaries). > > > > > > > > > 2. Do people see value in trying to do a Thrift only > > > iteration > > > > > which > > > > > > > > > addresses the use-case of scanning only a select number of > > row > > > > > > > > > groups/columns? Or if Flatbuffers offer an overall better > > > > > > performance > > > > > > > > > should we jump to using it? > > > > > > > > > > > > > > > > > > After processing the comments I think we might want to > > discuss > > > the > > > > > > > > > > extension point > > > > > https://github.com/apache/parquet-format/pull/254 > > > > > > > > > > separately. > > > > > > > > > > > > > > > > > > > > > > > > > > > I think this is already getting reviewed (I also think we > > > touched > > > > > on > > > > > > > it in > > > > > > > > > the extensibility thread). 
Since this is really just > > defining > > > how > > > > > we > > > > > > > can > > > > > > > > > encapsulate data and doesn't involve any upfront work, I > > think > > > once > > > > > > > > > everyone has had a chance to comment on it we can hopefully > > > hold a > > > > > > > vote on > > > > > > > > > it (hopefully in the next week or 2). I think the only > other > > > > > viable > > > > > > > > > alternative is what is proposed in [2] which doesn't > involve > > > any > > > > > > > mucking > > > > > > > > > with Thrift bytes but poses a slightly larger compatibility > > > risk. > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > Micah > > > > > > > > > > > > > > > > > > [1] https://github.com/apache/parquet-format/pull/242 > > > > > > > > > [2] https://github.com/apache/parquet-format/pull/250 > > > > > > > > > [3] > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/apache/parquet-format/pull/250#pullrequestreview-2091174869 > > > > > > > > > > > > > > > > > > > > > On Thu, May 30, 2024 at 7:21 AM Alkis Evlogimenos < > > > > > > > > > alkis.evlogime...@databricks.com> wrote: > > > > > > > > > > > > > > > > > > > Thank you for summarizing Micah and thanks to everyone > > > commenting > > > > > > on > > > > > > > the > > > > > > > > > > proposal and PRs. > > > > > > > > > > > > > > > > > > > > After processing the comments I think we might want to > > > discuss > > > > > the > > > > > > > > > > extension point > > > > > https://github.com/apache/parquet-format/pull/254 > > > > > > > > > > separately. > > > > > > > > > > > > > > > > > > > > The extension point will allow vendors to experiment on > > > different > > > > > > > > > metadata > > > > > > > > > > (be it FileMetaData, or ColumnMetaData etc) and when a > > > design is > > > > > > > ready > > > > > > > > > and > > > > > > > > > > validated in large scale, it can be discussed for > inclusion > > > to > > > > > the > > > > > > > > > official > > > > > > > > > > specification. > > > > > > > > > > > > > > > > > > > > On Thu, May 30, 2024 at 9:37 AM Micah Kornfield < > > > > > > > emkornfi...@gmail.com> > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > >> As an update Alkis wrote up a nice summary of his > thoughts > > > > > [1][2]. > > > > > > > > > >> > > > > > > > > > >> I updated my PR < > > > > > > https://github.com/apache/parquet-format/pull/250> > > > > > > > > > >> [3] to be more complete. At a high-level (for those > that > > > have > > > > > > > already > > > > > > > > > >> reviewed): > > > > > > > > > >> 1. I converted more fields to use page-encoding (or > added > > > a > > > > > binary > > > > > > > field > > > > > > > > > >> for thrift serialized encoding when they are expected to > > > be > > > > > > small). > > > > > > > > > >> This might be overdone (happy for this feedback to > > debate). > > > > > > > > > >> 2. I removed the concept of an external data page for > the > > > sake > > > > > of > > > > > > > > > trying > > > > > > > > > >> to remove design options (we should still benchmark > this). > > > It > > > > > also > > > > > > > I > > > > > > > > > think > > > > > > > > > >> eases implementation burden (more on this below). > > > > > > > > > >> 3. Removed the new encoding. > > > > > > > > > >> 4. 
I think this is still missing some of the exact > > changes > > > from > > > > > > > other > > > > > > > > > >> PRs, some of those might be in error (please highlight > > > them) and > > > > > > > some > > > > > > > > > are > > > > > > > > > >> because I hope the individual PRs (i.e. the statistics > > > change > > > > > that > > > > > > > Alkis > > > > > > > > > >> proposed can get merged before any proposal) > > > > > > > > > >> > > > > > > > > > >> Regarding embedding PAR3 embedding, Alkis's doc [1] > > > highlights > > > > > > > another > > > > > > > > > >> option for doing this that might be more robust but > > > slightly > > > > > more > > > > > > > > > >> complicated. > > > > > > > > > >> > > > > > > > > > >> I think in terms of items already discussed, whether to > > try > > > to > > > > > > reuse > > > > > > > > > >> existing structures or use new structures (Alkis is > > > proposing > > > > > > going > > > > > > > > > >> straight to flatbuffers in this regard IIUC after some > > > more > > > > > > tactical > > > > > > > > > >> changes). I think another point raised is the problem > > with > > > new > > > > > > > > > structures > > > > > > > > > >> is they require implementations (e.g. DuckDB) that do > not > > > > > > > encapsulate > > > > > > > > > >> Thrift well to make potentially much larger structural > > > changes. > > > > > > > The > > > > > > > > > way I > > > > > > > > > >> tried to approach it in my PR is it should be O(days) > work > > > to > > > > > take > > > > > > > a > > > > > > > > > PAR3 > > > > > > > > > >> footer and convert it back to PAR1, which will hopefully > > > allow > > > > > > other > > > > > > > > > >> Parquet parsers in the ecosystems to at least get > > > incorporated > > > > > > > sooner > > > > > > > > > even > > > > > > > > > >> if no performance benefits are seen. > > > > > > > > > >> > > > > > > > > > >> Quoting from a separate thread that Alkis Started: > > > > > > > > > >> > > > > > > > > > >> 3 is important if we strongly believe that we can get > the > > > best > > > > > > > design > > > > > > > > > >>> through testing prototypes on real data and measuring > the > > > > > effects > > > > > > > vs > > > > > > > > > >>> designing changes in PRs. Along the same lines, I am > > > requesting > > > > > > > that > > > > > > > > > you > > > > > > > > > >>> ask through your contacts/customers (I will do the > same) > > > for > > > > > > > scrubbed > > > > > > > > > >>> footers of particular interest (wide, deep, etc) so > that > > > we can > > > > > > > build a > > > > > > > > > >>> set > > > > > > > > > >>> of real footers on which we can run benchmarks and > drive > > > design > > > > > > > > > >>> decisions. > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> I agree with this sentiment. I think some others who > have > > > > > > > volunteered to > > > > > > > > > >> work on this have such data and I will see what I can do > > on > > > my > > > > > > > end. I > > > > > > > > > >> think we should hold off more drastic > changes/improvements > > > until > > > > > > we > > > > > > > can > > > > > > > > > get > > > > > > > > > >> better metrics. But I also don't think we should let > the > > > "best" > > > > > > be > > > > > > > the > > > > > > > > > >> enemy of the "good". 
I hope we can ship a PAR3 footer > > > sooner > > > > > that > > > > > > > gets > > > > > > > > > us > > > > > > > > > >> a large improvement over the status quo and have it > > > adopted > > > > > fairly > > > > > > > > > widely > > > > > > > > > >> sooner rather than waiting for an optimal design. I > also > > > agree > > > > > > > leaving > > > > > > > > > >> room for experimentation is a good idea (I think this > can > > > > > probably > > > > > > > be > > > > > > > > > done > > > > > > > > > >> by combining the methods for embedding that have already > > > been > > > > > > > discussed > > > > > > > > > to > > > > > > > > > >> allow potentially 2 embedded footers). > > > > > > > > > >> > > > > > > > > > >> I think another question that Alkis's proposals raised > is > > > how > > > > > > > policies > > > > > > > > > on > > > > > > > > > >> deprecation of fields (especially ones that are > currently > > > > > required > > > > > > > in > > > > > > > > > >> PAR1). I think this is probably a better topic for > > > another > > > > > > thread, > > > > > > > I'll > > > > > > > > > >> try to write a PR formalizing a proposal on feature > > > evolution. > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> [1] > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit > > > > > > > > > > > > > > > > > > > >> [2] > > > > > > > > https://lists.apache.org/thread/zdpswrd4yxrj845rmoopqozhk0vrm6vo > > > > > > > > > >> [3] https://github.com/apache/parquet-format/pull/250 > > > > > > > > > >> > > > > > > > > > >> On Tue, May 28, 2024 at 10:56 AM Micah Kornfield < > > > > > > > emkornfi...@gmail.com > > > > > > > > > > > > > > > > > > > >> wrote: > > > > > > > > > >> > > > > > > > > > >>> Hi Antoine, > > > > > > > > > >>> Thanks for the great points. Responses inline. > > > > > > > > > >>> > > > > > > > > > >>> > > > > > > > > > >>>> I like your attempt to put the "new" file metadata > after > > > the > > > > > > > legacy > > > > > > > > > >>>> one in > > https://github.com/apache/parquet-format/pull/250, > > > > > > > > and I > > > > > > > hope > > > > > > > > > it > > > > > > > > > >>>> can actually be made to work (it requires current > > > Parquet > > > > > > readers > > > > > > > to > > > > > > > > > >>>> allow/ignore arbitrary padding at the end of the v1 > > > Thrift > > > > > > > metadata). > > > > > > > > > >>> > > > > > > > > > >>> > > > > > > > > > >>> Thanks (I hope so too). I think the idea is originally > > > from > > > > > > > Alkis. If > > > > > > > > > >>> it doesn't work then there is always an option of > doing a > > > > > little > > > > > > > more > > > > > > > > > >>> involved process of making the footer look like an > > > unknown > > > > > binary > > > > > > > > > field (an > > > > > > > > > >>> approach I know you have objections to). > > > > > > > > > >>> > > > > > > > > > >>> I'm biased, but I find it much cleaner to define new > > > Thrift > > > > > > > > > >>>> structures (FileMetadataV3, etc.), rather than > > > painstakinly > > > > > > > document > > > > > > > > > >>>> which fields are to be omitted in V3. 
That would > > > achieve > > > > > three > > > > > > > > > goals: > > > > > > > > > >>>> 1) make the spec easier to read (even though it > would > > > be > > > > > > > physically > > > > > > > > > >>>> longer); 2) make it easier to produce a conformant > > > > > > > implementation > > > > > > > > > >>>> (special rules increase the risks of > misunderstandings > > > and > > > > > > > > > >>>> disagreements); 3) allow a later cleanup of the spec > > > once we > > > > > > > agree > > > > > > > > > to > > > > > > > > > >>>> get rid of V1 structs. > > > > > > > > > >>> > > > > > > > > > >>> There are trade-offs here. I agree with the benefits > you > > > > > listed > > > > > > > here. > > > > > > > > > >>> The benefits of reusing existing structs are: > > > > > > > > > >>> 1. Lowers the amount of boiler plate code mapping from > > one > > > to > > > > > the > > > > > > > other > > > > > > > > > >>> (i.e. simpler initial implementation), since I expect > it > > > will > > > > > be > > > > > > > a > > > > > > > > > while > > > > > > > > > >>> before we have standalone PAR3 files. > > > > > > > > > >>> 2. Allows for lower maintenance burden if there is > useful > > > new > > > > > > > metadata > > > > > > > > > >>> that we would like to see added to both structures > > > original and > > > > > > > "V3" > > > > > > > > > >>> structures. > > > > > > > > > >>> > > > > > > > > > >>> - The new encoding in that PR seems like it should be > > > moved to > > > > > a > > > > > > > > > >>>> separate PR and be discussed in the encodings > thread? > > > > > > > > > >>> > > > > > > > > > >>> > > > > > > > > > >>> I'll cross post on that thread. The main reason I > > > included it > > > > > in > > > > > > > my > > > > > > > > > >>> proposal is I think it provides random access for > members > > > out > > > > > of > > > > > > > the > > > > > > > > > box > > > > > > > > > >>> (as compared to the existing encodings). I think this > > > mostly > > > > > > goes > > > > > > > to > > > > > > > > > your > > > > > > > > > >>> third-point so I'll discuss below. > > > > > > > > > >>> > > > > > > > > > >>> - I'm a bit skeptical about moving Thrift lists into > data > > > > > pages, > > > > > > > rather > > > > > > > > > >>>> than, say, just embed the corresponding Thrift > > > serialization > > > > > > as > > > > > > > > > >>>> binary fields for lazy deserialization. > > > > > > > > > >>> > > > > > > > > > >>> I think this falls into 2 different concerns: > > > > > > > > > >>> 1. The format of how we serialize metadata. > > > > > > > > > >>> 2. Where the serialized metadata lives. > > > > > > > > > >>> > > > > > > > > > >>> For concern #1, I think we should be considering > treating > > > these > > > > > > > lists > > > > > > > > > as > > > > > > > > > >>> actual parquet data pages. This allows users to tune > > this > > > to > > > > > > > their > > > > > > > > > needs > > > > > > > > > >>> for size vs decoding speed, and make use of any > > > improvements to > > > > > > > > > encoding > > > > > > > > > >>> that happen in the future without a spec change. I > think > > > this > > > > > is > > > > > > > likely > > > > > > > > > >>> fairly valuable given the number of systems that cache > > > this > > > > > data. > > > > > > > The > > > > > > > > > >>> reason I introduced the new encoding was to provide an > > > option > > > > > > > that > > > > > > > > > could be > > > > > > > > > >>> as efficient as possible from a compute perspective. 
> > > > > > > > > >>> > > > > > > > > > >>> For concern #2, there is no reason encoding a page as a > > > thrift > > > > > > > Binary > > > > > > > > > >>> field would not work. The main reason I raised putting > > > them > > > > > > > outside of > > > > > > > > > >>> thrift is for greater control on deserialization (the > > > main > > > > > > > benefit > > > > > > > > > being > > > > > > > > > >>> avoiding copies) for implementations that have a Thrift > > > parser > > > > > > > that > > > > > > > > > doesn't > > > > > > > > > >>> allow these optimizations. In terms of a path forward > > > here, I > > > > > > > think > > > > > > > > > >>> understanding the performance and memory > characteristics > > > of > > > > > each > > > > > > > > > approach. > > > > > > > > > >>> I agree, if there isn't substantial savings from having > > > them be > > > > > > > > > outside the > > > > > > > > > >>> page, then it just adds complexity. > > > > > > > > > >>> > > > > > > > > > >>> Thanks, > > > > > > > > > >>> Micah > > > > > > > > > >>> > > > > > > > > > >>> > > > > > > > > > >>> > > > > > > > > > >>> > > > > > > > > > >>> > > > > > > > > > >>> On Tue, May 28, 2024 at 7:06 AM Antoine Pitrou < > > > > > > anto...@python.org > > > > > > > > > > > > > > > > > >>> wrote: > > > > > > > > > >>> > > > > > > > > > >>>> > > > > > > > > > >>>> Hello Micah, > > > > > > > > > >>>> > > > > > > > > > >>>> First, kudos for doing this! > > > > > > > > > >>>> > > > > > > > > > >>>> I like your attempt to put the "new" file metadata > after > > > the > > > > > > > legacy > > > > > > > > > >>>> one in > > https://github.com/apache/parquet-format/pull/250, > > > > > > > > and I > > > > > > > hope > > > > > > > > > it > > > > > > > > > >>>> can actually be made to work (it requires current > > > Parquet > > > > > > readers > > > > > > > to > > > > > > > > > >>>> allow/ignore arbitrary padding at the end of the v1 > > > Thrift > > > > > > > metadata). > > > > > > > > > >>>> > > > > > > > > > >>>> Some assorted comments on other changes that PR is > > doing: > > > > > > > > > >>>> > > > > > > > > > >>>> - I'm biased, but I find it much cleaner to define new > > > Thrift > > > > > > > > > >>>> structures (FileMetadataV3, etc.), rather than > > > painstakinly > > > > > > > document > > > > > > > > > >>>> which fields are to be omitted in V3. That would > > > achieve > > > > > three > > > > > > > > > goals: > > > > > > > > > >>>> 1) make the spec easier to read (even though it > would > > > be > > > > > > > physically > > > > > > > > > >>>> longer); 2) make it easier to produce a conformant > > > > > > > implementation > > > > > > > > > >>>> (special rules increase the risks of > misunderstandings > > > and > > > > > > > > > >>>> disagreements); 3) allow a later cleanup of the spec > > > once we > > > > > > > agree > > > > > > > > > to > > > > > > > > > >>>> get rid of V1 structs. > > > > > > > > > >>>> > > > > > > > > > >>>> - The new encoding in that PR seems like it should be > > > moved > > > > > to a > > > > > > > > > >>>> separate PR and be discussed in the encodings > thread? > > > > > > > > > >>>> > > > > > > > > > >>>> - I'm a bit skeptical about moving Thrift lists into > > > data > > > > > pages, > > > > > > > > > rather > > > > > > > > > >>>> than, say, just embed the corresponding Thrift > > > serialization > > > > > > as > > > > > > > > > >>>> binary fields for lazy deserialization. > > > > > > > > > >>>> > > > > > > > > > >>>> Regards > > > > > > > > > >>>> > > > > > > > > > >>>> Antoine. 
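As a rough illustration of the "new footer after the legacy one" layout
discussed above: the sketch below keeps the real PAR1 trailer (4-byte
little-endian footer length followed by the PAR1 magic) and probes for a
hypothetical appended marker. The PAR3 magic and its placement are
assumptions made up for this sketch, not what PR 250 or PR 254 specify.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class FooterProbe {
    // The real Parquet v1 trailer: 4-byte little-endian footer length + "PAR1".
    private static final byte[] PAR1 = "PAR1".getBytes(StandardCharsets.US_ASCII);
    // Hypothetical marker for an appended alternative footer; NOT part of the spec.
    private static final byte[] PAR3 = "PAR3".getBytes(StandardCharsets.US_ASCII);

    /** Reads the last 8 bytes, validates the PAR1 magic, returns the footer length. */
    static int readLegacyFooterLength(RandomAccessFile file) throws IOException {
        byte[] trailer = new byte[8];
        file.seek(file.length() - 8);
        file.readFully(trailer);
        int footerLength = ByteBuffer.wrap(trailer).order(ByteOrder.LITTLE_ENDIAN).getInt();
        for (int i = 0; i < 4; i++) {
            if (trailer[4 + i] != PAR1[i]) {
                throw new IOException("not a Parquet file (missing PAR1 magic)");
            }
        }
        return footerLength;
    }

    /**
     * Hypothetical: checks whether the declared footer region ends with a "PAR3"
     * marker just before the trailer. A legacy reader hands the whole region to
     * its Thrift parser, which is why the scheme depends on readers tolerating
     * trailing bytes ("padding") after the serialized FileMetaData.
     */
    static boolean hasAppendedFooter(RandomAccessFile file, int footerLength) throws IOException {
        if (footerLength < PAR3.length) {
            return false;
        }
        long footerEnd = file.length() - 8; // exclusive end of the footer region
        byte[] marker = new byte[4];
        file.seek(footerEnd - 4);
        file.readFully(marker);
        return Arrays.equals(marker, PAR3);
    }
}
```

The reason legacy readers keep working is that their trailer still points at
a region whose prefix is valid Thrift FileMetaData; everything after that
prefix is the "arbitrary padding" mentioned above, which is why the scheme
depends on Thrift readers tolerating trailing bytes.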
> > > > > > > > > >>>> > > > > > > > > > >>>> > > > > > > > > > >>>> > > > > > > > > > >>>> On Mon, 27 May 2024 23:06:37 -0700 > > > > > > > > > >>>> Micah Kornfield < > > > > > > > > > > > > > emkornfield-re5jqeeqqe8avxtiumwx3w-xmd5yjdbdmrexy1tmh2...@public.gmane.org > > > > > > > > > > > > > > > >>>> wrote: > > > > > > > > > >>>> > As a follow-up to the "V3" Discussions [1][2] I > wanted > > > to > > > > > > start > > > > > > > a > > > > > > > > > >>>> thread on > > > > > > > > > >>>> > improvements to the footer metadata. > > > > > > > > > >>>> > > > > > > > > > > >>>> > Based on conversation so far, there have been a few > > > > > proposals > > > > > > > > > >>>> [3][4][5] to > > > > > > > > > >>>> > help better support files with wide schemas and many > > > > > > > row-groups. I > > > > > > > > > >>>> think > > > > > > > > > >>>> > there are a lot of interesting ideas in each. It > would > > > be > > > > > good > > > > > > > to > > > > > > > > > get > > > > > > > > > >>>> > further feedback on these to make sure we aren't > > > missing > > > > > > > anything > > > > > > > > > and > > > > > > > > > >>>> > define a minimal first iteration for doing > > > experimental > > > > > > > benchmarking > > > > > > > > > >>>> to > > > > > > > > > >>>> > prove out an approach. > > > > > > > > > >>>> > > > > > > > > > > >>>> > I think the next steps would ideally be: > > > > > > > > > >>>> > 1. Come to a consensus on the overall approach. > > > > > > > > > >>>> > 2. Prototypes to Benchmark/test to validate the > > > approaches > > > > > > > defined > > > > > > > > > >>>> (if we > > > > > > > > > >>>> > can't come to consensus in item #1, this might help > > > choose a > > > > > > > > > >>>> direction). > > > > > > > > > >>>> > 3. Divide up any final approach into as > fine-grained > > > > > features > > > > > > > as > > > > > > > > > >>>> possible. > > > > > > > > > >>>> > 4. Implement across parquet-java, parquet-cpp, > > > parquet-rs > > > > > > (and > > > > > > > any > > > > > > > > > >>>> other > > > > > > > > > >>>> > implementations that we can get volunteers for). > > > > > > Additionally, > > > > > > > if > > > > > > > > > >>>> new APIs > > > > > > > > > >>>> > are needed to make use of the new structure, it > would > > > be > > > > > good > > > > > > > to try > > > > > > > > > >>>> to > > > > > > > > > >>>> > prototype against consumers of Parquet. > > > > > > > > > >>>> > > > > > > > > > > >>>> > Knowing that we have enough people interested in > doing > > > #3 is > > > > > > > > > critical > > > > > > > > > >>>> to > > > > > > > > > >>>> > success, so if you have time to devote, it would be > > > helpful > > > > > to > > > > > > > chime > > > > > > > > > >>>> in > > > > > > > > > >>>> > here (I know some people already noted they could > help > > > in > > > > > the > > > > > > > > > original > > > > > > > > > >>>> > thread). > > > > > > > > > >>>> > > > > > > > > > > >>>> > I think it is likely we will need either an in > person > > > sync > > > > > or > > > > > > > > > another > > > > > > > > > >>>> more > > > > > > > > > >>>> > focused design document could help. I am happy to > try > > > to > > > > > > > facilitate > > > > > > > > > >>>> this > > > > > > > > > >>>> > (once we have a better sense of who wants to be > > > involved and > > > > > > > what > > > > > > > > > time > > > > > > > > > >>>> > zones they are in I can schedule a sync if > necessary). 
> > > > > > > > > >>>> > > > > > > > > > > >>>> > Thanks, > > > > > > > > > >>>> > Micah > > > > > > > > > >>>> > > > > > > > > > > >>>> > [1] > > > > > > > > > > > > https://lists.apache.org/thread/5jyhzkwyrjk9z52g0b49g31ygnz73gxo > > > > > > > > > >>>> > [2] > > > > > > > > > >>>> > > > > > > > > > > >>>> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit > > > > > > > > > > > > > > > > > > > >>>> > [3] > https://github.com/apache/parquet-format/pull/242 > > > > > > > > > >>>> > [4] > https://github.com/apache/parquet-format/pull/248 > > > > > > > > > >>>> > [5] > https://github.com/apache/parquet-format/pull/250 > > > > > > > > > >>>> > > > > > > > > > > >>>> > > > > > > > > > >>>> > > > > > > > > > >>>> > > > > > > > > > >>>> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >