I tried to get some level of clarification in a PR [1]. It kind of stalled because we had further conversation on a sync call a while ago that I have not had a chance to follow-up on. I'm happy to revise if we can come up with some sort of consensus for experimental/non-experimental.
The things that were decided on the sync call were: 1. Does it make sense to deprecate data page v2. It sounded like this isn't widely used and the main benefits are either unused [uncompressed rep/def level], or are already required from other features (e.g. record level alignment within a data page is required with indices IIUC). 2. Potentially accepting that all encodings in V1 and V2 are acceptable for V1 data pages. I think there is a lot more conversation here. But I agree with everything said on this thread we more clarity on what is required/not required for implementations. e.g. I don't think the delta encoding reflects current best practices in > the literature. Could you expand on this? [1] https://github.com/apache/parquet-format/pull/163 On Fri, Dec 4, 2020 at 3:57 PM Antoine Pitrou <[email protected]> wrote: > On Fri, 4 Dec 2020 11:21:58 -0800 > Tim Armstrong > <[email protected]> wrote: > > I probably didn't say it very clearly, but my opinion as a consumer of > the > > Parquet spec is that the format needs a reset where encodings, logical > > types and other metadata that are not widely adopted are removed from the > > core spec and put in an "experimental" category from which we can later > > promote things based on community consensus. > > I'm not a participant in this community but I'm one of the people who > maintain parquet-cpp as part the Arrow C++ codebase, and your proposal > sounds like a very good idea to me. > > Even the core parts of Parquet can be tricky to implement efficiently, > adding more and more features and encodings is IMHO very dangerous for > the format's portability. > > Regards > > Antoine. > > > > > > On Fri, Dec 4, 2020 at 11:14 AM Tim Armstrong <[email protected]> > > wrote: > > > > > I think it would be good for the project to define a core set of > features > > > that a Parquet implementation must support to be able to correctly read > > > files all written by another compliant writer with the same version. > > > > > > There are then additional extensions like page indices that are not > > > required to actually be able to read/write the files correctly. > > > > > > My understanding of how we got here is that the Parquet 2.0 spec > defined a > > > number of new features and encodings that were only implemented in > > > parquet-mr. Many of these have not gotten widely implemented, and I'm > also > > > not sure that the encodings are particularly well designed, e.g. I > don't > > > think the delta encoding reflects current best practices in the > literature. > > > > > > E.g. in Impala we did not implement many of the new Parquet 2.0 > encodings > > > because it was a significant amount of work and we had a lot of > competing > > > priorities. It's now a chicken and egg problem because there's isn't > much > > > data encoded in that way out there and thus not much motivation to > invest > > > in the work. We've had some similar headaches with logic types and > > > encodings, e.g. where the standard allows several degrees of freedom > in how > > > to encode DECIMALs, including ones that don't make much sense. We've > wasted > > > significant time adding support for DECIMALs that were written in > strange > > > ways by another writer. > > > > > > In the meantime we did actually need to add other new features to > Parquet > > > like page indices, so moved forward with those, which is part of how > it's > > > turned into a bit of a matrix. > > > > > > My suggestion is to go through the encodings and features and flag the > > > ones that are not universally implemented as "EXPERIMENTAL" or > something > > > similar, and make it clear that they are optional and may be > deprecated if > > > they don't get adopted. I think that would give implementations a > cleaner > > > baseline to compare against, then we can start adding core features on > top > > > of that. > > > > > > IMO we shouldn't promote features to being part of the core spec until > > > there are at least 2 reference implementations that work with each > other. > > > That way we can use the reference implementations to clarify any bugs > in > > > the spec. > > > > > > This is all kinda hard and a little thankless so I don't blame anyone > for > > > not taking it on yet! > > > > > > > * The thrift definition says that this is "Byte offset in file_path > to > > > the ColumnMetaData" but this doesn't make too much sense, as the > > > ColumnMetaData is contained inside the ColumnChunk itself (and > therefore, > > > you cannot even know its offset when writing the column chunk). > > > It probably makes sense to update the spec to reflect what actual > > > implementations do. > > > > > > > * This is defined to be int16_t, which can quickly overflow for > very > > > large parquet files or smaller row groups. What should I do if I > anticipate > > > that my library will be used to write files where this will overflow? > Just > > > use it for the first 2^15 row groups and then leave it out? Or don't > write > > > it at all for any row group? > > > > > > 16k+ row groups in a file seems goofy to me. I can't think of a way > that > > > you could hit that limit with a reasonable file layout. Row groups are > > > meant to be big enough that you amortise the metadata overhead and get > good > > > compression over a column chunk. I'd expect row groups to be 10s of MB > at > > > least typically. Maybe this would be something that should be > clarified in > > > the spec or other docs. > > > > > > > > > On Fri, Oct 16, 2020 at 3:46 PM Micah Kornfield <[email protected] > > > > > wrote: > > > > > >> > > > >> > IMHO, shouldn't the spec mention - quite precisely - what versions > exist > > >> > and what features can be used in which version, so an > implementation can > > >> > say "yes, I can fully write this versions" or "no, I can't" instead > of > > >> > having a fuzzy set of features where some are described to "not > work on > > >> > most readers". Even if such a precise mapping isn't defined today, > > >> > shouldn't it be defined at least in retrospect, so that > implementations > > >> can > > >> > start checking and documenting, what versions they can write? > > >> > > >> > > >> Just an FYI I've already raised these concerns recently [1]. But I > > >> completely agree this is needed, I'd offer to update the docs, but I > don't > > >> really have the historical context to do it. > > >> > > >> [1] > > >> > > >> > https://mail-archives.apache.org/mod_mbox/parquet-dev/202010.mbox/%3CCAK7Z5T8c4Zpwj_yWZ9Kr4gTueGrMq%2B0NAqADJ7rnkBaUqtPwuQ%40mail.gmail.com%3E > > >> > > >> On Fri, Oct 16, 2020 at 3:27 PM Jan Finis <[email protected] > > > > >> wrote: > > >> > > >> > Hey folks, > > >> > > > >> > First of all, thanks for this great project! > > >> > > > >> > I am currently writing a library for reading/writing parquet, and I > am a > > >> > bit confused by some points, which I would like to discuss here. I > think > > >> > they will be relevant to anyone wanting to write own parquet > > >> > reading/writing logic in a new language. > > >> > > > >> > > > >> > MetaData: > > >> > > > >> > There are some fields in the metadata whose semantics is unclear. > Can > > >> you > > >> > clarify this: > > >> > > > >> > ColumnChunk.file_offset: > > >> > > > >> > * The thrift definition says that this is "Byte offset in file_path > to > > >> the > > >> > ColumnMetaData" but this doesn't make too much sense, as the > > >> ColumnMetaData > > >> > is contained inside the ColumnChunk itself (and therefore, you > cannot > > >> even > > >> > know its offset when writing the column chunk). > > >> > parquet-mr seems to just write the offset of the first data/dict > page > > >> into > > >> > this field, which doesn't seem to comply with what the spec says > (but > > >> is at > > >> > least possible). What is this field supposed to be used for? My > reader > > >> just > > >> > ignores it, but my writer should make sure to write the most > sensible > > >> value > > >> > in here, in case some reader relies on it. > > >> > > > >> > RowGroup.ordinal: > > >> > > > >> > * This seems to be just the ordinal of the row group. However, this > > >> > information seems redundant, as the row-groups meta data is > already > > >> stored > > >> > in a specific order in the footer. Should the ordinal just be equal > to > > >> that > > >> > order, or can it differ? > > >> > * Will any reader rely on this, since it's optional? > > >> > * This is defined to be int16_t, which can quickly overflow for > very > > >> large > > >> > parquet files or smaller row groups. What should I do if I > anticipate > > >> that > > >> > my library will be used to write files where this will overflow? > Just > > >> use > > >> > it for the first 2^15 row groups and then leave it out? Or don't > write > > >> it > > >> > at all for any row group? > > >> > > > >> > Compatibility: > > >> > > > >> > > > >> > 1. For writing: What compatibility versions are there? The spec > > >> talks a > > >> > lot about compatibility and features that not all readers can read, > but > > >> it > > >> > never specifies things like "this encoding is version X.Y and > upwards". > > >> > So, when writing a parquet file, I have some problems in choosing > > >> features. > > >> > E.g., > > >> > * should I write DataPageV1 or DataPageV2? > > >> > * Should I use DELTA_BYTE_ARRAY/DELTA_BINARY_PACKED? > > >> > * Should I use BYTE_STREAM_SPLIT? > > >> > > > >> > 2. For reading: I have implemented all encodings except > BIT_PACKED, > > >> > which seems deprecated for a long time (and would require all the > > >> > bitunpack-on-big-endian logic, which would be a lot of work). How > safe > > >> can > > >> > I be that this encoding is no longer used? When was it last used? > Since > > >> > when is it deprecated? > > >> > > > >> > > > >> > IMHO, shouldn't the spec mention - quite precisely - what versions > exist > > >> > and what features can be used in which version, so an > implementation can > > >> > say "yes, I can fully write this versions" or "no, I can't" instead > of > > >> > having a fuzzy set of features where some are described to "not > work on > > >> > most readers". Even if such a precise mapping isn't defined today, > > >> > shouldn't it be defined at least in retrospect, so that > implementations > > >> can > > >> > start checking and documenting, what versions they can write? > > >> > > > >> > To give you an example where I already encountered this: parquet-cpp > > >> > doesn't seem to be able to read DELTA_BINARY_PACKED yet, so any > software > > >> > built on top of it (e.g., pyarrow) cannot read such files. I > created > > >> such a > > >> > file with parquet-mr and was very surprised when pyarrow couldn't > read > > >> it. > > >> > > > >> > > > >> > > > >> > > > >> > Source of truth: > > >> > > > >> > In general, what is the agreed-upon source of truth for parquet? Is > it > > >> the > > >> > documents parquet-format, or is it the implementation in > parquet-mr? > > >> These > > >> > differ sometimes, so which one should I adhere to if they do? > > >> > > > >> > > > >> > Thanks in advance for any answers. > > >> > Cheers, > > >> > Jan > > >> > > > >> > > > > > > > > >
