I probably didn't say it very clearly, but my opinion as a consumer of the Parquet spec is that the format needs a reset where encodings, logical types and other metadata that are not widely adopted are removed from the core spec and put in an "experimental" category from which we can later promote things based on community consensus.
On Fri, Dec 4, 2020 at 11:14 AM Tim Armstrong <[email protected]> wrote: > I think it would be good for the project to define a core set of features > that a Parquet implementation must support to be able to correctly read > files all written by another compliant writer with the same version. > > There are then additional extensions like page indices that are not > required to actually be able to read/write the files correctly. > > My understanding of how we got here is that the Parquet 2.0 spec defined a > number of new features and encodings that were only implemented in > parquet-mr. Many of these have not gotten widely implemented, and I'm also > not sure that the encodings are particularly well designed, e.g. I don't > think the delta encoding reflects current best practices in the literature. > > E.g. in Impala we did not implement many of the new Parquet 2.0 encodings > because it was a significant amount of work and we had a lot of competing > priorities. It's now a chicken and egg problem because there's isn't much > data encoded in that way out there and thus not much motivation to invest > in the work. We've had some similar headaches with logic types and > encodings, e.g. where the standard allows several degrees of freedom in how > to encode DECIMALs, including ones that don't make much sense. We've wasted > significant time adding support for DECIMALs that were written in strange > ways by another writer. > > In the meantime we did actually need to add other new features to Parquet > like page indices, so moved forward with those, which is part of how it's > turned into a bit of a matrix. > > My suggestion is to go through the encodings and features and flag the > ones that are not universally implemented as "EXPERIMENTAL" or something > similar, and make it clear that they are optional and may be deprecated if > they don't get adopted. I think that would give implementations a cleaner > baseline to compare against, then we can start adding core features on top > of that. > > IMO we shouldn't promote features to being part of the core spec until > there are at least 2 reference implementations that work with each other. > That way we can use the reference implementations to clarify any bugs in > the spec. > > This is all kinda hard and a little thankless so I don't blame anyone for > not taking it on yet! > > > * The thrift definition says that this is "Byte offset in file_path to > the ColumnMetaData" but this doesn't make too much sense, as the > ColumnMetaData is contained inside the ColumnChunk itself (and therefore, > you cannot even know its offset when writing the column chunk). > It probably makes sense to update the spec to reflect what actual > implementations do. > > > * This is defined to be int16_t, which can quickly overflow for very > large parquet files or smaller row groups. What should I do if I anticipate > that my library will be used to write files where this will overflow? Just > use it for the first 2^15 row groups and then leave it out? Or don't write > it at all for any row group? > > 16k+ row groups in a file seems goofy to me. I can't think of a way that > you could hit that limit with a reasonable file layout. Row groups are > meant to be big enough that you amortise the metadata overhead and get good > compression over a column chunk. I'd expect row groups to be 10s of MB at > least typically. Maybe this would be something that should be clarified in > the spec or other docs. > > > On Fri, Oct 16, 2020 at 3:46 PM Micah Kornfield <[email protected]> > wrote: > >> > >> > IMHO, shouldn't the spec mention - quite precisely - what versions exist >> > and what features can be used in which version, so an implementation can >> > say "yes, I can fully write this versions" or "no, I can't" instead of >> > having a fuzzy set of features where some are described to "not work on >> > most readers". Even if such a precise mapping isn't defined today, >> > shouldn't it be defined at least in retrospect, so that implementations >> can >> > start checking and documenting, what versions they can write? >> >> >> Just an FYI I've already raised these concerns recently [1]. But I >> completely agree this is needed, I'd offer to update the docs, but I don't >> really have the historical context to do it. >> >> [1] >> >> https://mail-archives.apache.org/mod_mbox/parquet-dev/202010.mbox/%3CCAK7Z5T8c4Zpwj_yWZ9Kr4gTueGrMq%2B0NAqADJ7rnkBaUqtPwuQ%40mail.gmail.com%3E >> >> On Fri, Oct 16, 2020 at 3:27 PM Jan Finis <[email protected]> >> wrote: >> >> > Hey folks, >> > >> > First of all, thanks for this great project! >> > >> > I am currently writing a library for reading/writing parquet, and I am a >> > bit confused by some points, which I would like to discuss here. I think >> > they will be relevant to anyone wanting to write own parquet >> > reading/writing logic in a new language. >> > >> > >> > MetaData: >> > >> > There are some fields in the metadata whose semantics is unclear. Can >> you >> > clarify this: >> > >> > ColumnChunk.file_offset: >> > >> > * The thrift definition says that this is "Byte offset in file_path to >> the >> > ColumnMetaData" but this doesn't make too much sense, as the >> ColumnMetaData >> > is contained inside the ColumnChunk itself (and therefore, you cannot >> even >> > know its offset when writing the column chunk). >> > parquet-mr seems to just write the offset of the first data/dict page >> into >> > this field, which doesn't seem to comply with what the spec says (but >> is at >> > least possible). What is this field supposed to be used for? My reader >> just >> > ignores it, but my writer should make sure to write the most sensible >> value >> > in here, in case some reader relies on it. >> > >> > RowGroup.ordinal: >> > >> > * This seems to be just the ordinal of the row group. However, this >> > information seems redundant, as the row-groups meta data is already >> stored >> > in a specific order in the footer. Should the ordinal just be equal to >> that >> > order, or can it differ? >> > * Will any reader rely on this, since it's optional? >> > * This is defined to be int16_t, which can quickly overflow for very >> large >> > parquet files or smaller row groups. What should I do if I anticipate >> that >> > my library will be used to write files where this will overflow? Just >> use >> > it for the first 2^15 row groups and then leave it out? Or don't write >> it >> > at all for any row group? >> > >> > Compatibility: >> > >> > >> > 1. For writing: What compatibility versions are there? The spec >> talks a >> > lot about compatibility and features that not all readers can read, but >> it >> > never specifies things like "this encoding is version X.Y and upwards". >> > So, when writing a parquet file, I have some problems in choosing >> features. >> > E.g., >> > * should I write DataPageV1 or DataPageV2? >> > * Should I use DELTA_BYTE_ARRAY/DELTA_BINARY_PACKED? >> > * Should I use BYTE_STREAM_SPLIT? >> > >> > 2. For reading: I have implemented all encodings except BIT_PACKED, >> > which seems deprecated for a long time (and would require all the >> > bitunpack-on-big-endian logic, which would be a lot of work). How safe >> can >> > I be that this encoding is no longer used? When was it last used? Since >> > when is it deprecated? >> > >> > >> > IMHO, shouldn't the spec mention - quite precisely - what versions exist >> > and what features can be used in which version, so an implementation can >> > say "yes, I can fully write this versions" or "no, I can't" instead of >> > having a fuzzy set of features where some are described to "not work on >> > most readers". Even if such a precise mapping isn't defined today, >> > shouldn't it be defined at least in retrospect, so that implementations >> can >> > start checking and documenting, what versions they can write? >> > >> > To give you an example where I already encountered this: parquet-cpp >> > doesn't seem to be able to read DELTA_BINARY_PACKED yet, so any software >> > built on top of it (e.g., pyarrow) cannot read such files. I created >> such a >> > file with parquet-mr and was very surprised when pyarrow couldn't read >> it. >> > >> > >> > >> > >> > Source of truth: >> > >> > In general, what is the agreed-upon source of truth for parquet? Is it >> the >> > documents parquet-format, or is it the implementation in parquet-mr? >> These >> > differ sometimes, so which one should I adhere to if they do? >> > >> > >> > Thanks in advance for any answers. >> > Cheers, >> > Jan >> > >> >
