> IMHO, shouldn't the spec mention - quite precisely - what versions exist
> and what features can be used in which version, so an implementation can
> say "yes, I can fully write this versions" or "no, I can't" instead of
> having a fuzzy set of features where some are described to "not work on
> most readers". Even if such a precise mapping isn't defined today,
> shouldn't it be defined at least in retrospect, so that implementations can
> start checking and documenting, what versions they can write?

Just an FYI, I've already raised these concerns recently [1]. I completely
agree this is needed. I'd offer to update the docs, but I don't really have
the historical context to do it.

[1]
https://mail-archives.apache.org/mod_mbox/parquet-dev/202010.mbox/%3CCAK7Z5T8c4Zpwj_yWZ9Kr4gTueGrMq%2B0NAqADJ7rnkBaUqtPwuQ%40mail.gmail.com%3E

On Fri, Oct 16, 2020 at 3:27 PM Jan Finis <[email protected]> wrote:

> Hey folks,
>
> First of all, thanks for this great project!
>
> I am currently writing a library for reading/writing parquet, and I am a
> bit confused by some points, which I would like to discuss here. I think
> they will be relevant to anyone wanting to write own parquet
> reading/writing logic in a new language.
>
>
> MetaData:
>
> There are some fields in the metadata whose semantics is unclear. Can you
> clarify this:
>
> ColumnChunk.file_offset:
>
> * The thrift definition says that this is "Byte offset in file_path to the
> ColumnMetaData" but this doesn't make too much sense, as the ColumnMetaData
> is contained inside the ColumnChunk itself (and therefore, you cannot even
> know its offset when writing the column chunk).
> parquet-mr seems to just write the offset of the first data/dict page into
> this field, which doesn't seem to comply with what the spec says (but is at
> least possible). What is this field supposed to be used for? My reader just
> ignores it, but my writer should make sure to write the most sensible value
> in here, in case some reader relies on it.
>
> RowGroup.ordinal:
>
> * This seems to be just the ordinal of the row group. However, this
> information seems redundant, as the row-groups meta data is already stored
> in a specific order in the footer. Should the ordinal just be equal to that
> order, or can it differ?
> * Will any reader rely on this, since it's optional?
> * This is defined to be int16_t, which can quickly overflow for very large
> parquet files or smaller row groups. What should I do if I anticipate that
> my library will be used to write files where this will overflow? Just use
> it for the first 2^15 row groups and then leave it out? Or don't write it
> at all for any row group?
>
> Compatibility:
>
>
> 1. For writing: What compatibility versions are there? The spec talks a
> lot about compatibility and features that not all readers can read, but it
> never specifies things like "this encoding is version X.Y and upwards".
> So, when writing a parquet file, I have some problems in choosing features.
> E.g.,
> * should I write DataPageV1 or DataPageV2?
> * Should I use DELTA_BYTE_ARRAY/DELTA_BINARY_PACKED?
> * Should I use BYTE_STREAM_SPLIT?
>
> 2. For reading: I have implemented all encodings except BIT_PACKED,
> which seems deprecated for a long time (and would require all the
> bitunpack-on-big-endian logic, which would be a lot of work). How safe can
> I be that this encoding is no longer used? When was it last used? Since
> when is it deprecated?
>
>
> IMHO, shouldn't the spec mention - quite precisely - what versions exist
> and what features can be used in which version, so an implementation can
> say "yes, I can fully write this versions" or "no, I can't" instead of
> having a fuzzy set of features where some are described to "not work on
> most readers". Even if such a precise mapping isn't defined today,
> shouldn't it be defined at least in retrospect, so that implementations can
> start checking and documenting, what versions they can write?
>
> To give you an example where I already encountered this: parquet-cpp
> doesn't seem to be able to read DELTA_BINARY_PACKED yet, so any software
> built on top of it (e.g., pyarrow) cannot read such files. I created such a
> file with parquet-mr and was very surprised when pyarrow couldn't read it.
>
>
>
>
> Source of truth:
>
> In general, what is the agreed-upon source of truth for parquet? Is it the
> documents parquet-format, or is it the implementation in parquet-mr? These
> differ sometimes, so which one should I adhere to if they do?
>
>
> Thanks in advance for any answers.
> Cheers,
> Jan
>
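
Until the spec defines a precise version-to-feature mapping, one thing an
implementation could do is declare the set of encodings it supports and check
a file's encodings against that set. A minimal sketch, assuming hypothetical
names (READER_ENCODINGS, unsupported_encodings, can_read). The encoding names
are real Parquet encodings, but the reader profile below is purely
illustrative and says nothing authoritative about any existing library:

```python
# Hypothetical reader profile: the encodings a reader declares it can decode.
# Illustrative only; not an authoritative list for any real implementation.
READER_ENCODINGS = frozenset({"PLAIN", "RLE", "DELTA_BYTE_ARRAY"})


def unsupported_encodings(file_encodings, reader_encodings=READER_ENCODINGS):
    """Return the sorted encodings used by a file that the reader cannot decode."""
    return sorted(set(file_encodings) - set(reader_encodings))


def can_read(file_encodings, reader_encodings=READER_ENCODINGS):
    """Return True if every encoding used by the file is in the reader's set."""
    return not unsupported_encodings(file_encodings, reader_encodings)
```

A writer could run the same check against the profiles of the readers it
targets before choosing, say, DELTA_BINARY_PACKED or BYTE_STREAM_SPLIT.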

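
On the RowGroup.ordinal question: a signed 16-bit field holds values up to
2^15 - 1 = 32767, so a file with more than 2^15 row groups cannot give every
row group an ordinal. The two options raised in the quoted message can be
sketched as below. The function name and the all_or_nothing flag are
hypothetical, and neither policy is confirmed by the spec; this only makes
the boundary concrete:

```python
INT16_MAX = 2**15 - 1  # 32767, the largest value a signed 16-bit field can hold


def ordinals_for(num_row_groups, all_or_nothing=True):
    """Return one ordinal per row group, with None where the field is omitted.

    all_or_nothing=True: omit the ordinal for every row group if any would overflow.
    all_or_nothing=False: write ordinals while they fit, omit them afterwards.
    """
    if all_or_nothing and num_row_groups - 1 > INT16_MAX:
        return [None] * num_row_groups
    return [i if i <= INT16_MAX else None for i in range(num_row_groups)]
```

With all_or_nothing=False, row group 32767 still gets an ordinal and row
group 32768 is the first one written without it.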