Hey folks,

First of all, thanks for this great project!
I am currently writing a library for reading/writing parquet, and I am a bit confused by some points, which I would like to discuss here. I think they will be relevant to anyone wanting to write their own parquet reading/writing logic in a new language.

MetaData:

There are some fields in the metadata whose semantics are unclear. Can you clarify these?

ColumnChunk.file_offset:
* The thrift definition says that this is "Byte offset in file_path to the ColumnMetaData", but this doesn't make much sense, as the ColumnMetaData is contained inside the ColumnChunk itself (and therefore you cannot even know its offset when writing the column chunk). parquet-mr seems to just write the offset of the first data/dict page into this field, which doesn't seem to comply with what the spec says (but is at least possible). What is this field supposed to be used for? My reader just ignores it, but my writer should make sure to write the most sensible value in here, in case some reader relies on it.

RowGroup.ordinal:
* This seems to be just the ordinal of the row group. However, this information seems redundant, as the row groups' metadata is already stored in a specific order in the footer. Should the ordinal just be equal to that order, or can it differ?
* Will any reader rely on this, since it's optional?
* This is defined to be int16_t, which can quickly overflow for very large parquet files or smaller row groups. What should I do if I anticipate that my library will be used to write files where this will overflow? Just use it for the first 2^15 row groups and then leave it out? Or don't write it at all for any row group?

Compatibility:

1. For writing: What compatibility versions are there? The spec talks a lot about compatibility and features that not all readers can read, but it never specifies things like "this encoding is version X.Y and upwards". So, when writing a parquet file, I have some problems in choosing features. E.g.,
   * Should I write DataPageV1 or DataPageV2?
   * Should I use DELTA_BYTE_ARRAY/DELTA_BINARY_PACKED?
   * Should I use BYTE_STREAM_SPLIT?

2. For reading: I have implemented all encodings except BIT_PACKED, which seems to have been deprecated for a long time (and would require all the bitunpack-on-big-endian logic, which would be a lot of work). How safe can I be that this encoding is no longer used? When was it last used? Since when has it been deprecated?

IMHO, shouldn't the spec mention, quite precisely, what versions exist and what features can be used in which version, so an implementation can say "yes, I can fully write this version" or "no, I can't" instead of having a fuzzy set of features where some are described to "not work on most readers"? Even if such a precise mapping isn't defined today, shouldn't it be defined at least in retrospect, so that implementations can start checking and documenting what versions they can write?

To give you an example where I already encountered this: parquet-cpp doesn't seem to be able to read DELTA_BINARY_PACKED yet, so any software built on top of it (e.g., pyarrow) cannot read such files. I created such a file with parquet-mr and was very surprised when pyarrow couldn't read it.

Source of truth:

In general, what is the agreed-upon source of truth for parquet? Is it the documents in parquet-format, or is it the implementation in parquet-mr? These sometimes differ, so which one should I adhere to when they do?

Thanks in advance for any answers.

Cheers,
Jan
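
P.S. To make the RowGroup.ordinal overflow question concrete, here is a minimal Python sketch of the "write it for the first 2^15 row groups, then leave it out" option. The helper name `ordinal_or_none` and the constant `I16_MAX` are made up for illustration and are not part of any parquet library. Whether readers accept a file where only some row groups carry an ordinal is exactly what I am asking, so this only illustrates the option and is not a recommendation.

```python
# Hypothetical helper for illustration only; not taken from any parquet library.
I16_MAX = 2**15 - 1  # largest value a signed 16-bit field can hold (32767)


def ordinal_or_none(row_group_index):
    """Return the value to store in RowGroup.ordinal, or None to omit the field.

    Assumption: the ordinal equals the row group's position in the footer.
    """
    if row_group_index < 0:
        raise ValueError("row group index must be non-negative")
    # Indices 0..32767 (2^15 values) fit; anything larger would overflow int16.
    return row_group_index if row_group_index <= I16_MAX else None


print(ordinal_or_none(0))      # 0
print(ordinal_or_none(32767))  # 32767
print(ordinal_or_none(32768))  # None
```

The alternative ("don't write it at all") would simply return None for every index once the writer expects more than 2^15 row groups.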
