Parquet File Meta Data & Compatibility

Jan Finis Fri, 16 Oct 2020 15:27:56 -0700

Hey folks,

First of all, thanks for this great project!


I am currently writing a library for reading and writing Parquet files, and I
am a bit confused by some points that I would like to discuss here. I think
they will be relevant to anyone who wants to write their own Parquet
reading/writing logic in a new language.


MetaData:

There are some fields in the metadata whose semantics are unclear to me. Could
you clarify them?

ColumnChunk.file_offset:

* The thrift definition says that this is "Byte offset in file_path to the 
ColumnMetaData" but this doesn't make too much sense, as the ColumnMetaData is 
contained inside the ColumnChunk itself (and therefore, you cannot even know 
its offset when writing the column chunk).
parquet-mr seems to just write the offset of the first data/dict page into this 
field, which doesn't seem to comply with what the spec says (but is at least 
possible). What is this field supposed to be used for? My reader just ignores 
it, but my writer should make sure to write the most sensible value in here, in 
case some reader relies on it.
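
For reference, this is roughly how a reader can locate a column chunk without
touching file_offset. It is a minimal sketch, assuming the chunk begins at the
dictionary page when one is present and at the first data page otherwise.
ColumnMeta and chunk_start are hypothetical names; only the two offset fields
mirror ColumnMetaData in the Thrift definition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnMeta:
    # Hypothetical stand-in for the two offset fields of ColumnMetaData.
    data_page_offset: int
    dictionary_page_offset: Optional[int] = None  # None means "field not set"


def chunk_start(meta: ColumnMeta) -> int:
    """Return the byte offset at which the column chunk's pages begin.

    Assumption: a dictionary page, if present, precedes the data pages,
    so the smaller of the two offsets is the start of the chunk.
    ColumnChunk.file_offset is not consulted at all.
    """
    if meta.dictionary_page_offset is None:
        return meta.data_page_offset
    return min(meta.dictionary_page_offset, meta.data_page_offset)
```

A reader built this way gives the same result whatever a writer puts into
file_offset, which is why the question above is mostly about what a writer
should emit.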

RowGroup.ordinal:

* This seems to be just the ordinal of the row group. However, this information
seems redundant, as the row group metadata is already stored in a specific
order in the footer. Should the ordinal always be equal to the position in
that order, or can it differ?
* Will any reader rely on this, since it's optional?
* This is defined to be int16_t, which can quickly overflow for very large 
parquet files or smaller row groups. What should I do if I anticipate that my 
library will be used to write files where this will overflow? Just use it for 
the first 2^15 row groups and then leave it out? Or don't write it at all for 
any row group?
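
To make the overflow concrete, here is a minimal sketch of the first option:
write the ordinal while it still fits into a signed 16-bit integer and leave
the optional field unset afterwards. The function name is hypothetical and not
taken from any Parquet library, and whether readers tolerate a file where only
some row groups carry an ordinal is exactly what I am unsure about.

```python
from typing import Optional

I16_MAX = 2**15 - 1  # largest value a signed 16-bit integer can hold (32767)


def ordinal_for_row_group(index: int) -> Optional[int]:
    """Return the ordinal to write for the row group at position `index`.

    Returns None when the index no longer fits into 16 bits, meaning the
    writer should leave the optional field unset for that row group.
    """
    if index < 0:
        raise ValueError("row group index must be non-negative")
    return index if index <= I16_MAX else None
```

With this, indices 0 through 32767 (the first 2^15 row groups) get an ordinal
and every later row group gets none.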

Compatibility:


  1.  For writing: What compatibility versions are there? The spec talks a lot 
about compatibility and features that not all readers can read, but it never 
specifies things like "this encoding is version X.Y and upwards".
So, when writing a Parquet file, I have trouble choosing which features to use.
E.g.,
* should I write DataPageV1 or DataPageV2?
* Should I use DELTA_BYTE_ARRAY/DELTA_BINARY_PACKED?
* Should I use BYTE_STREAM_SPLIT?

  2.  For reading: I have implemented all encodings except BIT_PACKED, which
seems to have been deprecated for a long time (and would require all the
bitunpack-on-big-endian logic, which would be a lot of work). How safe can I be
that this encoding is no longer used? When was it last used? Since when has it
been deprecated?


IMHO, the spec should state quite precisely which versions exist and which
features can be used in which version, so that an implementation can say "yes,
I can fully write this version" or "no, I can't", instead of having a fuzzy set
of features where some are described as "not working on most readers". Even if
such a precise mapping isn't defined today, shouldn't it at least be defined in
retrospect, so that implementations can start checking and documenting which
versions they can write?

To give you an example where I already encountered this: parquet-cpp doesn't 
seem to be able to read DELTA_BINARY_PACKED yet, so any software built on top 
of it (e.g., pyarrow) cannot read such files. I created such a file with 
parquet-mr and was very surprised when pyarrow couldn't read it.




Source of truth:

In general, what is the agreed-upon source of truth for Parquet? Is it the
documents in parquet-format, or is it the implementation in parquet-mr? These
sometimes differ, so which one should I adhere to when they do?


Thanks in advance for any answers.
Cheers,
Jan
