Re: Parquet File Meta Data & Compatibility

Micah Kornfield Fri, 16 Oct 2020 15:47:01 -0700

>
> IMHO, shouldn't the spec mention - quite precisely - what versions exist
> and what features can be used in which version, so an implementation can
> say "yes, I can fully write this version" or "no, I can't" instead of
> having a fuzzy set of features where some are described to "not work on
> most readers". Even if such a precise mapping isn't defined today,
> shouldn't it be defined at least in retrospect, so that implementations can
> start checking and documenting what versions they can write?



Just an FYI, I've already raised these concerns recently [1]. I completely
agree this is needed. I'd offer to update the docs, but I don't really have
the historical context to do it.

[1]
https://mail-archives.apache.org/mod_mbox/parquet-dev/202010.mbox/%3CCAK7Z5T8c4Zpwj_yWZ9Kr4gTueGrMq%2B0NAqADJ7rnkBaUqtPwuQ%40mail.gmail.com%3E

On Fri, Oct 16, 2020 at 3:27 PM Jan Finis <[email protected]>
wrote:

> Hey folks,
>
> First of all, thanks for this great project!
>
> I am currently writing a library for reading/writing parquet, and I am a
> bit confused by some points, which I would like to discuss here. I think
> they will be relevant to anyone wanting to write their own parquet
> reading/writing logic in a new language.
>
>
> MetaData:
>
> There are some fields in the metadata whose semantics are unclear. Can you
> clarify these:
>
> ColumnChunk.file_offset:
>
> * The thrift definition says that this is "Byte offset in file_path to the
> ColumnMetaData" but this doesn't make too much sense, as the ColumnMetaData
> is contained inside the ColumnChunk itself (and therefore, you cannot even
> know its offset when writing the column chunk).
> parquet-mr seems to just write the offset of the first data/dict page into
> this field, which doesn't seem to comply with what the spec says (but is at
> least possible). What is this field supposed to be used for? My reader just
> ignores it, but my writer should make sure to write the most sensible value
> in here, in case some reader relies on it.
>
> RowGroup.ordinal:
>
> * This seems to be just the ordinal of the row group. However, this
> information seems redundant, as the row-groups meta data is already stored
> in a specific order in the footer. Should the ordinal just be equal to that
> order, or can it differ?
> * Will any reader rely on this, since it's optional?
> * This is defined to be int16_t, which can quickly overflow for very large
> parquet files or smaller row groups. What should I do if I anticipate that
> my library will be used to write files where this will overflow? Just use
> it for the first 2^15 row groups and then leave it out? Or don't write it
> at all for any row group?
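
One of the options raised above (write the ordinal for the first 2^15 row
groups, then leave it out) could be sketched roughly as follows. This is only
an illustration of the overflow boundary, not behaviour prescribed by the spec
or taken from parquet-mr; `ordinal_for` and `I16_MAX` are hypothetical names
introduced for this example.

```python
from typing import Optional

# Largest value a signed 16-bit integer (Thrift i16) can hold.
I16_MAX = 2**15 - 1


def ordinal_for(row_group_index: int) -> Optional[int]:
    """Return the ordinal to write for a row group, or None to omit the field.

    Assumption for this sketch: the ordinal equals the row group's position
    in the footer, and the optional field is simply left out once the
    position no longer fits into an i16.
    """
    if row_group_index < 0:
        raise ValueError("row group index must be non-negative")
    if row_group_index > I16_MAX:
        return None  # would overflow i16, so omit the optional field
    return row_group_index
```

A writer following the other option (never writing the field) would skip this
check entirely; which of the two readers expect is exactly the open question.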
>
> Compatibility:
>
>
>   1.  For writing: What compatibility versions are there? The spec talks a
> lot about compatibility and features that not all readers can read, but it
> never specifies things like "this encoding is version X.Y and upwards".
> So, when writing a parquet file, I have some problems in choosing features.
> E.g.,
> * should I write DataPageV1 or DataPageV2?
> * Should I use DELTA_BYTE_ARRAY/DELTA_BINARY_PACKED?
> * Should I use BYTE_STREAM_SPLIT?
>
>   2.  For reading: I have implemented all encodings except BIT_PACKED,
> which seems to have been deprecated for a long time (and would require all the
> bitunpack-on-big-endian logic, which would be a lot of work). How safe can
> I be that this encoding is no longer used? When was it last used? Since
> when is it deprecated?
>
>
> IMHO, shouldn't the spec mention - quite precisely - what versions exist
> and what features can be used in which version, so an implementation can
> say "yes, I can fully write this version" or "no, I can't" instead of
> having a fuzzy set of features where some are described to "not work on
> most readers". Even if such a precise mapping isn't defined today,
> shouldn't it be defined at least in retrospect, so that implementations can
> start checking and documenting what versions they can write?
>
> To give you an example where I already encountered this: parquet-cpp
> doesn't seem to be able to read DELTA_BINARY_PACKED yet, so any software
> built on top of it (e.g., pyarrow) cannot read such files. I created such a
> file with parquet-mr and was very surprised when pyarrow couldn't read it.
>
>
>
>
> Source of truth:
>
> In general, what is the agreed-upon source of truth for parquet? Is it the
> documents parquet-format, or is it the implementation in parquet-mr? These
> differ sometimes, so which one should I adhere to if they do?
>
>
> Thanks in advance for any answers.
> Cheers,
> Jan
>
