Should we default to write parquet format version 2.0? (not data page version 2.0)

Joris Van den Bossche Tue, 15 Dec 2020 08:24:23 -0800

Hi all,

(Somewhat related to the discussion on the parquet mailing list about the
compatibility of different features in the format and which format version
they belong to, which triggered
https://github.com/apache/parquet-format/pull/164. But mainly related in
the sense that it is rather unclear to me which features are enabled when,
including for the column types.)


Currently, when writing parquet files with Arrow (parquet-cpp), we default
to parquet format "1.0".
This means we don't use ConvertedTypes or LogicalTypes introduced in format
"2.0"+.
For example, this means we can (by default) only write int32 and int64
integer types, and not any of the other signed and unsigned integer types.
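
To illustrate what such a type annotation buys us, here is a minimal sketch
in plain Python (not parquet-cpp code; the helper names are made up for
illustration). As I understand the format, an annotation such as UINT_32
sits on top of the INT32 physical type: the stored bytes are the same, and
only the annotation tells a reader to interpret them as unsigned.

```python
import struct


def read_as_int32(unsigned_value):
    """Encode an unsigned 32-bit value and read the same 4 bytes back as signed."""
    raw = struct.pack("<I", unsigned_value)  # 4 little-endian bytes, unsigned
    return struct.unpack("<i", raw)[0]       # same bytes, signed interpretation


def read_as_uint32(signed_value):
    """Inverse of the above: what a reader that honours the annotation recovers."""
    raw = struct.pack("<i", signed_value)
    return struct.unpack("<I", raw)[0]


if __name__ == "__main__":
    value = 3_000_000_000  # does not fit in a signed 32-bit integer
    stored = read_as_int32(value)
    print(stored)                  # what a reader unaware of the annotation sees
    print(read_as_uint32(stored))  # what an annotation-aware reader sees
```

Without the annotation, a writer has to fall back to a wider physical type
to represent such values losslessly.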

But most of the additional ConvertedTypes that were not present in
parquet-format 1.0 (e.g. the different signed/unsigned integer types,
timestamp, ...) were introduced in parquet-format 2.2 (
https://github.com/apache/parquet-format/pull/3,
https://issues.apache.org/jira/browse/PARQUET-12) almost 7 years ago.
The LogicalTypes are certainly more recent. But with the current options of
"1.0" or "2.0" when writing, you can only choose between all new features
of the different 2.x releases (both ConvertedTypes and LogicalTypes) or
none of them (e.g. we don't have the option to say `version="2.2"`).

When we say "version 2.0 is not yet recommended for production use"
(because many other readers are not yet compatible with it), isn't this
mostly about the *data page* version 2 (which, AFAIU, is separate from the
format version 2.0)?
If so, could we start defaulting to format version 2.0 (but still with data
page version 1.0), or do other parquet readers actually not yet support the
ConvertedTypes introduced 7 years ago?

Best,
Joris
