On Tue, 15 Dec 2020 at 17:24, Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:
> Currently, when writing parquet files with Arrow (parquet-cpp), we default
> to parquet format "1.0".
> This means we don't use ConvertedTypes or LogicalTypes introduced in
> format "2.0"+.
> For example, this means we can (by default) only write int32 and int64
> integer types, and not any of the other signed and unsigned integer types.

I have to correct myself here. Apparently, we DO write version 2.0
Converted/LogicalTypes by default with `version="1.0"`, just not for
integer types, so I got a wrong impression by only looking at ints when
writing the previous email. A small example using Python is at the bottom.

From that, it seems that we do in fact write Converted/LogicalType
annotations, even with `version="1.0"`, for types like Date and Decimal,
which were only added in format version 2.x.

I assume the reason we do that for those types, and not for the integer
types, is that Date (as int32) or Decimal (as fixed-size binary) can also
be read "correctly" as the physical type (only with information loss),
while e.g. the UINT_32 converted type is written as the INT32 physical
type, so not interpreting the type annotation would lead to wrong numbers
(see the small illustration at the very bottom).

import pyarrow as pa
import pyarrow.parquet as pq
import decimal

table = pa.table({
    "int64": [1, 2, 3],
    "int32": pa.array([1, 2, 3], type="int32"),
    "uint32": pa.array([1, 2, 3], type="uint32"),
    "date": pa.array([1, 2, 3], type=pa.date32()),
    "timestamp": pa.array([1, 2, 3], type=pa.timestamp("ms", tz="UTC")),
    "string": pa.array(["a", "b", "c"]),
    "list": pa.array([[1, 2], [3, 4], [5, 6]]),
    "decimal": pa.array([decimal.Decimal("1.0"), decimal.Decimal("2.0"),
                         decimal.Decimal("3.0")]),
})

pq.write_table(table, "test_v1.parquet", version="1.0")
pq.write_table(table, "test_v2.parquet", version="2.0")

In [55]: pq.read_metadata("test_v1.parquet").schema
Out[55]:
<pyarrow._parquet.ParquetSchema object at 0x7f839b7602c0>
required group field_id=0 schema {
  optional int64 field_id=1 int64;
  optional int32 field_id=2 int32;
  optional int64 field_id=3 uint32;
  optional int32 field_id=4 date (Date);
  optional int64 field_id=5 timestamp (Timestamp(isAdjustedToUTC=true, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional binary field_id=6 string (String);
  optional fixed_len_byte_array(1) field_id=7 decimal (Decimal(precision=2, scale=1));
}

In [56]: pq.read_metadata("test_v2.parquet").schema
Out[56]:
<pyarrow._parquet.ParquetSchema object at 0x7f839bc43f80>
required group field_id=0 schema {
  optional int64 field_id=1 int64;
  optional int32 field_id=2 int32;
  optional int32 field_id=3 uint32 (Int(bitWidth=32, isSigned=false));
  optional int32 field_id=4 date (Date);
  optional int64 field_id=5 timestamp (Timestamp(isAdjustedToUTC=true, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional binary field_id=6 string (String);
  optional fixed_len_byte_array(1) field_id=7 decimal (Decimal(precision=2, scale=1));
}
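As a small illustration of that last point (a minimal sketch of the
reinterpretation issue in plain Python, not of anything parquet-cpp does
internally): reading the raw bytes of a large uint32 value as a signed
int32 yields a wrong number, while reading a date32 as its physical int32
still yields the correct days-since-epoch count, only without the date
meaning.

import struct
import datetime

# A uint32 value above INT32_MAX: a reader that ignores the UINT_32
# annotation and treats the 4 bytes as a signed int32 gets a wrong number.
value = 3_000_000_000
raw = struct.pack("<I", value)          # little-endian unsigned 32-bit
misread = struct.unpack("<i", raw)[0]   # reinterpret as signed 32-bit
print(misread)  # -1294967296

# A date32 value read as its physical int32 is still the correct number
# (days since the Unix epoch); only the Date interpretation is lost.
days = 3
print(datetime.date(1970, 1, 1) + datetime.timedelta(days=days))  # 1970-01-04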