On Tue, 15 Dec 2020 at 17:24, Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:
> Currently, when writing parquet files with Arrow (parquet-cpp), we default
> to parquet format "1.0".
> This means we don't use ConvertedTypes or LogicalTypes introduced in
> format "2.0"+.
> For example, this means we can (by default) only write int32 and int64
> integer types, and not any of the other signed and unsigned integer types.

I have to correct myself here. Apparently, we DO write version 2.0
Converted/LogicalTypes by default with `version="1.0"`, just not for
integer types, so I got a wrong impression by only looking at ints when
writing the previous email. A small example using Python is at the bottom.

From that, it seems that we do in fact write Converted/LogicalType
annotations, even with `version="1.0"`, for types like Date and Decimal,
which were only added in format version 2.x.

I assume the reason we do that for those types, and not for the integer
types, is that Date (as int32) or Decimal (as fixed-size binary) can also
be read "correctly" as the physical type (only with information loss),
while e.g. the UINT_32 converted type is written as the INT32 physical
type, so not interpreting the type annotation would lead to wrong numbers
(see the small illustration at the very bottom).

import pyarrow as pa
import pyarrow.parquet as pq
import decimal

table = pa.table({
    "int64": [1, 2, 3],
    "int32": pa.array([1, 2, 3], type="int32"),
    "uint32": pa.array([1, 2, 3], type="uint32"),
    "date": pa.array([1, 2, 3], type=pa.date32()),
    "timestamp": pa.array([1, 2, 3], type=pa.timestamp("ms", tz="UTC")),
    "string": pa.array(["a", "b", "c"]),
    "list": pa.array([[1, 2], [3, 4], [5, 6]]),
    "decimal": pa.array([decimal.Decimal("1.0"), decimal.Decimal("2.0"),
                         decimal.Decimal("3.0")]),
})

pq.write_table(table, "test_v1.parquet", version="1.0")
pq.write_table(table, "test_v2.parquet", version="2.0")

In [55]: pq.read_metadata("test_v1.parquet").schema
Out[55]:
<pyarrow._parquet.ParquetSchema object at 0x7f839b7602c0>
required group field_id=0 schema {
  optional int64 field_id=1 int64;
  optional int32 field_id=2 int32;
  optional int64 field_id=3 uint32;
  optional int32 field_id=4 date (Date);
  optional int64 field_id=5 timestamp (Timestamp(isAdjustedToUTC=true, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional binary field_id=6 string (String);
  optional fixed_len_byte_array(1) field_id=7 decimal (Decimal(precision=2, scale=1));
}

In [56]: pq.read_metadata("test_v2.parquet").schema
Out[56]:
<pyarrow._parquet.ParquetSchema object at 0x7f839bc43f80>
required group field_id=0 schema {
  optional int64 field_id=1 int64;
  optional int32 field_id=2 int32;
  optional int32 field_id=3 uint32 (Int(bitWidth=32, isSigned=false));
  optional int32 field_id=4 date (Date);
  optional int64 field_id=5 timestamp (Timestamp(isAdjustedToUTC=true, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional binary field_id=6 string (String);
  optional fixed_len_byte_array(1) field_id=7 decimal (Decimal(precision=2, scale=1));
}
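As a small illustration of that last point (a minimal sketch of the
reinterpretation issue in plain Python, not of anything parquet-cpp does
internally): reading the raw bytes of a large uint32 value as a signed
int32 yields a wrong number, while reading a date32 as its physical int32
still yields the correct days-since-epoch count, only without the date
meaning.

import struct
import datetime

# A uint32 value above INT32_MAX: a reader that ignores the UINT_32
# annotation and treats the 4 bytes as a signed int32 gets a wrong number.
value = 3_000_000_000
raw = struct.pack("<I", value)          # little-endian unsigned 32-bit
misread = struct.unpack("<i", raw)[0]   # reinterpret as signed 32-bit
print(misread)  # -1294967296

# A date32 value read as its physical int32 is still the correct number
# (days since the Unix epoch); only the Date interpretation is lost.
days = 3
print(datetime.date(1970, 1, 1) + datetime.timedelta(days=days))  # 1970-01-04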