[
https://issues.apache.org/jira/browse/ARROW-11399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joris Van den Bossche updated ARROW-11399:
------------------------------------------
Description:
I ran into this, and find it confusing:
{code}
In [1]: import pyarrow as pa
In [2]: import pyarrow.parquet as pq
In [3]: table = pa.table({'a': pa.array([1, 2], pa.timestamp("ms")), 'b': pa.array([1, 2], pa.timestamp("ms", tz="UTC"))})
In [4]: pq.write_table(table, "test_parquet_schema.parquet")
In [5]: pq.read_metadata("test_parquet_schema.parquet").schema.column(0)
Out[5]:
<ParquetColumnSchema>
name: a
path: a
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds,
is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): NONE
In [6]: pq.read_metadata("test_parquet_schema.parquet").schema.column(1)
Out[6]:
<ParquetColumnSchema>
name: b
path: b
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=milliseconds,
is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): TIMESTAMP_MILLIS
{code}
So it "seems" that the Parquet file has the legacy ConvertedType set only for
the second column, and not for the first.
However, I am quite sure it is set for both columns: that was the outcome of
the discussion at the time of pyarrow 0.14 (ARROW-5878,
https://github.com/apache/arrow/pull/4825), and it can also be shown by
reading the Parquet schema with an older version of pyarrow that does not
support logical types:
{code}
In [1]: import pyarrow as pa; import pyarrow.parquet as pq
In [2]: pa.__version__
Out[2]: '0.13.0'
In [4]: pq.read_metadata("test_parquet_schema.parquet").schema
Out[4]:
<pyarrow._parquet.ParquetSchema object at 0x7f67d407fe50>
a: INT64 TIMESTAMP_MILLIS
b: INT64 TIMESTAMP_MILLIS
{code}
I understand that when _reading_ the schema in a recent version of pyarrow,
the ConvertedType information is no longer needed to read the data correctly.
But reporting NONE, which suggests that the ConvertedType is not present in
the Parquet schema, is quite confusing (especially when checking files for
forward/backward compatibility behaviour).
cc [~tpboudreau]
was:
I ran into this, and find it rather confusing:
{code}
In [1]: import pyarrow.parquet as pq
In [3]: table = pa.table({'a': pa.array([1, 2], pa.timestamp("ms")), 'b':
pa.array([1, 2], pa.timestamp("ms", tz="UTC"))})
In [4]: pq.write_table(table, "test_parquet_schema.parquet")
In [5]: pq.read_metadata("test_parquet_schema.parquet").schema.column(0)
Out[5]:
<ParquetColumnSchema>
name: a
path: a
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds,
is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): NONE
In [6]: pq.read_metadata("test_parquet_schema.parquet").schema.column(1)
Out[6]:
<ParquetColumnSchema>
name: b
path: b
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=milliseconds,
is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): TIMESTAMP_MILLIS
{code}
So it "seems" that the parquet file has the legacy ConvertedType only set for
the second column, and not the first.
However, I am quite sure it sets it for both. Because that was the result of
the discussion about this at the time of pyarrow 0.14 (ARROW-5878,
https://github.com/apache/arrow/pull/4825), and can also be shown by reading
the parquet schema with an older version of pyarrow that doesn't support
logical types:
{code}
In [1]: import pyarrow.parquet as pq
In [2]: pa.__version__
Out[2]: '0.13.0'
In [4]: pq.read_metadata("test_parquet_schema.parquet").schema
Out[4]:
<pyarrow._parquet.ParquetSchema object at 0x7f67d407fe50>
a: INT64 TIMESTAMP_MILLIS
b: INT64 TIMESTAMP_MILLIS
{code}
I understand that when _reading_ the schema in a recent version of pyarrow, we
don't need the ConvertedType information anymore for proper reading of the
data, but seemingly indicating that the ConvertedType is not present in the
parquet schema is quite confusing (certainly if checking files for
forward/backward compatibility behaviour).
cc [~tpboudreau]
> [C++][Parquet] Timestamp ColumnDescriptor (from logical type) incorrectly
> showing ConvertedType as NONE
> -------------------------------------------------------------------------------------------------------
>
> Key: ARROW-11399
> URL: https://issues.apache.org/jira/browse/ARROW-11399
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Joris Van den Bossche
> Priority: Major
> Labels: parquet
>