[jira] [Updated] (ARROW-11399) [C++][Parquet] Timestamp ColumnDescriptor (from logical type) incorrectly showing ConvertedType as NONE

Joris Van den Bossche (Jira) Wed, 27 Jan 2021 05:29:06 -0800


     [ https://issues.apache.org/jira/browse/ARROW-11399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Joris Van den Bossche updated ARROW-11399:
------------------------------------------
    Description: 
I ran into this, and find it confusing:

{code}
In [1]: import pyarrow.parquet as pq

In [2]: import pyarrow as pa

In [3]: table = pa.table({'a': pa.array([1, 2], pa.timestamp("ms")), 'b': pa.array([1, 2], pa.timestamp("ms", tz="UTC"))})

In [4]: pq.write_table(table, "test_parquet_schema.parquet")

In [5]: pq.read_metadata("test_parquet_schema.parquet").schema.column(0)
Out[5]: 
<ParquetColumnSchema>
  name: a
  path: a
  max_definition_level: 1
  max_repetition_level: 0
  physical_type: INT64
  logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false)
  converted_type (legacy): NONE

In [6]: pq.read_metadata("test_parquet_schema.parquet").schema.column(1)
Out[6]: 
<ParquetColumnSchema>
  name: b
  path: b
  max_definition_level: 1
  max_repetition_level: 0
  physical_type: INT64
  logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false)
  converted_type (legacy): TIMESTAMP_MILLIS
{code}

So it "seems" that the parquet file has the legacy ConvertedType set only for 
the second column (timezone-aware), and not for the first (timezone-naive). 

However, I am quite sure it is set for both columns, because that was the 
outcome of the discussion about this at the time of pyarrow 0.14 (ARROW-5878, 
https://github.com/apache/arrow/pull/4825). Initially we only set the 
ConvertedType for tz-aware data, but after discussion that was changed to set 
it for both; see also the update to the parquet thrift at PARQUET-1627. This 
can also be shown by reading the parquet schema with an older version of 
pyarrow that doesn't support logical types:

{code}
In [1]: import pyarrow as pa

In [2]: pa.__version__
Out[2]: '0.13.0'

In [3]: import pyarrow.parquet as pq

In [4]: pq.read_metadata("test_parquet_schema.parquet").schema
Out[4]: 
<pyarrow._parquet.ParquetSchema object at 0x7f67d407fe50>
a: INT64 TIMESTAMP_MILLIS
b: INT64 TIMESTAMP_MILLIS
{code}

I understand that a recent version of pyarrow no longer needs the 
ConvertedType information to _read_ the data properly, but reporting the 
ConvertedType as NONE when it is in fact present in the parquet schema is 
quite confusing (certainly if checking files for forward/backward 
compatibility behaviour).
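
For such a compatibility check, the expectation described above can be written 
down as a small lookup. This is only a minimal sketch, not pyarrow's actual 
implementation: the helper name expected_converted_type and its string-based 
arguments are hypothetical, and it assumes the writer behaviour described in 
this issue (the legacy annotation is stored for both tz-naive and tz-aware 
timestamps).

```python
from typing import Optional

# Legacy ConvertedType names that exist for timestamps in the Parquet format.
# There is no legacy annotation for nanosecond timestamps.
_LEGACY_BY_UNIT = {
    "milliseconds": "TIMESTAMP_MILLIS",
    "microseconds": "TIMESTAMP_MICROS",
}


def expected_converted_type(time_unit: str, is_adjusted_to_utc: bool) -> Optional[str]:
    """Hypothetical helper: legacy ConvertedType a writer is expected to store.

    Assumes the behaviour described in this issue (ARROW-5878 / PARQUET-1627):
    the legacy annotation is written for both tz-naive and tz-aware columns,
    so is_adjusted_to_utc deliberately does not affect the result.
    """
    return _LEGACY_BY_UNIT.get(time_unit)


# Columns 'a' (tz-naive) and 'b' (tz-aware) from the example above.
for name, utc in [("a", False), ("b", True)]:
    print(name, expected_converted_type("milliseconds", utc))
```

Under that assumption both columns yield the same legacy annotation, which is 
what the schema display would be expected to report for column 'a' as well.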

cc [~tpboudreau]

  was:
I ran into this, and find it confusing:

{code}
In [1]: import pyarrow.parquet as pq

In [3]: table = pa.table({'a': pa.array([1, 2], pa.timestamp("ms")), 'b': 
pa.array([1, 2], pa.timestamp("ms", tz="UTC"))})

In [4]: pq.write_table(table, "test_parquet_schema.parquet")

In [5]: pq.read_metadata("test_parquet_schema.parquet").schema.column(0)
Out[5]: 
<ParquetColumnSchema>
  name: a
  path: a
  max_definition_level: 1
  max_repetition_level: 0
  physical_type: INT64
  logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds, 
is_from_converted_type=false, force_set_converted_type=false)
  converted_type (legacy): NONE

In [6]: pq.read_metadata("test_parquet_schema.parquet").schema.column(1)
Out[6]: 
<ParquetColumnSchema>
  name: b
  path: b
  max_definition_level: 1
  max_repetition_level: 0
  physical_type: INT64
  logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=milliseconds, 
is_from_converted_type=false, force_set_converted_type=false)
  converted_type (legacy): TIMESTAMP_MILLIS
{code}

So it "seems" that the parquet file has the legacy ConvertedType only set for 
the second column, and not the first. 

However, I am quite sure it sets it for both. Because that was the result of 
the discussion about this at the time of pyarrow 0.14 (ARROW-5878, 
https://github.com/apache/arrow/pull/4825), and can also be shown by reading 
the parquet schema with an older version of pyarrow that doesn't support 
logical types:

{code}
In [1]: import pyarrow.parquet as pq

In [2]: pa.__version__
Out[2]: '0.13.0'

In [4]: pq.read_metadata("test_parquet_schema.parquet").schema
Out[4]: 
<pyarrow._parquet.ParquetSchema object at 0x7f67d407fe50>
a: INT64 TIMESTAMP_MILLIS
b: INT64 TIMESTAMP_MILLIS
{code}

I understand that when _reading_ the schema in a recent version of pyarrow, we 
don't need the ConvertedType information anymore for proper reading of the 
data, but seemingly indicating that the ConvertedType is not present in the 
parquet schema is quite confusing (certainly if checking files for 
forward/backward compatibility behaviour).

cc [~tpboudreau]


> [C++][Parquet] Timestamp ColumnDescriptor (from logical type) incorrectly 
> showing ConvertedType as NONE
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-11399
>                 URL: https://issues.apache.org/jira/browse/ARROW-11399
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: parquet



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
