[jira] [Updated] (ARROW-11399) [C++][Parquet] Timestamp ColumnDescriptor (from logical type) incorrectly showing ConvertedType as NONE

Joris Van den Bossche (Jira) Wed, 27 Jan 2021 05:27:05 -0800


     [ 
https://issues.apache.org/jira/browse/ARROW-11399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Joris Van den Bossche updated ARROW-11399:
------------------------------------------
    Description: 
I ran into this, and find it confusing:

{code}
In [1]: import pyarrow.parquet as pq

In [2]: import pyarrow as pa

In [3]: table = pa.table({'a': pa.array([1, 2], pa.timestamp("ms")), 'b': pa.array([1, 2], pa.timestamp("ms", tz="UTC"))})

In [4]: pq.write_table(table, "test_parquet_schema.parquet")

In [5]: pq.read_metadata("test_parquet_schema.parquet").schema.column(0)
Out[5]: 
<ParquetColumnSchema>
  name: a
  path: a
  max_definition_level: 1
  max_repetition_level: 0
  physical_type: INT64
  logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false)
  converted_type (legacy): NONE

In [6]: pq.read_metadata("test_parquet_schema.parquet").schema.column(1)
Out[6]: 
<ParquetColumnSchema>
  name: b
  path: b
  max_definition_level: 1
  max_repetition_level: 0
  physical_type: INT64
  logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false)
  converted_type (legacy): TIMESTAMP_MILLIS
{code}

So it "seems" that the parquet file has the legacy ConvertedType set only for 
the second column, and not for the first. 

However, I am quite sure it is set for both: that was the outcome of the 
discussion about this at the time of pyarrow 0.14 (ARROW-5878, 
https://github.com/apache/arrow/pull/4825), and it can also be shown by 
reading the parquet schema with an older version of pyarrow that doesn't 
support logical types:

{code}
In [1]: import pyarrow.parquet as pq

In [2]: pa.__version__
Out[2]: '0.13.0'

In [4]: pq.read_metadata("test_parquet_schema.parquet").schema
Out[4]: 
<pyarrow._parquet.ParquetSchema object at 0x7f67d407fe50>
a: INT64 TIMESTAMP_MILLIS
b: INT64 TIMESTAMP_MILLIS
{code}

I understand that when _reading_ the schema with a recent version of pyarrow, 
we no longer need the ConvertedType information to read the data properly, but 
seemingly indicating that the ConvertedType is not present in the parquet 
schema is quite confusing (certainly when checking files for forward/backward 
compatibility behaviour).

cc [~tpboudreau]

> [C++][Parquet] Timestamp ColumnDescriptor (from logical type) incorrectly 
> showing ConvertedType as NONE
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-11399
>                 URL: https://issues.apache.org/jira/browse/ARROW-11399
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: parquet



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
