Joris Van den Bossche created ARROW-11399:
---------------------------------------------

             Summary: [C++][Parquet] Timestamp ColumnDescriptor (from logical type) incorrectly showing ConvertedType as NONE
                 Key: ARROW-11399
                 URL: https://issues.apache.org/jira/browse/ARROW-11399
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Joris Van den Bossche


I ran into this, and find it rather confusing:

{code}
In [1]: import pyarrow.parquet as pq

In [2]: import pyarrow as pa

In [3]: table = pa.table({'a': pa.array([1, 2], pa.timestamp("ms")), 'b': pa.array([1, 2], pa.timestamp("ms", tz="UTC"))})

In [4]: pq.write_table(table, "test_parquet_schema.parquet")

In [5]: pq.read_metadata("test_parquet_schema.parquet").schema.column(0)
Out[5]: 
<ParquetColumnSchema>
  name: a
  path: a
  max_definition_level: 1
  max_repetition_level: 0
  physical_type: INT64
  logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false)
  converted_type (legacy): NONE

In [6]: pq.read_metadata("test_parquet_schema.parquet").schema.column(1)
Out[6]: 
<ParquetColumnSchema>
  name: b
  path: b
  max_definition_level: 1
  max_repetition_level: 0
  physical_type: INT64
  logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false)
  converted_type (legacy): TIMESTAMP_MILLIS
{code}

So it "seems" that the Parquet file has the legacy ConvertedType set only for 
the second column, and not for the first.

However, I am quite sure it is set for both columns. That was the outcome of 
the discussion about this at the time of pyarrow 0.14 (ARROW-5878, 
https://github.com/apache/arrow/pull/4825), and it can also be shown by 
reading the Parquet schema with an older version of pyarrow that doesn't 
support logical types:

{code}
In [1]: import pyarrow as pa, pyarrow.parquet as pq

In [2]: pa.__version__
Out[2]: '0.13.0'

In [4]: pq.read_metadata("test_parquet_schema.parquet").schema
Out[4]: 
<pyarrow._parquet.ParquetSchema object at 0x7f67d407fe50>
a: INT64 TIMESTAMP_MILLIS
b: INT64 TIMESTAMP_MILLIS
{code}

I understand that when _reading_ the schema in a recent version of pyarrow, we 
don't need the ConvertedType information anymore to read the data properly. 
But reporting NONE, which suggests that the ConvertedType is not present in 
the Parquet schema, is quite confusing (especially when checking files for 
forward/backward compatibility behaviour).
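
When checking several files or columns for this, a small helper can make the comparison less manual. The following is only a sketch: the function names are made up for illustration, and it assumes a metadata object like the one returned by pq.read_metadata(), using only num_columns and the column attributes visible in the output above (name, logical_type, converted_type). Note that it reports what pyarrow displays, which (per this issue) may differ from what is actually stored in the file.

```python
def converted_type_report(metadata):
    """Map each column name to (logical type as text, reported converted type).

    `metadata` is assumed to behave like the result of
    pyarrow.parquet.read_metadata(): it needs `num_columns` and
    `schema.column(i)` with `name`, `logical_type` and `converted_type`.
    """
    report = {}
    for i in range(metadata.num_columns):
        col = metadata.schema.column(i)
        report[col.name] = (str(col.logical_type), str(col.converted_type))
    return report


def columns_reporting_no_converted_type(metadata):
    """Names of columns whose reported legacy ConvertedType is NONE."""
    return [
        name
        for name, (_, converted) in converted_type_report(metadata).items()
        if converted == "NONE"
    ]
```

With pyarrow installed, calling columns_reporting_no_converted_type(pq.read_metadata("test_parquet_schema.parquet")) on a version showing the behaviour above should list column a, since that is the column displayed with NONE.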

cc [~tpboudreau]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
