[
https://issues.apache.org/jira/browse/ARROW-15492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490684#comment-17490684
]
nero commented on ARROW-15492:
------------------------------
I read the parquet-format documentation again and found that I had
misunderstood the isAdjustedToUTC flag.
The Parquet format has only eight physical types (BOOLEAN, INT32, INT64, INT96,
FLOAT, DOUBLE, BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY).
{quote}Logical types are used to extend the types that parquet can be used to
store, by specifying how the primitive types should be interpreted
{quote}
The isAdjustedToUTC flag exists only on the TIMESTAMP logical type, which is
stored in the INT64 physical type. So INT96 (deprecated in [PARQUET-323] INT96
should be marked as deprecated) cannot carry time zone information via this
flag.
When Hive reads a Parquet column stored as INT96, it looks at a table property
(or falls back to the local time zone if the property is absent) to adjust the
time zone.
{quote} * Hive will read Parquet MR int96 timestamp data and adjust values
using a time zone from a table property, if set, or using the local time zone
if it is absent. No adjustment will be applied to data written by Impala.{quote}
Hive:
* [HIVE-12767] Implement table property to address Parquet int96 timestamp bug
- ASF JIRA (apache.org)
Spark:
* [SPARK-12297] Add work-around for Parquet/Hive int96 timestamp bug. - ASF
JIRA (apache.org)
So I'm not sure whether Arrow should adjust INT96 data stored in Parquet files
using the local time zone.
> [Python] handle timestamp type in parquet file for compatibility with older
> HiveQL
> ----------------------------------------------------------------------------------
>
> Key: ARROW-15492
> URL: https://issues.apache.org/jira/browse/ARROW-15492
> Project: Apache Arrow
> Issue Type: New Feature
> Affects Versions: 6.0.1
> Reporter: nero
> Priority: Major
>
> Hi there,
> I ran into an issue when writing a Parquet file with PyArrow.
> Older versions of Hive can only recognize the timestamp type when it is stored
> as INT96, so I use write_to_dataset with the
> `use_deprecated_int96_timestamps=True` option to save the Parquet file. But
> HiveQL skips the timestamp conversion when the Parquet file's metadata shows
> it was not created_by "parquet-mr".
> [hive/ParquetRecordReaderBase.java at
> f1ff99636a5546231336208a300a114bcf8c5944 · apache/hive
> (github.com)|https://github.com/apache/hive/blob/f1ff99636a5546231336208a300a114bcf8c5944/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java#L137-L139]
>
> So I have to save the timestamp columns with time zone info (adjusted to UTC+8).
> But when pyarrow.parquet reads from a directory that contains Parquet files
> created by both PyArrow and parquet-mr, the resulting Arrow table ignores the
> time zone info for the parquet-mr files.
>
> Maybe PyArrow could expose the created_by option ({*}preferred{*};
> parquet::WriterProperties::created_by is already available in the C++ API), or
> handle the timestamp type with time zone for files created by parquet-mr?
>
> Maybe related to https://issues.apache.org/jira/browse/ARROW-14422
--
This message was sent by Atlassian Jira
(v8.20.1#820001)