satish created PARQUET-1883:
-------------------------------

             Summary: int96 support in parquet-avro
                 Key: PARQUET-1883
                 URL: https://issues.apache.org/jira/browse/PARQUET-1883
             Project: Parquet
          Issue Type: Bug
          Components: parquet-avro
    Affects Versions: 1.10.1
            Reporter: satish


Hi

It looks like 'timestamp' is being converted to the 'int64' primitive type in 
parquet-avro. This is incompatible with Hive 2; Hive throws the error below:

{code:java}
Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: 
java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast 
to org.apache.hadoop.hive.serde2.io.TimestampWritable (state=,code=0)
{code}
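
To make this concrete, here is a minimal sketch (class name and setup are mine; it assumes parquet-avro 1.10.x and Avro 1.8+ on the classpath) showing that an Avro timestamp-millis field converts to an INT64 Parquet primitive:

{code:java}
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.parquet.avro.AvroSchemaConverter;
import org.apache.parquet.schema.MessageType;

public class TimestampConversionDemo {
  public static void main(String[] args) {
    // Avro long annotated with the timestamp-millis logical type
    Schema ts = LogicalTypes.timestampMillis()
        .addToSchema(Schema.create(Schema.Type.LONG));
    Schema record = SchemaBuilder.record("r").fields()
        .name("ts").type(ts).noDefault()
        .endRecord();

    // parquet-avro maps this to an INT64 primitive, never INT96.
    // Expected to print something like:
    //   message r { required int64 ts (TIMESTAMP_MILLIS); }
    MessageType parquetSchema = new AvroSchemaConverter().convert(record);
    System.out.println(parquetSchema);
  }
}
{code}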


What does it take to write the timestamp field as 'int96'?

Hive itself writes the timestamp field as int96. See the example below:

{code:java}
$ hadoop jar parquet-tools-1.9.0.jar meta hdfs://timestamp_test/000000_0
creator:     parquet-mr version 1.10.6 (build 098c6199a821edd3d6af56b962fd0f1558af849b)

file schema: hive_schema
--------------------------------------------------------------------------------
ts:          OPTIONAL INT96 R:0 D:1

row group 1: RC:4 TS:88 OFFSET:4
--------------------------------------------------------------------------------
ts:           INT96 UNCOMPRESSED DO:0 FPO:4 SZ:88/88/1.00 VC:4 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
{code}
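
For context, the INT96 shown in that footer is the 12-byte Hive/Impala timestamp encoding: 8 bytes of little-endian nanoseconds-of-day followed by a 4-byte little-endian Julian day. Here is a hypothetical helper (my own sketch, not an existing parquet-avro API, and it ignores the UTC-vs-local-time adjustment questions that come with INT96) showing what parquet-avro would need to emit:

{code:java}
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.temporal.JulianFields;

public class Int96Timestamp {
  // Encode an instant in the 12-byte INT96 layout Hive reads:
  // bytes 0-7 = nanoseconds within the day, bytes 8-11 = Julian day
  // (both little-endian).
  public static byte[] toInt96(Instant ts) {
    long julianDay = ts.atZone(ZoneOffset.UTC).toLocalDate()
        .getLong(JulianFields.JULIAN_DAY);
    long nanosOfDay = ts.atZone(ZoneOffset.UTC).toLocalTime().toNanoOfDay();
    return ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN)
        .putLong(nanosOfDay)
        .putInt((int) julianDay)
        .array();
  }
}
{code}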

Writing a Spark dataframe to Parquet (without going through Avro) also produces int96:


{code:java}
scala> testDS.printSchema()
root
 |-- ts: timestamp (nullable = true)

scala> testDS.write.mode(SaveMode.Overwrite).save("/tmp/x")

$ parquet-tools meta /tmp/x/part-00000-99720ebd-0aea-45ac-9b8c-0eb7ad6f4e3c-c000.gz.parquet
file:        file:/tmp/x/part-00000-99720ebd-0aea-45ac-9b8c-0eb7ad6f4e3c-c000.gz.parquet
creator:     parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1)
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"ts","type":"timestamp","nullable":true,"metadata":{}}]}

file schema: spark_schema
--------------------------------------------------------------------------------
ts:          OPTIONAL INT96 R:0 D:1

row group 1: RC:4 TS:93 OFFSET:4
--------------------------------------------------------------------------------
ts:           INT96 GZIP DO:0 FPO:4 SZ:130/93/0.72 VC:4 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED ST:[no stats for this column]
{code}
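
(As an aside, and assuming Spark 2.3+: Spark even exposes the choice of Parquet timestamp representation as a config, with INT96 as the default, which matches the footer above; parquet-avro has no equivalent knob, which is exactly this issue. With an existing SparkSession named 'spark':)

{code:java}
// Default is INT96; TIMESTAMP_MICROS / TIMESTAMP_MILLIS are the int64 variants.
spark.conf().set("spark.sql.parquet.outputTimestampType", "INT96");
{code}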



I saw some explanation from [~gszadovszky] for deprecating int96 support 
[here|https://issues.apache.org/jira/browse/PARQUET-1870?focusedCommentId=17127963&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17127963]. 
But given that Hive and the serialization in other Parquet modules (non-Avro) 
support int96, I'm trying to understand the reasoning for not implementing it 
in parquet-avro.

A bit more context: we are trying to migrate some of our data to the 
[Hudi format|https://hudi.apache.org/]. Hudi adds a lot of efficiency for our 
use cases. But Hudi writes data through parquet-avro, so timestamps are 
converted to int64, which, as mentioned above, breaks compatibility with Hive. 
A lot of columns in our tables are declared as 'timestamp' in the Hive DDL, 
and changing the DDL to a long type is practically impossible given the 
number of tables and columns.

We are happy to contribute if there is a clear path forward to supporting 
int96 in parquet-avro. Please also let me know if you are aware of a 
workaround in Hive that can read int64 correctly as a timestamp.


