[jira] [Commented] (PARQUET-1883) int96 support in parquet-avro

2020-10-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216489#comment-17216489
 ] 

ASF GitHub Bot commented on PARQUET-1883:
-----------------------------------------

anantdamle closed pull request #821:
URL: https://github.com/apache/parquet-mr/pull/821


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> int96 support in parquet-avro
> -----------------------------
>
>                 Key: PARQUET-1883
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1883
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-avro
>    Affects Versions: 1.10.1
>            Reporter: satish
>            Priority: Major
>
> Hi
> It looks like 'timestamp' is being converted to the 'int64' primitive type in 
> parquet-avro. This is incompatible with Hive 2, which throws the error below:
> {code:java}
> Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable (state=,code=0)
> {code}
> What does it take to write a timestamp field as 'int96'? 
> Hive seems to write timestamp fields as int96; see the example below.
> {code:java}
> $ hadoop jar parquet-tools-1.9.0.jar meta hdfs://timestamp_test/00_0
> creator: parquet-mr version 1.10.6 (build 098c6199a821edd3d6af56b962fd0f1558af849b)
> file schema: hive_schema
> --------------------------------------------------------------------------------
> ts:  OPTIONAL INT96 R:0 D:1
> row group 1: RC:4 TS:88 OFFSET:4
> --------------------------------------------------------------------------------
> ts:   INT96 UNCOMPRESSED DO:0 FPO:4 SZ:88/88/1.00 VC:4 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
> {code}
> Writing a Spark dataframe to parquet (without using Avro) also uses int96.
> {code:java}
> scala> testDS.printSchema()
> root
>  |-- ts: timestamp (nullable = true)
> scala> testDS.write.mode(Overwrite).save("/tmp/x");
> $ parquet-tools meta /tmp/x/part-0-99720ebd-0aea-45ac-9b8c-0eb7ad6f4e3c-c000.gz.parquet 
> file: file:/tmp/x/part-0-99720ebd-0aea-45ac-9b8c-0eb7ad6f4e3c-c000.gz.parquet 
> creator: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1) 
> extra:   org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"ts","type":"timestamp","nullable":true,"metadata":{}}]}
> file schema: spark_schema 
> --------------------------------------------------------------------------------
> ts:  OPTIONAL INT96 R:0 D:1
> row group 1: RC:4 TS:93 OFFSET:4 
> --------------------------------------------------------------------------------
> ts:   INT96 GZIP DO:0 FPO:4 SZ:130/93/0.72 VC:4 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED ST:[no stats for this column]
> {code}
> I saw some explanation for deprecating int96 [support here|https://issues.apache.org/jira/browse/PARQUET-1870?focusedCommentId=17127963&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17127963] 
> from [~gszadovszky]. But given that Hive and the serialization in other 
> (non-Avro) Parquet modules support int96, I'm trying to understand the 
> reasoning for not implementing it in parquet-avro.
> A bit more context: we are trying to migrate some of our data to the [hudi 
> format|https://hudi.apache.org/]. Hudi adds a lot of efficiency for our use 
> cases. But when we write data using Hudi, Hudi uses parquet-avro and the 
> timestamp is converted to int64. As mentioned earlier, this breaks 
> compatibility with Hive. A lot of columns in our tables have 'timestamp' as 
> their type in the Hive DDL. It is almost impossible to change the DDL to long, 
> as there are a large number of tables and columns. 
> We are happy to contribute if there is a clear path forward to support int96 
> in parquet-avro. Please also let me know if you are aware of a workaround in 
> Hive that can read int64 correctly as timestamp.
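
For reference, the INT64 mapping described above can be reproduced directly with
parquet-avro's schema converter. A minimal sketch (the record and field names
are illustrative):

{code:java}
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.parquet.avro.AvroSchemaConverter;
import org.apache.parquet.schema.MessageType;

public class TimestampMapping {
  public static void main(String[] args) {
    // Avro carries 'timestamp-millis' as a logical type on a LONG primitive...
    Schema ts = LogicalTypes.timestampMillis()
        .addToSchema(Schema.create(Schema.Type.LONG));
    Schema record = SchemaBuilder.record("Event").fields()
        .name("ts").type(ts).noDefault()
        .endRecord();

    // ...and parquet-avro maps that field to INT64 (TIMESTAMP_MILLIS).
    // No Avro type maps to INT96, which is why Hudi's output differs from Hive's.
    MessageType parquetSchema = new AvroSchemaConverter().convert(record);
    System.out.println(parquetSchema);
  }
}
{code}

(For what it's worth, plain Spark writes INT96 in the example above because
spark.sql.parquet.outputTimestampType defaults to INT96.)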



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1883) int96 support in parquet-avro

2020-10-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215801#comment-17215801
 ] 

ASF GitHub Bot commented on PARQUET-1883:
-----------------------------------------

anantdamle commented on pull request #821:
URL: https://github.com/apache/parquet-mr/pull/821#issuecomment-710758060


   Hi team,
   How should I proceed? We have many old Parquet files that still use INT96; 
this patch will at least help in reading those files with ParquetIO readers in 
Apache Beam.
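
For context, a rough sketch of the Beam read path referred to above, assuming
ParquetIO's Avro-schema-based reader (the schema and path here are illustrative):

{code:java}
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class ReadOldParquet {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // ParquetIO is backed by parquet-avro, so it needs an Avro schema;
    // since no Avro type maps to INT96, files whose timestamp columns are
    // INT96 cannot be read this way without a change like PR #821.
    Schema avroSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":"
            + "[{\"name\":\"ts\",\"type\":{\"type\":\"long\","
            + "\"logicalType\":\"timestamp-millis\"}}]}");

    PCollection<GenericRecord> records =
        p.apply(ParquetIO.read(avroSchema).from("hdfs://tmp/old-data/*.parquet"));

    p.run().waitUntilFinish();
  }
}
{code}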





[jira] [Commented] (PARQUET-1883) int96 support in parquet-avro

2020-09-30 Thread Anant Damle (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204948#comment-17204948
 ] 

Anant Damle commented on PARQUET-1883:
--------------------------------------

This impacts older Parquet files, and reading such files in Apache Beam 
pipelines.

Proposed fix: [PR-821|https://github.com/apache/parquet-mr/pull/821] 



[jira] [Commented] (PARQUET-1883) int96 support in parquet-avro

2020-09-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204947#comment-17204947
 ] 

ASF GitHub Bot commented on PARQUET-1883:
-----------------------------------------

anantdamle opened a new pull request #821:
URL: https://github.com/apache/parquet-mr/pull/821


   https://issues.apache.org/jira/browse/PARQUET-1883
   
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET-1883) issues and references 
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
   
   ### Tests
   
   - [ ] My PR adds the following unit tests 
  TestAvroSchemaConverter.java
  Updated: TestAvroConverters.java
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain Javadoc that 
explain what it does
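
   For background, the 12-byte INT96 layout such a reader has to handle is, by
   convention (as written by Hive, Impala, and Spark), a little-endian
   nanos-of-day followed by a little-endian Julian day number. A hedged sketch
   of that decoding (illustrative only, not this PR's actual code):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Int96Decode {
  // Decode a 12-byte Parquet INT96 timestamp into epoch milliseconds (UTC).
  // Conventional layout, as written by Hive/Impala/Spark:
  //   bytes 0-7  : little-endian nanoseconds within the day
  //   bytes 8-11 : little-endian Julian day number
  static long int96ToEpochMillis(byte[] int96) {
    ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
    long nanosOfDay = buf.getLong();      // bytes 0-7
    long julianDay = buf.getInt();        // bytes 8-11
    long epochDay = julianDay - 2440588L; // 2440588 = Julian day of 1970-01-01
    return epochDay * 86_400_000L + nanosOfDay / 1_000_000L;
  }
}
```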
   




[jira] [Commented] (PARQUET-1883) int96 support in parquet-avro

2020-07-10 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155322#comment-17155322
 ] 

Gabor Szadovszky commented on PARQUET-1883:
-------------------------------------------

[~sha...@uber.com], [~satishkotha],

INT96 is already deprecated; see PARQUET-323 and PARQUET-1870 for details. Hive 
has also implemented support for INT64 timestamps (see HIVE-21215), though 
unfortunately only in 4.0. (Impala has already moved to INT64 timestamps as 
well: IMPALA-5049.)
I would also like to mention that parquet-avro has never supported INT96 
timestamps, so this is not a regression. 



[jira] [Commented] (PARQUET-1883) int96 support in parquet-avro

2020-07-09 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154888#comment-17154888
 ] 

Xinli Shang commented on PARQUET-1883:
--------------------------------------

[~gszadovszky], do you still have links explaining why INT96 will be 
deprecated? And do you have a suggested workaround for this case? 
