[jira] [Commented] (HIVE-26612) Hive cannot read parquet files with int64 (TIMESTAMP_MILLIS)

Steve Carlin (Jira) Mon, 17 Oct 2022 09:45:32 -0700


    [ 
https://issues.apache.org/jira/browse/HIVE-26612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17619000#comment-17619000
 ]


Steve Carlin commented on HIVE-26612:
-------------------------------------

A fuller explanation of my last comment:

When I initially saw the problem, I tried deleting the line that routed the 
read to the timestamp converter.  The read then went directly to the 
BIGINT->BIGINT ETypeConverter, and this fixed the issue.

I had improperly assumed that HIVE-23345 was the change that broke the 
functionality.  I didn't research that because I didn't think it was 
important enough.  I knew that the customer claimed it had been working, and 
I knew that the correct ETypeConverter exists to get the customer the right 
information.  My apologies for not doing the research, but I still don't think 
it's important enough to track down what caused this, since I will take the 
customer at their word here.

The fix basically routes the code to the correct ETypeConverter.  If we had 
to add a new ETypeConverter or change any code in that file, I'd be a bit more 
wary about the solution.  But this is a very small fix that uses the proper 
ETypeConverter, presumably the same ETypeConverter that was being used in the 
customer environment, and that makes me comfortable with the idea of giving 
the customer this fix.
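
To make the routing idea concrete, here is a minimal, self-contained sketch. 
It is not the Hive patch: every type and method name in it 
(ConverterRoutingSketch, SketchConverter, routeByAnnotationOnly, 
routeWithHiveType) is hypothetical. It only assumes what the stack trace 
shows, namely that a converter which does not override addLong rejects an 
int64 value with an UnsupportedOperationException carrying the converter's 
class name.

```java
import java.util.function.LongConsumer;

// Illustrative stand-ins only; none of these types are Hive or Parquet classes.
class ConverterRoutingSketch {

    enum HiveType { BIGINT, TIMESTAMP }

    // Like Parquet's PrimitiveConverter, the base class rejects any value
    // kind that a subclass has not explicitly opted into.
    abstract static class SketchConverter {
        void addLong(long value) {
            throw new UnsupportedOperationException(getClass().getName());
        }
    }

    // Assumption: a timestamp-oriented converter that does not accept raw
    // longs, mirroring the UnsupportedOperationException in the stack trace.
    static final class TimestampSketchConverter extends SketchConverter { }

    // Passes the int64 value through unchanged (the BIGINT->BIGINT case).
    static final class LongSketchConverter extends SketchConverter {
        private final LongConsumer sink;

        LongSketchConverter(LongConsumer sink) {
            this.sink = sink;
        }

        @Override
        void addLong(long value) {
            sink.accept(value);
        }
    }

    // Routing that looks only at the Parquet annotation on the column.
    static SketchConverter routeByAnnotationOnly(boolean timestampAnnotated,
                                                 LongConsumer sink) {
        return timestampAnnotated
            ? new TimestampSketchConverter()
            : new LongSketchConverter(sink);
    }

    // Routing that also honours the Hive column type: a BIGINT column always
    // gets the pass-through converter, whatever the Parquet annotation says.
    static SketchConverter routeWithHiveType(HiveType hiveType,
                                             boolean timestampAnnotated,
                                             LongConsumer sink) {
        if (hiveType == HiveType.BIGINT) {
            return new LongSketchConverter(sink);
        }
        return routeByAnnotationOnly(timestampAnnotated, sink);
    }

    public static void main(String[] args) {
        long millis = 1388617201000L; // arbitrary epoch-millis sample value

        try {
            routeByAnnotationOnly(true, v -> { }).addLong(millis);
        } catch (UnsupportedOperationException e) {
            System.out.println("annotation-only routing rejected the long: "
                + e.getMessage());
        }

        routeWithHiveType(HiveType.BIGINT, true,
            v -> System.out.println("read as BIGINT: " + v)).addLong(millis);
    }
}
```

The point of the sketch is that no converter is added or modified; only the 
choice of an existing converter changes, which is why the fix carries little 
risk.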

> Hive cannot read parquet files with int64 (TIMESTAMP_MILLIS)
> ------------------------------------------------------------
>
>                 Key: HIVE-26612
>                 URL: https://issues.apache.org/jira/browse/HIVE-26612
>             Project: Hive
>          Issue Type: Bug
>          Components: Database/Schema
>            Reporter: Steve Carlin
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> If a parquet file has a Type of "int64 eventtime (TIMESTAMP(MILLIS,true))", 
> the following error is produced:
> {noformat}
> java.lang.RuntimeException: java.io.IOException: 
> org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in 
> block 0 in file 
> file:/xxxx/hive/itests/qtest/target/tmp/parquet_format_ts_as_bigint/part-00000/timestamp_as_bigint.parquet
>       at 
> org.apache.hadoop.hive.ql.exec.FetchTask.executeInner(FetchTask.java:213)
>       at org.apache.hadoop.hive.ql.exec.FetchTask.execute(FetchTask.java:98)
>       at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:212)
>       at org.apache.hadoop.hive.ql.Driver.run(Driver.java:154)
>       at org.apache.hadoop.hive.ql.Driver.run(Driver.java:149)
> Caused by: java.io.IOException: 
> org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in 
> block 0 in file 
> file:/xxxx/hive/itests/qtest/target/tmp/parquet_format_ts_as_bigint/part-00000/timestamp_as_bigint.parquet
>       at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:624)
>       at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:531)
>       at 
> org.apache.hadoop.hive.ql.exec.FetchTask.executeInner(FetchTask.java:197)
>       ... 55 more
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value 
> at 1 in block 0 in file 
> file:/home/stamatis/Projects/Apache/hive/itests/qtest/target/tmp/parquet_format_ts_as_bigint/part-00000/timestamp_as_bigint.parquet
>       at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:255)
>       at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
>       at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:87)
>       at 
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:89)
>       at 
> org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:771)
>       at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:335)
>       at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:562)
>       ... 57 more
> Caused by: java.lang.UnsupportedOperationException: 
> org.apache.hadoop.hive.ql.io.parquet.convert.ETypeConverter$10$1
>       at 
> org.apache.parquet.io.api.PrimitiveConverter.addLong(PrimitiveConverter.java:105)
>       at 
> org.apache.parquet.column.impl.ColumnReaderBase$2$4.writeValue(ColumnReaderBase.java:301)
>       at 
> org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:410)
>       at 
> org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
>       at 
> org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
>       at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:230)
>       ... 63 more
> {noformat}
> The parquet file can be created with the following steps (through spark):
> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
> spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "LEGACY")
> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")
> spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY")
> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
> [1]
> import java.sql.Timestamp  // needed for Timestamp.valueOf
> val df = Seq(
>   (1, Timestamp.valueOf("2014-01-01 23:00:01")),
>   (1, Timestamp.valueOf("2014-11-30 12:40:32")),
>   (2, Timestamp.valueOf("2016-12-29 09:54:00")),
>   (2, Timestamp.valueOf("2016-05-09 10:12:43"))
> ).toDF("typeid","eventtime")
> [2]
> [root@c4839-node3 test_parquet2]# parquet-tools schema 
> part-00001-6c90b794-90b9-4cc0-afc5-2e49a4e96bad-c000.snappy.parquet
> message spark_schema {
> required int32 typeid;
> optional int64 eventtime (TIMESTAMP(MILLIS,true));
> }
> [3]
> [root@c4839-node3 test_parquet1]# parquet-tools schema 
> part-00001-cb1aeebb-ec87-4273-82ec-911c4fb605b6-c000.snappy.parquet
> message spark_schema {
> required int32 typeid;
> optional int96 eventtime;
> }



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
