[ https://issues.apache.org/jira/browse/HUDI-83?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17060915#comment-17060915 ]
Cosmin Iordache commented on HUDI-83:
-------------------------------------
I'm going to add some more info here regarding backwards compatibility.
Continuing the investigation: I created a dataset with the previous versions
(Spark 2.3.3, Hudi 0.5.0).
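For context, a write along these lines produces such a dataset through the Spark
DataSource API (a minimal sketch; the record key, partition field, precombine field
and target path below are my own illustrations, not values from this issue):

df.write.format("org.apache.hudi")
  .option("hoodie.table.name", "test_record")
  .option("hoodie.datasource.write.recordkey.field", "id")             // assumed key field
  .option("hoodie.datasource.write.partitionpath.field", "partition")  // assumed partition field
  .option("hoodie.datasource.write.precombine.field", "timestamp_1")   // assumed precombine field
  .mode("overwrite")
  .save("hdfs://namenode:8020/data/lake/test_record")                  // assumed target path

The write logged the registered Avro schema: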
15:28:22.469 [DataLakeSystem-akka.actor.default-dispatcher-4] INFO org.apache.hudi.HoodieSparkSqlWriter$ - Registered avro schema : {
  "type" : "record",
  "name" : "test_record",
  "namespace" : "hoodie.test",
  "fields" : [
    ... {
      "name" : "other_date",
      "type" : [ "long", "null" ]
    }, {
      "name" : "timestamp_1",
      "type" : [ "long", "null" ]
    },
    ...
  ]
}
The schema above shows the timestamp saved as a plain long (as expected).
I then ran a new upsert with the same DataFrame on Spark 2.4.4 / Hudi 0.5.1,
as sketched below.
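A minimal sketch of that second write (again, the option values and path are
assumed, not taken from this issue):

df.write.format("org.apache.hudi")
  .option("hoodie.table.name", "test_record")
  .option("hoodie.datasource.write.operation", "upsert")  // upsert of the same DataFrame
  .mode("append")
  .save("hdfs://namenode:8020/data/lake/test_record")     // assumed target path

This time the registered schema carried a logical type on the timestamp fields: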
{
  "type" : "record",
  "name" : "test_record",
  "namespace" : "hoodie.test",
  "fields" : [
    ... {
      "name" : "other_date",
      "type" : [ {
        "type" : "long",
        "logicalType" : "timestamp-micros"
      }, "null" ]
    }, {
      "name" : "timestamp_1",
      "type" : [ {
        "type" : "long",
        "logicalType" : "timestamp-micros"
      }, "null" ]
    },
    ...
  ]
}
Ingestion worked, but reading the data back failed:
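(Here q1 is the table read back into a DataFrame; a minimal sketch of how it
could have been obtained, with the load path assumed:)

scala> val q1 = spark.read.format("org.apache.hudi").
     |   load("hdfs://namenode:8020/data/lake/test_record/*/*")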
scala> q1.select("other_date","timestamp_1").show()
53277 [Executor task launch worker for task 3] ERROR org.apache.spark.executor.Executor - Exception in task 1.0 in stage 2.0 (TID 3)
org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file hdfs://namenode:8020/data/lake/3aa1f8f7-63a4-4aa4-986e-0ed835601999/converted/io_yields_hashkey/-2/8bddbe9e-d0d5-492a-b96d-9b196322a317-0_0-49-79_20200316152824.parquet. Column: [other_date], Expected: timestamp, Found: INT64
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:636)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:255)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:836)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:836)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.constructConvertNotSupportedException(VectorizedColumnReader.java:250)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:440)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:208)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
    ... 22 more
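The reader error suggests a mix of physical schemas across files: data written with
Hudi 0.5.0 stores these columns as plain INT64, while the table schema after the
0.5.1 upsert carries the timestamp-micros logical type, and Spark's vectorized
Parquet reader refuses that conversion. Reading each data file on its own shows that
file's footer schema, so the two generations can be compared directly (a sketch; the
file paths are placeholders, not real paths from this issue):

scala> spark.read.parquet("hdfs://namenode:8020/data/lake/.../<old-file>.parquet").printSchema()
// expected to report other_date and timestamp_1 as long
scala> spark.read.parquet("hdfs://namenode:8020/data/lake/.../<new-file>.parquet").printSchema()
// expected to report other_date and timestamp_1 as timestamp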
> Map Timestamp type in spark to corresponding Timestamp type in Hive during
> Hive sync
> ------------------------------------------------------------------------------------
>
> Key: HUDI-83
> URL: https://issues.apache.org/jira/browse/HUDI-83
> Project: Apache Hudi (incubating)
> Issue Type: Bug
> Components: Hive Integration, Usability
> Reporter: Vinoth Chandar
> Priority: Major
> Fix For: 0.6.0
>
>
> [https://github.com/apache/incubator-hudi/issues/543] & related issues