[ https://issues.apache.org/jira/browse/HUDI-83?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17060915#comment-17060915 ]
Cosmin Iordache commented on HUDI-83:
-------------------------------------
I'm going to add some more info here regarding backwards compatibility.
Continuing the investigation: I created a dataset with the previous versions
(Spark 2.3.3, Hudi 0.5.0).
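For context, a write along these lines produces such a dataset through the Spark
DataSource API (a minimal sketch; the record key, partition field, precombine field
and target path below are my own illustrations, not values from this issue):

df.write.format("org.apache.hudi")
  .option("hoodie.table.name", "test_record")
  .option("hoodie.datasource.write.recordkey.field", "id")             // assumed key field
  .option("hoodie.datasource.write.partitionpath.field", "partition")  // assumed partition field
  .option("hoodie.datasource.write.precombine.field", "timestamp_1")   // assumed precombine field
  .mode("overwrite")
  .save("hdfs://namenode:8020/data/lake/test_record")                  // assumed target path

The write logged the registered Avro schema: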
15:28:22.469 [DataLakeSystem-akka.actor.default-dispatcher-4] INFO org.apache.hudi.HoodieSparkSqlWriter$ - Registered avro schema : {
  "type" : "record",
  "name" : "test_record",
  "namespace" : "hoodie.test",
  "fields" : [
    ... {
      "name" : "other_date",
      "type" : [ "long", "null" ]
    }, {
      "name" : "timestamp_1",
      "type" : [ "long", "null" ]
    },
    ...
  ]
}
The schema above shows the timestamp saved as a plain long (as expected).
I then ran a new upsert with the same DataFrame on Spark 2.4.4 / Hudi 0.5.1,
as sketched below.
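A minimal sketch of that second write (again, the option values and path are
assumed, not taken from this issue):

df.write.format("org.apache.hudi")
  .option("hoodie.table.name", "test_record")
  .option("hoodie.datasource.write.operation", "upsert")  // upsert of the same DataFrame
  .mode("append")
  .save("hdfs://namenode:8020/data/lake/test_record")     // assumed target path

This time the registered schema carried a logical type on the timestamp fields: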
{
  "type" : "record",
  "name" : "test_record",
  "namespace" : "hoodie.test",
  "fields" : [
    ... {
      "name" : "other_date",
      "type" : [ {
        "type" : "long",
        "logicalType" : "timestamp-micros"
      }, "null" ]
    }, {
      "name" : "timestamp_1",
      "type" : [ {
        "type" : "long",
        "logicalType" : "timestamp-micros"
      }, "null" ]
    },
    ...
  ]
}
Ingestion worked, but reading the data back failed:
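(Here q1 is the table read back into a DataFrame; a minimal sketch of how it
could have been obtained, with the load path assumed:)

scala> val q1 = spark.read.format("org.apache.hudi").
     |   load("hdfs://namenode:8020/data/lake/test_record/*/*")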
scala> q1.select("other_date","timestamp_1").show()
53277 [Executor task launch worker for task 3] ERROR org.apache.spark.executor.Executor - Exception in task 1.0 in stage 2.0 (TID 3)
org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file hdfs://namenode:8020/data/lake/3aa1f8f7-63a4-4aa4-986e-0ed835601999/converted/io_yields_hashkey/-2/8bddbe9e-d0d5-492a-b96d-9b196322a317-0_0-49-79_20200316152824.parquet. Column: [other_date], Expected: timestamp, Found: INT64
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:636)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:255)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:836)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:836)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.constructConvertNotSupportedException(VectorizedColumnReader.java:250)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:440)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:208)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
    ... 22 more
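The reader error suggests a mix of physical schemas across files: data written with
Hudi 0.5.0 stores these columns as plain INT64, while the table schema after the
0.5.1 upsert carries the timestamp-micros logical type, and Spark's vectorized
Parquet reader refuses that conversion. Reading each data file on its own shows that
file's footer schema, so the two generations can be compared directly (a sketch; the
file paths are placeholders, not real paths from this issue):

scala> spark.read.parquet("hdfs://namenode:8020/data/lake/.../<old-file>.parquet").printSchema()
// expected to report other_date and timestamp_1 as long
scala> spark.read.parquet("hdfs://namenode:8020/data/lake/.../<new-file>.parquet").printSchema()
// expected to report other_date and timestamp_1 as timestamp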
> Map Timestamp type in spark to corresponding Timestamp type in Hive during
> Hive sync
> ------------------------------------------------------------------------------------
>
> Key: HUDI-83
> URL: https://issues.apache.org/jira/browse/HUDI-83
> Project: Apache Hudi (incubating)
> Issue Type: Bug
> Components: Hive Integration, Usability
> Reporter: Vinoth Chandar
> Priority: Major
> Fix For: 0.6.0
>
>
> [https://github.com/apache/incubator-hudi/issues/543] & related issues