[GitHub] [hudi] nikspatel03 opened a new issue #4947: [SUPPORT] Hudi 0.9 couldn't process older timestamp column from parquet files in delta streamer with Spark 3.1.2

GitBox Thu, 03 Mar 2022 11:45:38 -0800


nikspatel03 opened a new issue #4947:
URL: https://github.com/apache/hudi/issues/4947



   Hello Guys,
   
   We are trying to process parquet files which contains historical dates but 
we could not process with Delta streamer Hudi 0.9 version on EMR 6.5.0 (Spark 
3.1.2)
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Create parquet files with historical date/timestamp - Before 1900 year ( 
for ex. 1869-01-03 08:44:00 )
   
   ORDERID|ORDERTYPE|SUBMITTIME         |P_COMMENTS        |RESPONSEDATE       |
   -------|---------|-------------------|------------------|-------------------|
     11701|O        |2014-07-02 06:14:00|Order_11701_Online|2014-07-02 07:18:00|
     11702|S        |2015-11-05 06:14:00|Order_11702_Store |2015-11-06 09:11:00|
     11703|O        |1869-01-03 08:44:00|Order_11703_Online|1869-01-05 10:44:00|
     
   2.  Create EMR 6.5.0 cluster
   
   3. Submit Hudi DeltaStreamer Job 
   
   spark-submit 
   --deploy-mode cluster 
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer 
   --conf spark.shuffle.service.enabled=true 
   --conf spark.dynamicAllocation.enabled=true 
   --conf spark.sql.hive.convertMetastoreParquet=false 
   --conf spark.app.name=test_full_load 
   --jars /usr/lib/spark/external/lib/spark-avro.jar 
/usr/lib/hudi/hudi-utilities-bundle.jar 
   --table-type COPY_ON_WRITE 
   --op INSERT 
   --source-ordering-field seq_no 
   --hoodie-conf hoodie.insert.shuffle.parallelism=25 
   --target-base-path s3://<YOUR-BUCKET>/TEST_HUDI_ORDER_TABLE 
   --target-table TEST_HUDI_ORDER_TABLE 
   --transformer-class 
org.apache.hudi.utilities.transform.SqlQueryBasedTransformer 
   --source-class org.apache.hudi.utilities.sources.ParquetDFSSource 
   --hoodie-conf hoodie.deltastreamer.source.dfs.root=<PATH_TO_PARQUET_FILE>
   --hoodie-conf hoodie.deltastreamer.transformer.sql="select 1==2 AS 
_hoodie_is_deleted, 'I' as Op, * from <SRC>"
   --hoodie-conf hoodie.datasource.write.recordkey.field=ORDERID
   --hoodie-conf hoodie.datasource.write.partitionpath.field=
   --hoodie-conf 
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
   --hoodie-conf hoodie.datasource.hive_sync.database=test_db
   --hoodie-conf hoodie.datasource.hive_sync.table=TEST_HUDI_ORDER_TABLE
   --hoodie-conf hoodie.datasource.hive_sync.enable=true
   --hoodie-conf hoodie.datasource.hive_sync.assume_date_partitioning=false
   --hoodie-conf 
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor
   --hoodie-conf 
hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://10.67.28.103:10000
   --hoodie-conf hoodie.datasource.hive_sync.support_timestamp=false
   --enable-sync
   
   
   4.  Job will fail with following exception: 
   
   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 
4) (ip-10-67-28-48.ec2.internal executor 9): 
org.apache.spark.SparkUpgradeException: You may get a different result due to 
the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps 
before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files 
may be written by Spark 2.x or legacy versions of Hive, which uses a legacy 
hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian 
calendar. See more details in SPARK-31404. You can set 
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the 
datetime values w.r.t. the calendar difference during reading. Or set 
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the 
datetime values as it is.
        at 
org.apache.spark.sql.execution.datasources.DataSourceUtils$.newRebaseExceptionInRead(DataSourceUtils.scala:159)
   
   **Expected behavior**
   
   Hudi job should process parquet file records.  It is working fine in EMR 
5.33.0 (Spark 2.4.7, Hudi 0.7)
   
   
   **Environment Description**
   
   * EMR Version : 6.5.0
   
   * Hudi version : 0.9.0-amzn-1
   
   * Spark version : 3.1.2
   
   * Hive version : 3.1.2
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   It is also not working after setting spark-sql configs mentioned in above 
exception.  
   Probable Reason: Spark-sql configs not taking into consideration if we use 
RDD API - it'll only take into affect if we use DATAFRAME API. 
   
   
   **Stacktrace**
   
   ```
   
   Driver stacktrace:
        at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2470)
        at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2419)
        at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2418)
        at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2418)
        at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1125)
        at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1125)
        at scala.Option.foreach(Option.scala:407)
        at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1125)
        at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2684)
        at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2626)
        at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2615)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:914)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2241)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2262)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2281)
        at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1449)
        at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
        at org.apache.spark.rdd.RDD.take(RDD.scala:1422)
        at org.apache.spark.rdd.RDD.$anonfun$isEmpty$1(RDD.scala:1557)
        at 
scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
        at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
        at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1557)
        at org.apache.spark.api.java.JavaRDDLike.isEmpty(JavaRDDLike.scala:545)
        at org.apache.spark.api.java.JavaRDDLike.isEmpty$(JavaRDDLike.scala:545)
        at 
org.apache.spark.api.java.AbstractJavaRDDLike.isEmpty(JavaRDDLike.scala:45)
        at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:437)
        at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:280)
        at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:186)
        at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
        at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:184)
        at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:513)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:735)
   Caused by: org.apache.spark.SparkUpgradeException: You may get a different 
result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or 
timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as 
the files may be written by Spark 2.x or legacy versions of Hive, which uses a 
legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian 
calendar. See more details in SPARK-31404. You can set 
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the 
datetime values w.r.t. the calendar difference during reading. Or set 
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the 
datetime values as it is.
        at 
org.apache.spark.sql.execution.datasources.DataSourceUtils$.newRebaseExceptionInRead(DataSourceUtils.scala:159)
        at 
org.apache.spark.sql.execution.datasources.DataSourceUtils.newRebaseExceptionInRead(DataSourceUtils.scala)
        at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readLongsWithRebase(VectorizedPlainValuesReader.java:147)
        at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongsWithRebase(VectorizedRleValuesReader.java:399)
        at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:587)
        at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:297)
        at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:295)
        at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:193)
        at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
        at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:159)
        at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:244)
        at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:159)
        at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:614)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
 Source)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
        at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35)
        at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:907)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at scala.collection.Iterator.isEmpty(Iterator.scala:385)
        at scala.collection.Iterator.isEmpty$(Iterator.scala:385)
        at scala.collection.AbstractIterator.isEmpty(Iterator.scala:1429)
        at 
org.apache.hudi.HoodieSparkUtils$.$anonfun$createRddInternal$2(HoodieSparkUtils.scala:135)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
        at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
   22/03/03 19:09:46 INFO ApplicationMaster: Final app status: FAILED, 
exitCode: 15, (reason: User class threw exception: 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 
4) (ip-10-67-28-48.ec2.internal executor 9): 
org.apache.spark.SparkUpgradeException: You may get a different result due to 
the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps 
before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files 
may be written by Spark 2.x or legacy versions of Hive, which uses a legacy 
hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian 
calendar. See more details in SPARK-31404. You can set 
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the 
datetime values w.r.t. the calendar difference during reading. Or set 
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the 
datetime values as 
 it is.
        at 
org.apache.spark.sql.execution.datasources.DataSourceUtils$.newRebaseExceptionInRead(DataSourceUtils.scala:159)
        at 
org.apache.spark.sql.execution.datasources.DataSourceUtils.newRebaseExceptionInRead(DataSourceUtils.scala)
        at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readLongsWithRebase(VectorizedPlainValuesReader.java:147)
        at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongsWithRebase(VectorizedRleValuesReader.java:399)
        at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:587)
        at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:297)
        at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:295)
        at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:193)
        at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
        at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:159)
        at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:244)
        at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:159)
        at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:614)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
 Source)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
        at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35)
        at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:907)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at scala.collection.Iterator.isEmpty(Iterator.scala:385)
        at scala.collection.Iterator.isEmpty$(Iterator.scala:385)
        at scala.collection.AbstractIterator.isEmpty(Iterator.scala:1429)
        at 
org.apache.hudi.HoodieSparkUtils$.$anonfun$createRddInternal$2(HoodieSparkUtils.scala:135)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
        at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
   
   Driver stacktrace:
        at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2470)
        at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2419)
        at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2418)
        at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2418)
        at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1125)
        at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1125)
        at scala.Option.foreach(Option.scala:407)
        at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1125)
        at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2684)
        at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2626)
        at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2615)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:914)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2241)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2262)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2281)
        at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1449)
        at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
        at org.apache.spark.rdd.RDD.take(RDD.scala:1422)
        at org.apache.spark.rdd.RDD.$anonfun$isEmpty$1(RDD.scala:1557)
        at 
scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
        at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
        at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1557)
        at org.apache.spark.api.java.JavaRDDLike.isEmpty(JavaRDDLike.scala:545)
        at org.apache.spark.api.java.JavaRDDLike.isEmpty$(JavaRDDLike.scala:545)
        at 
org.apache.spark.api.java.AbstractJavaRDDLike.isEmpty(JavaRDDLike.scala:45)
        at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:437)
        at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:280)
        at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:186)
        at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
        at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:184)
        at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:513)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:735)
   Caused by: org.apache.spark.SparkUpgradeException: You may get a different 
result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or 
timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as 
the files may be written by Spark 2.x or legacy versions of Hive, which uses a 
legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian 
calendar. See more details in SPARK-31404. You can set 
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the 
datetime values w.r.t. the calendar difference during reading. Or set 
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the 
datetime values as it is.
        at 
org.apache.spark.sql.execution.datasources.DataSourceUtils$.newRebaseExceptionInRead(DataSourceUtils.scala:159)
        at 
org.apache.spark.sql.execution.datasources.DataSourceUtils.newRebaseExceptionInRead(DataSourceUtils.scala)
        at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readLongsWithRebase(VectorizedPlainValuesReader.java:147)
        at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongsWithRebase(VectorizedRleValuesReader.java:399)
        at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:587)
        at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:297)
        at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:295)
        at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:193)
        at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
        at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:159)
        at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:244)
        at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:159)
        at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:614)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
 Source)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
        at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35)
        at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:907)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at scala.collection.Iterator.isEmpty(Iterator.scala:385)
        at scala.collection.Iterator.isEmpty$(Iterator.scala:385)
        at scala.collection.AbstractIterator.isEmpty(Iterator.scala:1429)
        at 
org.apache.hudi.HoodieSparkUtils$.$anonfun$createRddInternal$2(HoodieSparkUtils.scala:135)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
        at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
   )
   22/03/03 19:09:46 INFO ShutdownHookManager: Shutdown hook called
   22/03/03 19:09:46 INFO ShutdownHookManager: Deleting directory 
/mnt/yarn/usercache/hadoop/appcache/application_1646326357878_0008/spark-50f460d2-56aa-409d-a19b-09918f345c65```
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [hudi] nikspatel03 opened a new issue #4947: [SUPPORT] Hudi 0.9 couldn't process older timestamp column from parquet files in delta streamer with Spark 3.1.2

Reply via email to