nikspatel03 opened a new issue #4947:
URL: https://github.com/apache/hudi/issues/4947
Hello Guys,
We are trying to process parquet files which contains historical dates but
we could not process with Delta streamer Hudi 0.9 version on EMR 6.5.0 (Spark
3.1.2)
**To Reproduce**
Steps to reproduce the behavior:
1. Create parquet files with historical date/timestamp - Before 1900 year (
for ex. 1869-01-03 08:44:00 )
ORDERID|ORDERTYPE|SUBMITTIME |P_COMMENTS |RESPONSEDATE |
-------|---------|-------------------|------------------|-------------------|
11701|O |2014-07-02 06:14:00|Order_11701_Online|2014-07-02 07:18:00|
11702|S |2015-11-05 06:14:00|Order_11702_Store |2015-11-06 09:11:00|
11703|O |1869-01-03 08:44:00|Order_11703_Online|1869-01-05 10:44:00|
2. Create EMR 6.5.0 cluster
3. Submit Hudi DeltaStreamer Job
spark-submit
--deploy-mode cluster
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
--conf spark.shuffle.service.enabled=true
--conf spark.dynamicAllocation.enabled=true
--conf spark.sql.hive.convertMetastoreParquet=false
--conf spark.app.name=test_full_load
--jars /usr/lib/spark/external/lib/spark-avro.jar
/usr/lib/hudi/hudi-utilities-bundle.jar
--table-type COPY_ON_WRITE
--op INSERT
--source-ordering-field seq_no
--hoodie-conf hoodie.insert.shuffle.parallelism=25
--target-base-path s3://<YOUR-BUCKET>/TEST_HUDI_ORDER_TABLE
--target-table TEST_HUDI_ORDER_TABLE
--transformer-class
org.apache.hudi.utilities.transform.SqlQueryBasedTransformer
--source-class org.apache.hudi.utilities.sources.ParquetDFSSource
--hoodie-conf hoodie.deltastreamer.source.dfs.root=<PATH_TO_PARQUET_FILE>
--hoodie-conf hoodie.deltastreamer.transformer.sql="select 1==2 AS
_hoodie_is_deleted, 'I' as Op, * from <SRC>"
--hoodie-conf hoodie.datasource.write.recordkey.field=ORDERID
--hoodie-conf hoodie.datasource.write.partitionpath.field=
--hoodie-conf
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
--hoodie-conf hoodie.datasource.hive_sync.database=test_db
--hoodie-conf hoodie.datasource.hive_sync.table=TEST_HUDI_ORDER_TABLE
--hoodie-conf hoodie.datasource.hive_sync.enable=true
--hoodie-conf hoodie.datasource.hive_sync.assume_date_partitioning=false
--hoodie-conf
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor
--hoodie-conf
hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://10.67.28.103:10000
--hoodie-conf hoodie.datasource.hive_sync.support_timestamp=false
--enable-sync
4. Job will fail with following exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID
4) (ip-10-67-28-48.ec2.internal executor 9):
org.apache.spark.SparkUpgradeException: You may get a different result due to
the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps
before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files
may be written by Spark 2.x or legacy versions of Hive, which uses a legacy
hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian
calendar. See more details in SPARK-31404. You can set
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the
datetime values w.r.t. the calendar difference during reading. Or set
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the
datetime values as it is.
at
org.apache.spark.sql.execution.datasources.DataSourceUtils$.newRebaseExceptionInRead(DataSourceUtils.scala:159)
**Expected behavior**
Hudi job should process parquet file records. It is working fine in EMR
5.33.0 (Spark 2.4.7, Hudi 0.7)
**Environment Description**
* EMR Version : 6.5.0
* Hudi version : 0.9.0-amzn-1
* Spark version : 3.1.2
* Hive version : 3.1.2
* Hadoop version : 3.2.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
**Additional context**
It is also not working after setting spark-sql configs mentioned in above
exception.
Probable Reason: Spark-sql configs not taking into consideration if we use
RDD API - it'll only take into affect if we use DATAFRAME API.
**Stacktrace**
```
Driver stacktrace:
at
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2470)
at
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2419)
at
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2418)
at
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2418)
at
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1125)
at
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1125)
at scala.Option.foreach(Option.scala:407)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1125)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2684)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2626)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2615)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:914)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2241)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2262)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2281)
at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1449)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
at org.apache.spark.rdd.RDD.take(RDD.scala:1422)
at org.apache.spark.rdd.RDD.$anonfun$isEmpty$1(RDD.scala:1557)
at
scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1557)
at org.apache.spark.api.java.JavaRDDLike.isEmpty(JavaRDDLike.scala:545)
at org.apache.spark.api.java.JavaRDDLike.isEmpty$(JavaRDDLike.scala:545)
at
org.apache.spark.api.java.AbstractJavaRDDLike.isEmpty(JavaRDDLike.scala:45)
at
org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:437)
at
org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:280)
at
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:186)
at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
at
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:184)
at
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:513)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:735)
Caused by: org.apache.spark.SparkUpgradeException: You may get a different
result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or
timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as
the files may be written by Spark 2.x or legacy versions of Hive, which uses a
legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian
calendar. See more details in SPARK-31404. You can set
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the
datetime values w.r.t. the calendar difference during reading. Or set
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the
datetime values as it is.
at
org.apache.spark.sql.execution.datasources.DataSourceUtils$.newRebaseExceptionInRead(DataSourceUtils.scala:159)
at
org.apache.spark.sql.execution.datasources.DataSourceUtils.newRebaseExceptionInRead(DataSourceUtils.scala)
at
org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readLongsWithRebase(VectorizedPlainValuesReader.java:147)
at
org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongsWithRebase(VectorizedRleValuesReader.java:399)
at
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:587)
at
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:297)
at
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:295)
at
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:193)
at
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:159)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:244)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:159)
at
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:614)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
Source)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
Source)
at
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35)
at
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:907)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator.isEmpty(Iterator.scala:385)
at scala.collection.Iterator.isEmpty$(Iterator.scala:385)
at scala.collection.AbstractIterator.isEmpty(Iterator.scala:1429)
at
org.apache.hudi.HoodieSparkUtils$.$anonfun$createRddInternal$2(HoodieSparkUtils.scala:135)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
at
org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
22/03/03 19:09:46 INFO ApplicationMaster: Final app status: FAILED,
exitCode: 15, (reason: User class threw exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID
4) (ip-10-67-28-48.ec2.internal executor 9):
org.apache.spark.SparkUpgradeException: You may get a different result due to
the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps
before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files
may be written by Spark 2.x or legacy versions of Hive, which uses a legacy
hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian
calendar. See more details in SPARK-31404. You can set
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the
datetime values w.r.t. the calendar difference during reading. Or set
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the
datetime values as
it is.
at
org.apache.spark.sql.execution.datasources.DataSourceUtils$.newRebaseExceptionInRead(DataSourceUtils.scala:159)
at
org.apache.spark.sql.execution.datasources.DataSourceUtils.newRebaseExceptionInRead(DataSourceUtils.scala)
at
org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readLongsWithRebase(VectorizedPlainValuesReader.java:147)
at
org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongsWithRebase(VectorizedRleValuesReader.java:399)
at
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:587)
at
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:297)
at
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:295)
at
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:193)
at
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:159)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:244)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:159)
at
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:614)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
Source)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
Source)
at
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35)
at
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:907)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator.isEmpty(Iterator.scala:385)
at scala.collection.Iterator.isEmpty$(Iterator.scala:385)
at scala.collection.AbstractIterator.isEmpty(Iterator.scala:1429)
at
org.apache.hudi.HoodieSparkUtils$.$anonfun$createRddInternal$2(HoodieSparkUtils.scala:135)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
at
org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Driver stacktrace:
at
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2470)
at
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2419)
at
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2418)
at
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2418)
at
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1125)
at
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1125)
at scala.Option.foreach(Option.scala:407)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1125)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2684)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2626)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2615)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:914)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2241)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2262)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2281)
at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1449)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
at org.apache.spark.rdd.RDD.take(RDD.scala:1422)
at org.apache.spark.rdd.RDD.$anonfun$isEmpty$1(RDD.scala:1557)
at
scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1557)
at org.apache.spark.api.java.JavaRDDLike.isEmpty(JavaRDDLike.scala:545)
at org.apache.spark.api.java.JavaRDDLike.isEmpty$(JavaRDDLike.scala:545)
at
org.apache.spark.api.java.AbstractJavaRDDLike.isEmpty(JavaRDDLike.scala:45)
at
org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:437)
at
org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:280)
at
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:186)
at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
at
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:184)
at
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:513)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:735)
Caused by: org.apache.spark.SparkUpgradeException: You may get a different
result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or
timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as
the files may be written by Spark 2.x or legacy versions of Hive, which uses a
legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian
calendar. See more details in SPARK-31404. You can set
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the
datetime values w.r.t. the calendar difference during reading. Or set
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the
datetime values as it is.
at
org.apache.spark.sql.execution.datasources.DataSourceUtils$.newRebaseExceptionInRead(DataSourceUtils.scala:159)
at
org.apache.spark.sql.execution.datasources.DataSourceUtils.newRebaseExceptionInRead(DataSourceUtils.scala)
at
org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readLongsWithRebase(VectorizedPlainValuesReader.java:147)
at
org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongsWithRebase(VectorizedRleValuesReader.java:399)
at
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:587)
at
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:297)
at
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:295)
at
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:193)
at
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:159)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:244)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:159)
at
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:614)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
Source)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
Source)
at
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35)
at
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:907)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator.isEmpty(Iterator.scala:385)
at scala.collection.Iterator.isEmpty$(Iterator.scala:385)
at scala.collection.AbstractIterator.isEmpty(Iterator.scala:1429)
at
org.apache.hudi.HoodieSparkUtils$.$anonfun$createRddInternal$2(HoodieSparkUtils.scala:135)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
at
org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
)
22/03/03 19:09:46 INFO ShutdownHookManager: Shutdown hook called
22/03/03 19:09:46 INFO ShutdownHookManager: Deleting directory
/mnt/yarn/usercache/hadoop/appcache/application_1646326357878_0008/spark-50f460d2-56aa-409d-a19b-09918f345c65```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]