lavakerreddy opened a new issue, #6278:
URL: https://github.com/apache/hudi/issues/6278
We upgraded our Hudi spark-submit jobs from EMR 5.33 to EMR 6.5, which runs
Spark 3.x, and then started hitting the errors below on date and timestamp
columns. Please let us know if anyone has faced a similar issue and whether
there is a resolution.
spark-submit \
--deploy-mode client \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
--conf spark.shuffle.service.enabled=true \
--conf spark.default.parallelism=500 \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.initialExecutors=3 \
--conf spark.dynamicAllocation.cachedExecutorIdleTimeout=90s \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.app.name=ETS_CUST \
--jars /usr/lib/spark/external/lib/spark-avro.jar,/usr/lib/hudi/hudi-utilities-bundle.jar \
--table-type MERGE_ON_READ \
--op INSERT \
--hoodie-conf hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://localhost:10000 \
--source-ordering-field dms_seq_no \
--props s3://ets-aws-daas-prod-resource/config/TOEFL/DMEREG02/ETS_CUST/ets_cust_full.properties \
--hoodie-conf hoodie.datasource.hive_sync.database=ets_aws_daas_raw_toefl_dmereg02 \
--target-base-path s3://ets-aws-daas-prod-raw/TOEFL/DMEREG02/ETS_CUST \
--target-table ETS_CUST \
--transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
--hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://ets-aws-daas-prod-landing/DMS/FULL/DMEREG02/ETS_CUST/ \
--source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
--enable-sync
22/08/02 16:24:48 INFO DAGScheduler: ShuffleMapStage 3 (countByKey at
BaseSparkCommitActionExecutor.java:175) failed in 27.903 s due to Job aborted
due to stage failure: Task 53 in stage 3.0 failed 4 times, most recent failure:
Lost task 53.3 in stage 3.0 (TID 105) (ip-172-31-26-128.ec2.internal executor
3): org.apache.spark.SparkUpgradeException: You may get a different result due
to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps
before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files
may be written by Spark 2.x or legacy versions of Hive, which uses a legacy
hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian
calendar. See more details in SPARK-31404. You can set
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the
datetime values w.r.t. the calendar difference during reading. Or set
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the
datetime values as it is.
at org.apache.spark.sql.execution.datasources.DataSourceUtils$.newRebaseExceptionInRead(DataSourceUtils.scala:159)
at org.apache.spark.sql.execution.datasources.DataSourceUtils.newRebaseExceptionInRead(DataSourceUtils.scala)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readLongsWithRebase(VectorizedPlainValuesReader.java:147)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongsWithRebase(VectorizedRleValuesReader.java:399)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:587)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:297)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:295)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:193)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:159)
at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:614)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:907)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:349)
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1440)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
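As the exception itself suggests, a likely workaround (a sketch, not a confirmed fix) is to pass the datetime rebase confs on the same spark-submit. Which value is correct depends on which engine wrote the source parquet files: files written by Spark 2.x or legacy Hive generally call for LEGACY, while files already in the Proleptic Gregorian calendar (e.g. written by most non-Spark tools such as DMS) call for CORRECTED. The avro conf is an assumption added here because the MOR table writes Avro log files; it may not be needed for this pipeline.

```shell
# Sketch: add the rebase-mode confs named in the SparkUpgradeException to the
# existing spark-submit. LEGACY rebases pre-1582-10-15 dates / pre-1900
# timestamps from the hybrid Julian+Gregorian calendar; CORRECTED reads the
# stored values as-is.
spark-submit \
  --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=LEGACY \
  --conf spark.sql.legacy.avro.datetimeRebaseModeInRead=LEGACY \
  ...  # rest of the original command unchanged
```

Note that on Spark 3.1+ (EMR 6.5 ships Spark 3.1.2) the canonical conf names drop the `legacy` segment (`spark.sql.parquet.datetimeRebaseModeInRead`), but the names printed in the error message are still accepted as aliases. If the job also rewrites old data, the matching `...datetimeRebaseModeInWrite` conf may be needed as well.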