lavakerreddy opened a new issue, #6278:
URL: https://github.com/apache/hudi/issues/6278
We upgraded our Hudi spark-submit jobs from EMR 5.33 to EMR 6.5, which runs
Spark 3.x, and then started hitting the errors below on date and timestamp
columns. Please let us know if anyone has faced a similar issue and whether
there is a resolution.
spark-submit \
--deploy-mode client \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
--conf spark.shuffle.service.enabled=true \
--conf spark.default.parallelism=500 \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.initialExecutors=3 \
--conf spark.dynamicAllocation.cachedExecutorIdleTimeout=90s \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.app.name=ETS_CUST \
--jars /usr/lib/spark/external/lib/spark-avro.jar,/usr/lib/hudi/hudi-utilities-bundle.jar \
--table-type MERGE_ON_READ \
--op INSERT \
--hoodie-conf hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://localhost:10000 \
--source-ordering-field dms_seq_no \
--props s3://ets-aws-daas-prod-resource/config/TOEFL/DMEREG02/ETS_CUST/ets_cust_full.properties \
--hoodie-conf hoodie.datasource.hive_sync.database=ets_aws_daas_raw_toefl_dmereg02 \
--target-base-path s3://ets-aws-daas-prod-raw/TOEFL/DMEREG02/ETS_CUST \
--target-table ETS_CUST \
--transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
--hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://ets-aws-daas-prod-landing/DMS/FULL/DMEREG02/ETS_CUST/ \
--source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
--enable-sync
22/08/02 16:24:48 INFO DAGScheduler: ShuffleMapStage 3 (countByKey at
BaseSparkCommitActionExecutor.java:175) failed in 27.903 s due to Job aborted
due to stage failure: Task 53 in stage 3.0 failed 4 times, most recent failure:
Lost task 53.3 in stage 3.0 (TID 105) (ip-172-31-26-128.ec2.internal executor
3): org.apache.spark.SparkUpgradeException: You may get a different result due
to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps
before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files
may be written by Spark 2.x or legacy versions of Hive, which uses a legacy
hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian
calendar. See more details in SPARK-31404. You can set
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the
datetime values w.r.t. the calendar difference during reading. Or set
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the
datetime values as it is.
at org.apache.spark.sql.execution.datasources.DataSourceUtils$.newRebaseExceptionInRead(DataSourceUtils.scala:159)
at org.apache.spark.sql.execution.datasources.DataSourceUtils.newRebaseExceptionInRead(DataSourceUtils.scala)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readLongsWithRebase(VectorizedPlainValuesReader.java:147)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongsWithRebase(VectorizedRleValuesReader.java:399)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:587)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:297)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:295)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:193)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:159)
at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:614)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:907)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:349)
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1440)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
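As the exception itself suggests, a likely workaround (a sketch, not a confirmed fix) is to pass the datetime rebase confs on the same spark-submit. Which value is correct depends on which engine wrote the source parquet files: files written by Spark 2.x or legacy Hive generally call for LEGACY, while files already in the Proleptic Gregorian calendar (e.g. written by most non-Spark tools such as DMS) call for CORRECTED. The avro conf is an assumption added here because the MOR table writes Avro log files; it may not be needed for this pipeline.

```shell
# Sketch: add the rebase-mode confs named in the SparkUpgradeException to the
# existing spark-submit. LEGACY rebases pre-1582-10-15 dates / pre-1900
# timestamps from the hybrid Julian+Gregorian calendar; CORRECTED reads the
# stored values as-is.
spark-submit \
  --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=LEGACY \
  --conf spark.sql.legacy.avro.datetimeRebaseModeInRead=LEGACY \
  ...  # rest of the original command unchanged
```

Note that on Spark 3.1+ (EMR 6.5 ships Spark 3.1.2) the canonical conf names drop the `legacy` segment (`spark.sql.parquet.datetimeRebaseModeInRead`), but the names printed in the error message are still accepted as aliases. If the job also rewrites old data, the matching `...datetimeRebaseModeInWrite` conf may be needed as well.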