[
https://issues.apache.org/jira/browse/SPARK-36958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-36958.
----------------------------------
Resolution: Not A Problem
> Reading of legacy timestamps from Parquet confusing in Spark 3, related
> config values don't seem working
> --------------------------------------------------------------------------------------------------------
>
> Key: SPARK-36958
> URL: https://issues.apache.org/jira/browse/SPARK-36958
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.1.2
> Environment: emr-6.4.0
> spark 3.1.2
> Reporter: Dmitry Goldenberg
> Priority: Major
>
> I'm having a major issue running on Spark 3 when reading Parquet data
> that was generated with Spark 2.4.
> The full stack trace is below.
> The error message is very confusing:
> # I do not have dates that are before 1582-10-15 or timestamps before
> 1900-01-01T00:00:00Z
> # The documentation does not state clearly how to work around/fix this
> issue. What exactly is the difference between the LEGACY and CORRECTED values
> of the config settings?
> # Which of the following would I want to set, and to what values?
> - spark.sql.legacy.parquet.datetimeRebaseModeInWrite
> - spark.sql.legacy.parquet.datetimeRebaseModeInRead
> - spark.sql.legacy.parquet.int96RebaseModeInRead
> - spark.sql.legacy.parquet.int96RebaseModeInWrite
> - spark.sql.legacy.timeParserPolicy
> # I've tried setting these to CORRECTED, CORRECTED, CORRECTED, CORRECTED, and
> LEGACY, respectively, and got the same error (see the stack trace).
> The issues that I see with this:
> # Lack of thorough, clear documentation on what this is and how it's meant to
> work.
> # The confusing error message.
> # The fact that the error still occurs even when you set the config values.
>
> {quote}
> py4j.protocol.Py4JJavaError: An error occurred while calling o1134.count.
> : org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 8 in stage 36.0 failed 4 times, most recent failure: Lost task
> 8.3 in stage 36.0 (TID 619) (ip-10-2-251-59.awsinternal.audiomack.com
> executor 2): org.apache.spark.SparkUpgradeException: You may get a different
> result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or
> timestamps before 1900-01-01T00:00:00Z from Parquet INT96 files can be
> ambiguous, as the files may be written by Spark 2.x or legacy versions of
> Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s
> Proleptic Gregorian calendar. See more details in SPARK-31404. You can set
> spark.sql.legacy.parquet.int96RebaseModeInRead to 'LEGACY' to rebase the
> datetime values w.r.t. the calendar difference during reading. Or set
> spark.sql.legacy.parquet.int96RebaseModeInRead to 'CORRECTED' to read the
> datetime values as it is.
>   at org.apache.spark.sql.execution.datasources.DataSourceUtils$.newRebaseExceptionInRead(DataSourceUtils.scala:159)
>   at org.apache.spark.sql.execution.datasources.DataSourceUtils.newRebaseExceptionInRead(DataSourceUtils.scala)
>   at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.rebaseTimestamp(VectorizedColumnReader.java:228)
>   at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.rebaseInt96(VectorizedColumnReader.java:242)
>   at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:662)
>   at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:300)
>   at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:295)
>   at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:193)
>   at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:159)
>   at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:614)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:832)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:179)
>   at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {quote}
>
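
For reference, the two modes named in the error message above could be applied when the session is built. The following is a minimal PySpark sketch, not a confirmed fix for this report. It assumes the two read-side keys quoted in the issue are the relevant ones on Spark 3.1.x, and that the setting has to be in effect on the session that performs the read. The helper names (rebase_read_confs, build_session) and the application name are hypothetical. Per the error text, LEGACY rebases the stored values to account for the calendar difference while reading, and CORRECTED reads them as they are.

```python
# Read-side rebase keys quoted in the issue (Spark 3.1.x names).
READ_REBASE_KEYS = (
    "spark.sql.legacy.parquet.datetimeRebaseModeInRead",
    "spark.sql.legacy.parquet.int96RebaseModeInRead",
)

# LEGACY and CORRECTED are named in the error message; EXCEPTION is the
# mode under which Spark raises SparkUpgradeException for ambiguous values.
VALID_MODES = ("EXCEPTION", "LEGACY", "CORRECTED")


def rebase_read_confs(mode):
    """Return a dict mapping each read-side rebase key to `mode`."""
    mode = mode.upper()
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    return {key: mode for key in READ_REBASE_KEYS}


def build_session(mode, app_name="legacy-parquet-read"):
    """Build (or reuse) a SparkSession with the read-side rebase mode set.

    Hypothetical helper: pyspark is imported lazily so that
    rebase_read_confs() stays usable without a Spark installation.
    """
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in rebase_read_confs(mode).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

A caller would then do something like spark = build_session("LEGACY") followed by spark.read.parquet(...) on the Spark 2.4 output. Which mode is right depends on how the files were written: LEGACY is the one the error message suggests for files produced by Spark 2.x or legacy Hive.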