[jira] [Resolved] (SPARK-36958) Reading of legacy timestamps from Parquet confusing in Spark 3, related config values don't seem working

Hyukjin Kwon (Jira) Mon, 11 Oct 2021 19:41:26 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-36958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Hyukjin Kwon resolved SPARK-36958.
----------------------------------
    Resolution: Not A Problem

> Reading of legacy timestamps from Parquet confusing in Spark 3, related 
> config values don't seem working
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-36958
>                 URL: https://issues.apache.org/jira/browse/SPARK-36958
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.1.2
>         Environment: emr-6.4.0
> spark 3.1.2
>            Reporter: Dmitry Goldenberg
>            Priority: Major
>
> I'm having a major issue in Spark 3 when reading Parquet data that was 
> generated with Spark 2.4.
> The full stack trace is below.
> The error message is very confusing:
>  # I do not have dates that are before 1582-10-15 or timestamps that are 
> before 1900-01-01T00:00:00Z
>  # The documentation does not state clearly how to work around/fix this 
> issue. What exactly is the difference between the LEGACY and CORRECTED values 
> of the config settings?
>  # Which of the following would I want to set and to what values?
> - spark.sql.legacy.parquet.datetimeRebaseModeInWrite
> - spark.sql.legacy.parquet.datetimeRebaseModeInRead
> - spark.sql.legacy.parquet.int96RebaseModeInRead
> - spark.sql.legacy.parquet.int96RebaseModeInWrite
> - spark.sql.legacy.timeParserPolicy
>  # I've tried setting these to CORRECTED,CORRECTED,CORRECTED,CORRECTED, and 
> LEGACY, respectively, and got the same error (see the stack trace).
> The issues that I see with this:
>  # Lack of thorough, clear documentation on what this is and how it's meant 
> to work.
>  # The confusing error message.
>  # The fact that the error still occurs even when you set the config values.
>  
> {quote}
> py4j.protocol.Py4JJavaError: An error occurred while calling o1134.count.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 
> in stage 36.0 failed 4 times, most recent failure: Lost task 8.3 in stage 
> 36.0 (TID 619) (ip-10-2-251-59.awsinternal.audiomack.com executor 2): 
> org.apache.spark.SparkUpgradeException: You may get a different result due 
> to the upgrading of Spark 3.0: reading dates before 1582-10-15 or 
> timestamps before 1900-01-01T00:00:00Z from Parquet INT96 files can be 
> ambiguous, as the files may be written by Spark 2.x or legacy versions of 
> Hive, which uses a legacy hybrid calendar that is different from Spark 
> 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You 
> can set spark.sql.legacy.parquet.int96RebaseModeInRead to 'LEGACY' to 
> rebase the datetime values w.r.t. the calendar difference during reading. 
> Or set spark.sql.legacy.parquet.int96RebaseModeInRead to 'CORRECTED' to 
> read the datetime values as it is.
>     at org.apache.spark.sql.execution.datasources.DataSourceUtils$.newRebaseExceptionInRead(DataSourceUtils.scala:159)
>     at org.apache.spark.sql.execution.datasources.DataSourceUtils.newRebaseExceptionInRead(DataSourceUtils.scala)
>     at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.rebaseTimestamp(VectorizedColumnReader.java:228)
>     at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.rebaseInt96(VectorizedColumnReader.java:242)
>     at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:662)
>     at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:300)
>     at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:295)
>     at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:193)
>     at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:159)
>     at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:614)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:832)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>     at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:179)
>     at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>     at org.apache.spark.scheduler.Task.run(Task.scala:131)
>     at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> {quote}
>  
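
For readers landing on this thread: the exception text quoted above names
spark.sql.legacy.parquet.int96RebaseModeInRead and the values 'LEGACY' and
'CORRECTED'. The following is only a minimal sketch of how the two read-side
settings listed in the report might be applied from PySpark. The helper names
(read_rebase_confs, apply_read_rebase) are hypothetical and not part of
Spark; the sketch assumes the settings are applied on the same session that
later performs the read, and it does not explain why the reporter's attempt
had no effect.

```python
# Config keys are copied verbatim from the report above.
# The helper names below are hypothetical, for illustration only.
READ_REBASE_KEYS = (
    "spark.sql.legacy.parquet.datetimeRebaseModeInRead",
    "spark.sql.legacy.parquet.int96RebaseModeInRead",
)

# The two values named in the exception message.
VALID_MODES = ("LEGACY", "CORRECTED")


def read_rebase_confs(mode):
    """Return {config key: mode} for the two Parquet read-side rebase settings."""
    mode = mode.upper()
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    return {key: mode for key in READ_REBASE_KEYS}


def apply_read_rebase(spark, mode):
    """Set the read-side rebase mode on any object exposing conf.set(key, value).

    A pyspark.sql.SparkSession satisfies this through spark.conf.set.
    Returns the dict of settings that were applied.
    """
    confs = read_rebase_confs(mode)
    for key, value in confs.items():
        spark.conf.set(key, value)
    return confs
```

With a real session this would be called as apply_read_rebase(spark, "LEGACY")
before spark.read.parquet(...). Per the exception text, 'LEGACY' rebases the
values for the calendar difference during reading, while 'CORRECTED' reads
them as they are.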


