Ryan Pifer created HUDI-2972:
--------------------------------
Summary: Support different Spark internal Timestamp and Date types
Key: HUDI-2972
URL: https://issues.apache.org/jira/browse/HUDI-2972
Project: Apache Hudi
Issue Type: Improvement
Reporter: Ryan Pifer
Spark 3 added a configuration, {{spark.sql.datetime.java8API.enabled}},
which changes the internal Row type of Timestamp and Date values to *Instant*
and *LocalDate*, respectively.
https://issues.apache.org/jira/browse/SPARK-27008
In Spark 3.1 this is enabled by default through spark-sql, which breaks
writes that use Timestamps. It is also likely that this will be enabled by
default across all of Spark in the future, at which point it would become a
breaking issue.
Right now AvroConversionHelper
([ref|https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionHelper.scala#L301-L304])
and SqlKeyGenerator
([ref|https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/SqlKeyGenerator.scala])
cannot handle these types properly.
Error when the table is partitioned by a Timestamp column:
{code:java}
Caused by: java.lang.IllegalArgumentException: Invalid format: "2021-05-07T00:00:00Z" is malformed at "T00:00:00Z"
    at org.joda.time.format.DateTimeParserBucket.doParseMillis(DateTimeParserBucket.java:187)
    at org.joda.time.format.DateTimeFormatter.parseMillis(DateTimeFormatter.java:826)
    at org.apache.spark.sql.hudi.command.SqlKeyGenerator.$anonfun$convertPartitionPathToSqlType$1(SqlKeyGenerator.scala:94)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
    at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
    at scala.collection.TraversableLike.map(TraversableLike.scala:238)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
    at org.apache.spark.sql.hudi.command.SqlKeyGenerator.convertPartitionPathToSqlType(SqlKeyGenerator.scala:85)
    at org.apache.spark.sql.hudi.command.SqlKeyGenerator.getPartitionPath(SqlKeyGenerator.scala:115)
    at org.apache.spark.sql.UDFRegistration.$anonfun$register$352(UDFRegistration.scala:777)
{code}
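The trace above suggests the partition value arrives as the ISO-8601 string produced by Instant.toString() rather than the space-separated form produced by java.sql.Timestamp.toString(). A minimal sketch of a parser that tolerates both forms is shown below. The class and method names are hypothetical and not part of Hudi, and it uses java.time instead of the Joda formatter seen in the trace, so treat it as an illustration of the idea rather than the actual fix.

```java
import java.sql.Timestamp;
import java.time.Instant;
import java.time.format.DateTimeParseException;

/** Hypothetical helper (not part of Hudi) for partition values rendered from either type. */
final class PartitionTimestamps {
    private PartitionTimestamps() {}

    /** Returns epoch milliseconds for a partition value in either textual form. */
    static long parseToEpochMillis(String raw) {
        try {
            // Instant.toString() yields ISO-8601 text such as "2021-05-07T00:00:00Z".
            return Instant.parse(raw).toEpochMilli();
        } catch (DateTimeParseException notIso) {
            // Timestamp.toString() yields "yyyy-mm-dd hh:mm:ss[.fffffffff]",
            // which Timestamp.valueOf interprets in the JVM default time zone.
            // It throws IllegalArgumentException if this form does not match either.
            return Timestamp.valueOf(raw).getTime();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseToEpochMillis("2021-05-07T00:00:00Z"));
        System.out.println(parseToEpochMillis("2021-05-07 00:00:00"));
    }
}
```

Trying the ISO-8601 form first keeps the existing Timestamp behaviour unchanged for jobs that run with the Java 8 API setting disabled.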
Error on inserts with a Timestamp column:
{code:java}
21/10/21 18:14:17 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (ip-10-71-235-164.ec2.internal executor 20): java.lang.ClassCastException: java.time.Instant cannot be cast to java.sql.Timestamp
    at org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$8(AvroConversionHelper.scala:304)
    at org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$8$adapted(AvroConversionHelper.scala:304)
    at scala.Option.map(Option.scala:230)
    at org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$7(AvroConversionHelper.scala:304)
    at org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$15(AvroConversionHelper.scala:362)
    at org.apache.hudi.HoodieSparkUtils$.$anonfun$createRddInternal$3(HoodieSparkUtils.scala:138)
{code}
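The ClassCastException above comes from an unconditional cast to java.sql.Timestamp. One possible direction, sketched here under assumptions, is to branch on the runtime type and normalize both representations to the same epoch-based value. The class and method names are hypothetical and not part of Hudi. The choice of microseconds and days follows Avro's timestamp-micros and date logical types; which unit the converter actually needs is an assumption here.

```java
import java.sql.Date;
import java.sql.Timestamp;
import java.time.Instant;
import java.time.LocalDate;

/** Hypothetical helper (not part of Hudi): accepts either representation of a value. */
final class TemporalValues {
    private TemporalValues() {}

    /** Microseconds since the epoch for a java.sql.Timestamp or java.time.Instant. */
    static long toEpochMicros(Object value) {
        if (value instanceof Timestamp) {
            Timestamp ts = (Timestamp) value;
            // getNanos() holds the whole fractional second, so take whole seconds
            // from getTime() (floorDiv handles pre-epoch values) and add the fraction.
            long seconds = Math.floorDiv(ts.getTime(), 1000L);
            return Math.addExact(Math.multiplyExact(seconds, 1_000_000L), ts.getNanos() / 1_000L);
        }
        if (value instanceof Instant) {
            Instant instant = (Instant) value;
            return Math.addExact(
                Math.multiplyExact(instant.getEpochSecond(), 1_000_000L),
                instant.getNano() / 1_000L);
        }
        throw new IllegalArgumentException("Unsupported timestamp type: " + value);
    }

    /** Days since the epoch for a java.sql.Date or java.time.LocalDate. */
    static int toEpochDays(Object value) {
        if (value instanceof Date) {
            return Math.toIntExact(((Date) value).toLocalDate().toEpochDay());
        }
        if (value instanceof LocalDate) {
            return Math.toIntExact(((LocalDate) value).toEpochDay());
        }
        throw new IllegalArgumentException("Unsupported date type: " + value);
    }

    public static void main(String[] args) {
        Instant instant = Instant.parse("2021-05-07T00:00:00Z");
        // Both representations of the same moment map to the same value.
        System.out.println(toEpochMicros(instant) == toEpochMicros(Timestamp.from(instant)));
        System.out.println(toEpochDays(LocalDate.of(2021, 5, 7)) == toEpochDays(Date.valueOf("2021-05-07")));
    }
}
```

Dispatching on the runtime type means the same code path works whether or not {{spark.sql.datetime.java8API.enabled}} is set for the session.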