Ryan Pifer created HUDI-2972:
--------------------------------

             Summary: Support different Spark internal Timestamp and Date types
                 Key: HUDI-2972
                 URL: https://issues.apache.org/jira/browse/HUDI-2972
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Ryan Pifer


In Spark 3 a configuration was added, {{spark.sql.datetime.java8API.enabled}}, 
which changes the Java type that Rows use for Timestamp and Date values from 
java.sql.Timestamp and java.sql.Date to *java.time.Instant* and 
*java.time.LocalDate*, respectively. 

https://issues.apache.org/jira/browse/SPARK-27008

In Spark 3.1 this is enabled by default through spark-sql, which breaks 
writes using Timestamps. It is also likely that this could be enabled by default 
across all of Spark in the future, at which point this would become a breaking 
issue.

Right now AvroConversionHelper 
([ref|https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionHelper.scala#L301-L304])
 and SqlKeyGenerator 
([ref|https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/SqlKeyGenerator.scala])
 cannot handle these types properly.

When partitioned by Timestamp:
{code:java}
Caused by: java.lang.IllegalArgumentException: Invalid format: "2021-05-07T00:00:00Z" is malformed at "T00:00:00Z"
    at org.joda.time.format.DateTimeParserBucket.doParseMillis(DateTimeParserBucket.java:187)
    at org.joda.time.format.DateTimeFormatter.parseMillis(DateTimeFormatter.java:826)
    at org.apache.spark.sql.hudi.command.SqlKeyGenerator.$anonfun$convertPartitionPathToSqlType$1(SqlKeyGenerator.scala:94)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
    at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
    at scala.collection.TraversableLike.map(TraversableLike.scala:238)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
    at org.apache.spark.sql.hudi.command.SqlKeyGenerator.convertPartitionPathToSqlType(SqlKeyGenerator.scala:85)
    at org.apache.spark.sql.hudi.command.SqlKeyGenerator.getPartitionPath(SqlKeyGenerator.scala:115)
    at org.apache.spark.sql.UDFRegistration.$anonfun$register$352(UDFRegistration.scala:777)
{code}
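
For the partition-path failure above, one possible direction is a parser that tolerates both string forms. This is a minimal sketch, not a proposed patch: {{PartitionTimestamps}} and {{parseMillis}} are hypothetical names, it uses java.time rather than the Joda-Time formatter seen in the trace, and it assumes the legacy form is {{yyyy-MM-dd HH:mm:ss}} interpreted as UTC, which may not match what SqlKeyGenerator actually expects.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper, not part of Hudi.
final class PartitionTimestamps {
    // Assumed legacy SQL timestamp layout; the real pattern may differ.
    private static final DateTimeFormatter SQL_TIMESTAMP =
        DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss");

    private PartitionTimestamps() {}

    /** Parses either an ISO-8601 instant or the assumed SQL layout into epoch millis. */
    static long parseMillis(String text) {
        try {
            // Form produced when the value is a java.time.Instant, e.g. 2021-05-07T00:00:00Z
            return Instant.parse(text).toEpochMilli();
        } catch (DateTimeParseException notIso) {
            // Fall back to the SQL layout; treating it as UTC is an assumption.
            return LocalDateTime.parse(text, SQL_TIMESTAMP)
                .toInstant(ZoneOffset.UTC)
                .toEpochMilli();
        }
    }
}
```

Under these assumptions, {{parseMillis("2021-05-07T00:00:00Z")}} and {{parseMillis("2021-05-07 00:00:00")}} resolve to the same instant.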
Inserts with type Timestamp:
{code:java}
21/10/21 18:14:17 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (ip-10-71-235-164.ec2.internal executor 20): java.lang.ClassCastException: java.time.Instant cannot be cast to java.sql.Timestamp
    at org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$8(AvroConversionHelper.scala:304)
    at org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$8$adapted(AvroConversionHelper.scala:304)
    at scala.Option.map(Option.scala:230)
    at org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$7(AvroConversionHelper.scala:304)
    at org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$15(AvroConversionHelper.scala:362)
    at org.apache.hudi.HoodieSparkUtils$.$anonfun$createRddInternal$3(HoodieSparkUtils.scala:138)
{code}
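
For the insert failure above, one possible direction is to accept both the legacy and the Java 8 representation before encoding, instead of casting unconditionally. This is a minimal sketch, not a proposed patch: {{TemporalValues}} and its methods are hypothetical names, it is plain Java rather than the Scala used in AvroConversionHelper, and the target units (epoch microseconds and epoch days, as in Avro's timestamp-micros and date logical types) are an assumption about what the converter should produce.

```java
import java.sql.Date;
import java.sql.Timestamp;
import java.time.Instant;
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hypothetical helper, not part of Hudi.
final class TemporalValues {
    private TemporalValues() {}

    /** Microseconds since the epoch for either a java.time.Instant or a java.sql.Timestamp. */
    static long toEpochMicros(Object value) {
        if (value instanceof Instant) {
            // Representation used when spark.sql.datetime.java8API.enabled is true
            return ChronoUnit.MICROS.between(Instant.EPOCH, (Instant) value);
        }
        if (value instanceof Timestamp) {
            // Legacy representation
            return ChronoUnit.MICROS.between(Instant.EPOCH, ((Timestamp) value).toInstant());
        }
        throw new IllegalArgumentException("Unsupported timestamp value: " + value);
    }

    /** Days since the epoch for either a java.time.LocalDate or a java.sql.Date. */
    static int toEpochDays(Object value) {
        if (value instanceof LocalDate) {
            return Math.toIntExact(((LocalDate) value).toEpochDay());
        }
        if (value instanceof Date) {
            return Math.toIntExact(((Date) value).toLocalDate().toEpochDay());
        }
        throw new IllegalArgumentException("Unsupported date value: " + value);
    }
}
```

Dispatching on the runtime type keeps the conversion independent of how the session is configured, so the same code path would serve both spark-sql and DataFrame writes.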
 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
