Hi dev, I would like to post a proposal about partitioned column type inference (related with 'spark.sql.sources.partitionColumnTypeInference.enabled' configuration).
This thread focuses on the type coercion (finding the common type) in partitioned columns, in particular, when the different form of data is inserted for the partition column and then it is read back with the type inference. *Problem:* val df = Seq((1, "2015-01-01"), (2, "2016-01-01 00:00:00")).toDF("i", "ts") df.write.format("parquet").partitionBy("ts").save("/tmp/foo") spark.read.load("/tmp/foo").printSchema() val df = Seq((1, "1"), (2, "1" * 30)).toDF("i", "decimal") df.write.format("parquet").partitionBy("decimal").save("/tmp/bar") spark.read.load("/tmp/bar").printSchema() It currently returns: root |-- i: integer (nullable = true) |-- ts: date (nullable = true) root |-- i: integer (nullable = true) |-- decimal: integer (nullable = true) The type coercion looks less well designed yet and currently there are few holes which is not quite ideal: private val upCastingOrder: Seq[DataType] = Seq(NullType, IntegerType, LongType, FloatType, DoubleType, StringType) ... literals.map(_.dataType).maxBy(upCastingOrder.indexOf(_)) The current way does not deal with when the types are outside of the upCastingOrder. It just returns the first type, as the type coerced one. This has been being discussed in https://github.com/apache/spark/pull/19389#discussion_r150426911, but I would like to have more feedback from community as it possibly is a breaking change. For the current releases of Spark (2.2.0 <=), we support the types below for partitioned column schema inference, given my investigation - https://github.com/apache/spark/pull/19389#discussion_r150528207: NullType IntegerType LongType DoubleType, *DecimalType(...) DateType TimestampType StringType *DecimalType only when it's bigger than LongType: I believe this is something we should definitely fix. *Proposal:* I propose the change - https://github.com/apache/spark/pull/19389 Simply, it reuses the case 2 specified in https://github.com/apache/spark/blob/6412ea1759d39a2380c572ec24cfd8ae4f2d81f7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala#L40-L43 Please refer the chart I produced here - https://github.com/apache/spark/pull/19389/files#r150528361. The current proposal will brings the type coercion behaviour change in those cases below: Input typesOld output typeNew output type [NullType, DecimalType(38,0)] StringType DecimalType(38,0) [NullType, DateType] StringType DateType [NullType, TimestampType] StringType TimestampType [IntegerType, DecimalType(38,0)] IntegerType DecimalType(38,0) [IntegerType, DateType] IntegerType StringType [IntegerType, TimestampType] IntegerType StringType [LongType, DecimalType(38,0)] LongType DecimalType(38,0) [LongType, DateType] LongType StringType [LongType, TimestampType] LongType StringType [DoubleType, DateType] DoubleType StringType [DoubleType, TimestampType] DoubleType StringType [DecimalType(38,0), NullType] StringType DecimalType(38,0) [DecimalType(38,0), IntegerType] IntegerType DecimalType(38,0) [DecimalType(38,0), LongType] LongType DecimalType(38,0) [DecimalType(38,0), DateType] DecimalType(38,0) StringType [DecimalType(38,0), TimestampType] DecimalType(38,0) StringType [DateType, NullType] StringType DateType [DateType, IntegerType] IntegerType StringType [DateType, LongType] LongType StringType [DateType, DoubleType] DoubleType StringType [DateType, DecimalType(38,0)] DateType StringType [DateType, TimestampType] DateType TimestampType [TimestampType, NullType] StringType TimestampType [TimestampType, IntegerType] IntegerType StringType [TimestampType, LongType] LongType StringType [TimestampType, DoubleType] DoubleType StringType [TimestampType, DecimalType(38,0)] TimestampType StringType *Other possible suggestions**:* Probably, we could also consider simply making the merged type to string types when there are any type conflicts. Otherwise, there could be more stricter rules. I want more opinion from the community. *Questions:* - Does the *Proposal:* above looks good? - If not, what would be the alternative?