[discuss][SQL] Partitioned column type inference proposal

Hyukjin Kwon Tue, 14 Nov 2017 06:55:17 -0800

Hi dev,

I would like to post a proposal about partitioned column type inference
(related to the 'spark.sql.sources.partitionColumnTypeInference.enabled'
configuration).


This thread focuses on type coercion (finding the common type) for
partition columns, in particular when values of different forms are
written to a partition column and then read back with type inference
enabled.


*Problem:*


// Case 1: a date string and a timestamp string in the same partition column
val df = Seq((1, "2015-01-01"), (2, "2016-01-01 00:00:00")).toDF("i", "ts")
df.write.format("parquet").partitionBy("ts").save("/tmp/foo")
spark.read.load("/tmp/foo").printSchema()

// Case 2: a small integer and a 30-digit number in the same partition column
val df2 = Seq((1, "1"), (2, "1" * 30)).toDF("i", "decimal")
df2.write.format("parquet").partitionBy("decimal").save("/tmp/bar")
spark.read.load("/tmp/bar").printSchema()



It currently returns:


root
 |-- i: integer (nullable = true)
 |-- ts: date (nullable = true)

root
 |-- i: integer (nullable = true)
 |-- decimal: integer (nullable = true)


The type coercion does not look well designed yet, and it currently has a
few holes, which is not ideal:


private val upCastingOrder: Seq[DataType] =
  Seq(NullType, IntegerType, LongType, FloatType, DoubleType, StringType)
...
literals.map(_.dataType).maxBy(upCastingOrder.indexOf(_))



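The current approach does not handle types that are outside of
upCastingOrder. indexOf returns -1 for such types, so they rank below every
listed type, and when none of the types is listed it simply returns the
first type as the coerced one.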
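
To make the hole concrete, here is a minimal sketch that runs without Spark.
The DataType objects below are simplified stand-ins (an assumption for
illustration, not Spark's real classes), and UpCastingDemo and resolve are
hypothetical names. Only the maxBy expression quoted above is reproduced;
anything Spark does before or after it is not modelled.

```scala
// Simplified stand-ins for Spark's DataType objects (assumption: these are
// NOT the real org.apache.spark.sql.types classes).
sealed trait DataType
case object NullType extends DataType
case object IntegerType extends DataType
case object LongType extends DataType
case object FloatType extends DataType
case object DoubleType extends DataType
case object StringType extends DataType
case object DateType extends DataType
case object TimestampType extends DataType

// Hypothetical wrapper around the expression quoted in this thread.
object UpCastingDemo {
  val upCastingOrder: Seq[DataType] =
    Seq(NullType, IntegerType, LongType, FloatType, DoubleType, StringType)

  // Types missing from upCastingOrder get index -1 and therefore rank
  // below every listed type; ties keep the first element.
  def resolve(types: Seq[DataType]): DataType =
    types.maxBy(upCastingOrder.indexOf(_))
}
```

With this sketch, UpCastingDemo.resolve(Seq(DateType, TimestampType)) gives
DateType (both are unlisted, so the first one wins), and
UpCastingDemo.resolve(Seq(DateType, IntegerType)) gives IntegerType (the
only listed type wins).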

This is being discussed in
https://github.com/apache/spark/pull/19389#discussion_r150426911, but I
would like more feedback from the community, as it is possibly a
breaking change.

For the current releases of Spark (2.2.0 <=), we support the types below
for partitioned column schema inference, according to my investigation -
https://github.com/apache/spark/pull/19389#discussion_r150528207:

  NullType
  IntegerType
  LongType
  DoubleType
  *DecimalType(...)
  DateType
  TimestampType
  StringType

  *DecimalType only when the value is bigger than what LongType can hold.

I believe this is something we should definitely fix.



*Proposal:*

I propose the change - https://github.com/apache/spark/pull/19389

Simply put, it reuses case 2 specified in
https://github.com/apache/spark/blob/6412ea1759d39a2380c572ec24cfd8ae4f2d81f7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala#L40-L43

Please refer to the chart I produced here -
https://github.com/apache/spark/pull/19389/files#r150528361. The current
proposal will bring type coercion behaviour changes in the cases
below:


Input types                         Old output type    New output type
[NullType, DecimalType(38,0)]       StringType         DecimalType(38,0)
[NullType, DateType]                StringType         DateType
[NullType, TimestampType]           StringType         TimestampType
[IntegerType, DecimalType(38,0)]    IntegerType        DecimalType(38,0)
[IntegerType, DateType]             IntegerType        StringType
[IntegerType, TimestampType]        IntegerType        StringType
[LongType, DecimalType(38,0)]       LongType           DecimalType(38,0)
[LongType, DateType]                LongType           StringType
[LongType, TimestampType]           LongType           StringType
[DoubleType, DateType]              DoubleType         StringType
[DoubleType, TimestampType]         DoubleType         StringType
[DecimalType(38,0), NullType]       StringType         DecimalType(38,0)
[DecimalType(38,0), IntegerType]    IntegerType        DecimalType(38,0)
[DecimalType(38,0), LongType]       LongType           DecimalType(38,0)
[DecimalType(38,0), DateType]       DecimalType(38,0)  StringType
[DecimalType(38,0), TimestampType]  DecimalType(38,0)  StringType
[DateType, NullType]                StringType         DateType
[DateType, IntegerType]             IntegerType        StringType
[DateType, LongType]                LongType           StringType
[DateType, DoubleType]              DoubleType         StringType
[DateType, DecimalType(38,0)]       DateType           StringType
[DateType, TimestampType]           DateType           TimestampType
[TimestampType, NullType]           StringType         TimestampType
[TimestampType, IntegerType]        IntegerType        StringType
[TimestampType, LongType]           LongType           StringType
[TimestampType, DoubleType]         DoubleType         StringType
[TimestampType, DecimalType(38,0)]  TimestampType      StringType
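
The rows above can be condensed into a few rules. The following is only a
sketch derived from the "New output type" column of this table, not the
code in the pull request: the DataType objects are simplified stand-ins,
DecimalType38 stands for DecimalType(38,0), and ProposedMergeSketch and
merge are hypothetical names. Pairs that do not appear in the table return
None, because the table says nothing about them.

```scala
// Simplified stand-ins for Spark's types (assumption, not the real classes).
sealed trait DataType
case object NullType extends DataType
case object IntegerType extends DataType
case object LongType extends DataType
case object DoubleType extends DataType
case object DecimalType38 extends DataType // stands for DecimalType(38,0)
case object DateType extends DataType
case object TimestampType extends DataType
case object StringType extends DataType

// Hypothetical helper that reproduces only the rows listed in the table.
object ProposedMergeSketch {
  private val integral: Set[DataType] = Set(IntegerType, LongType)
  private val numeric: Set[DataType] =
    Set(IntegerType, LongType, DoubleType, DecimalType38)
  private val temporal: Set[DataType] = Set(DateType, TimestampType)

  def merge(a: DataType, b: DataType): Option[DataType] = (a, b) match {
    // NullType gives way to the other type.
    case (NullType, t) if t == DecimalType38 || temporal(t) => Some(t)
    case (t, NullType) if t == DecimalType38 || temporal(t) => Some(t)
    // Integer and long widen to the decimal.
    case (x, DecimalType38) if integral(x) => Some(DecimalType38)
    case (DecimalType38, x) if integral(x) => Some(DecimalType38)
    // Date widens to timestamp.
    case (DateType, TimestampType) => Some(TimestampType)
    // Numeric mixed with date/timestamp falls back to string.
    case (x, t) if numeric(x) && temporal(t) => Some(StringType)
    case (t, x) if temporal(t) && numeric(x) => Some(StringType)
    // Not listed in the table: no claim is made here.
    case _ => None
  }
}
```

For example, ProposedMergeSketch.merge(IntegerType, DateType) gives
Some(StringType), where the old behaviour in the table is IntegerType.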



*Other possible suggestions:*

We could probably also consider simply making the merged type StringType
whenever there is any type conflict.
Alternatively, there could be stricter rules. I would like more opinions
from the community.
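
As a rough illustration of the first alternative (string on any conflict),
here is a minimal sketch. The DataType objects are simplified stand-ins and
StringFallbackSketch is a hypothetical name; how NullType or an empty input
should be treated is not decided in this thread, so the sketch simply treats
them like any other case.

```scala
// Simplified stand-ins for Spark's types (assumption, not the real classes).
sealed trait DataType
case object NullType extends DataType
case object IntegerType extends DataType
case object LongType extends DataType
case object DateType extends DataType
case object StringType extends DataType

// Hypothetical helper: keep the type only when every value agrees on it.
object StringFallbackSketch {
  def merge(types: Seq[DataType]): DataType = types.distinct match {
    case Seq(single) => single   // no conflict
    case _           => StringType // any conflict (or empty input)
  }
}
```

Under this rule, Seq(IntegerType, LongType) would already become
StringType, which is why it is simpler but more lossy than the proposal.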



*Questions:*

- Does the *Proposal:* above look good?

- If not, what would be the alternative?
