Thanks, all, for the feedback.

> 1. when merging NullType with another type, the result should always be
that type.
> 2. when merging StringType with another type, the result should always be
StringType.
> 3. when merging integral types, the priority from high to low:
DecimalType, LongType, IntegerType. This is because DecimalType is used as
a big integer when parsing partition column values.
> 4. DoubleType can't be merged with other types, except DoubleType itself.
> 5. when merging TimestampType with DateType, return TimestampType.

> Rather than a list, can we create a table? X-axis is data type, and
Y-axis is also data type, and the intersection explains what the coerced
type is?


Here are the tables I produced:

*Before*
| InputA \ InputB | NullType | IntegerType | LongType | DoubleType | DecimalType(38,0) | DateType | TimestampType | StringType |
| *NullType* | StringType | IntegerType | LongType | DoubleType | StringType | StringType | StringType | StringType |
| *IntegerType* | IntegerType | IntegerType | LongType | DoubleType | IntegerType | IntegerType | IntegerType | StringType |
| *LongType* | LongType | LongType | LongType | DoubleType | LongType | LongType | LongType | StringType |
| *DoubleType* | DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | StringType |
| *DecimalType(38,0)* | StringType | IntegerType | LongType | DoubleType | DecimalType(38,0) | DecimalType(38,0) | DecimalType(38,0) | StringType |
| *DateType* | StringType | IntegerType | LongType | DoubleType | DateType | DateType | DateType | StringType |
| *TimestampType* | StringType | IntegerType | LongType | DoubleType | TimestampType | TimestampType | TimestampType | StringType |
| *StringType* | StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType |

*After*
| InputA \ InputB | NullType | IntegerType | LongType | DoubleType | DecimalType(38,0) | DateType | TimestampType | StringType |
| *NullType* | StringType | IntegerType | LongType | DoubleType | DecimalType(38,0) | DateType | TimestampType | StringType |
| *IntegerType* | IntegerType | IntegerType | LongType | DoubleType | DecimalType(38,0) | StringType | StringType | StringType |
| *LongType* | LongType | LongType | LongType | DoubleType | DecimalType(38,0) | StringType | StringType | StringType |
| *DoubleType* | DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | StringType | StringType | StringType |
| *DecimalType(38,0)* | DecimalType(38,0) | DecimalType(38,0) | DecimalType(38,0) | DoubleType | DecimalType(38,0) | StringType | StringType | StringType |
| *DateType* | DateType | StringType | StringType | StringType | StringType | DateType | TimestampType | StringType |
| *TimestampType* | TimestampType | StringType | StringType | StringType | StringType | TimestampType | TimestampType | StringType |
| *StringType* | StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType |
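
To make the *After* matrix concrete, here is a rough Python sketch of Wenchen's
rules 1 to 5 as a pairwise merge function. This is hypothetical illustration
only (the PR's actual code is Scala in TypeCoercion); type names are plain
strings standing in for Spark's DataType objects, and edge cases such as
NullType merged with itself may differ slightly from the table:

```python
# Sketch of the proposed partition-column type merging rules (illustrative,
# not Spark's actual implementation).

STRING = "StringType"
NULL = "NullType"
# Rule 3: integral widening priority, low -> high (Decimal acts as big integer).
INTEGRAL_PRIORITY = ["IntegerType", "LongType", "DecimalType(38,0)"]

def merge_partition_types(a, b):
    if a == b:
        return a
    # Rule 1: NullType merged with another type yields that type.
    if a == NULL:
        return b
    if b == NULL:
        return a
    # Rule 2: StringType merged with anything yields StringType.
    if STRING in (a, b):
        return STRING
    # Rule 3: integral types widen by priority.
    if a in INTEGRAL_PRIORITY and b in INTEGRAL_PRIORITY:
        return max(a, b, key=INTEGRAL_PRIORITY.index)
    # Rule 4 caveat: the After table still widens integrals with DoubleType
    # to DoubleType; other DoubleType conflicts fall back to StringType.
    if "DoubleType" in (a, b):
        other = b if a == "DoubleType" else a
        if other in INTEGRAL_PRIORITY:
            return "DoubleType"
        return STRING
    # Rule 5: DateType with TimestampType widens to TimestampType.
    if {a, b} == {"DateType", "TimestampType"}:
        return "TimestampType"
    # Any remaining conflict falls back to StringType.
    return STRING
```

For example, `merge_partition_types("IntegerType", "DateType")` gives
StringType, matching the *After* table rather than the arbitrary IntegerType
the *Before* table shows.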

The *After* table follows Wenchen's comments (1 to 5), and I also updated the
PR description with the code I used to generate these tables.


> Can we also look at what Hive, standard SQL (Postgres?) do?
> Also, this shouldn't be isolated to partition column inference.
> We should make sure most of the type coercions are consistent across
different functionalities, with the caveat that we need to preserve
backward compatibility.


Sure. If I understood correctly, we preserve backward compatibility for
improvements, but not, in general, for bug fixes.


There are probably two things to be done now:

  - Deduplicate the type coercion logic and fix the obvious bugs that don't
make sense (e.g., decimal and timestamp ending up as decimal).

  - Improve the deduplicated type coercion to follow other systems such as
Hive and Postgres, or at least make it coherent by referring to them.
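
As an aside on the first item: the "obvious bug" comes from how the current
inference picks a winner with upCastingOrder. A minimal Python sketch
(mirroring the Scala `literals.map(_.dataType).maxBy(upCastingOrder.indexOf(_))`
in the current code; type names are strings for illustration) shows why types
outside the order are picked arbitrarily:

```python
# The current upCastingOrder from PartitioningUtils (FloatType included,
# although floats are not actually inferred for partition columns).
up_casting_order = ["NullType", "IntegerType", "LongType", "FloatType",
                    "DoubleType", "StringType"]

def current_pick(types):
    # Emulate Scala's indexOf, which returns -1 for elements not in the list.
    def index_of(t):
        return up_casting_order.index(t) if t in up_casting_order else -1
    # max() keeps the first element among ties, just like Scala's maxBy,
    # so two unknown types (both scoring -1) resolve by input order.
    return max(types, key=index_of)

# Both types are outside the order, so whichever comes first "wins" --
# e.g. decimal and timestamp end up as decimal.
print(current_pick(["DecimalType(38,0)", "TimestampType"]))  # DecimalType(38,0)
print(current_pick(["TimestampType", "DecimalType(38,0)"]))  # TimestampType
```

The result depends purely on which partition value happened to be seen first,
which is the order-dependent behaviour the PR tries to remove.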


I think my PR focuses on the former, because the current partition column
inference is already isolated, and I am proposing to put that type coercion
logic into one place in TypeCoercion. Maybe we could also accept some
divergence as exceptions, given the nature of this functionality (although I
agree that most type coercions should be consistent across different
functionalities in general).

If the deduplicated type coercion itself should later be changed across
functionalities as an improvement, that change should preserve backward
compatibility, if I understood correctly, and requires taking a look at other
systems.

I am willing to do the latter soon; however, it will probably take me quite a
while to investigate and propose a change.

So, in the meantime, could we proceed with this PR separately, or is there
something I missed?



2017-11-15 7:29 GMT+09:00 Reynold Xin <r...@databricks.com>:

> Most of those thoughts from Wenchen make sense to me.
>
>
> Rather than a list, can we create a table? X-axis is data type, and Y-axis
> is also data type, and the intersection explains what the coerced type is?
> Can we also look at what Hive, standard SQL (Postgres?) do?
>
>
> Also, this shouldn't be isolated to partition column inference. We should
> make sure most of the type coercions are consistent across different
> functionalities, with the caveat that we need to preserve backward
> compatibility.
>
>
>
> On Tue, Nov 14, 2017 at 8:33 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> My 2 cents:
>>
>> 1. when merging NullType with another type, the result should always be
>> that type.
>> 2. when merging StringType with another type, the result should always be
>> StringType.
>> 3. when merging integral types, the priority from high to low:
>> DecimalType, LongType, IntegerType. This is because DecimalType is used as
>> a big integer when parsing partition column values.
>> 4. DoubleType can't be merged with other types, except DoubleType itself.
>> 5. when merging TimestampType with DateType, return TimestampType.
>>
>>
>> On Tue, Nov 14, 2017 at 3:54 PM, Hyukjin Kwon <gurwls...@gmail.com>
>> wrote:
>>
>>> Hi dev,
>>>
>>> I would like to post a proposal about partitioned column type inference
>>> (related with 'spark.sql.sources.partitionColumnTypeInference.enabled'
>>> configuration).
>>>
>>> This thread focuses on the type coercion (finding the common type) in
>>> partitioned columns, in particular, when different forms of data are
>>> inserted for the partition column and then read back with type
>>> inference.
>>>
>>>
>>> *Problem:*
>>>
>>>
>>> val df = Seq((1, "2015-01-01"), (2, "2016-01-01 00:00:00")).toDF("i", "ts")
>>> df.write.format("parquet").partitionBy("ts").save("/tmp/foo")
>>> spark.read.load("/tmp/foo").printSchema()
>>> val df = Seq((1, "1"), (2, "1" * 30)).toDF("i", "decimal")
>>> df.write.format("parquet").partitionBy("decimal").save("/tmp/bar")
>>> spark.read.load("/tmp/bar").printSchema()
>>>
>>>
>>>
>>> It currently returns:
>>>
>>>
>>> root
>>>  |-- i: integer (nullable = true)
>>>  |-- ts: date (nullable = true)
>>>
>>> root
>>>  |-- i: integer (nullable = true)
>>>  |-- decimal: integer (nullable = true)
>>>
>>>
>>> The type coercion does not look well designed yet, and currently there
>>> are a few holes, which is not ideal:
>>>
>>>
>>> private val upCastingOrder: Seq[DataType] =
>>>   Seq(NullType, IntegerType, LongType, FloatType, DoubleType, StringType)
>>> ...
>>> literals.map(_.dataType).maxBy(upCastingOrder.indexOf(_))
>>>
>>>
>>>
>>> The current way does not handle types outside of the upCastingOrder; it
>>> just returns the first type as the coerced one.
>>>
>>> This has been discussed in
>>> https://github.com/apache/spark/pull/19389#discussion_r150426911, but I
>>> would like to have more feedback from the community as it is possibly a
>>> breaking change.
>>>
>>> For the current releases of Spark (2.2.0 and earlier), we support the
>>> types below for partitioned column schema inference, given my
>>> investigation -
>>> https://github.com/apache/spark/pull/19389#discussion_r150528207:
>>>
>>>   NullType
>>>   IntegerType
>>>   LongType
>>>   DoubleType
>>>   DecimalType(...)*
>>>   DateType
>>>   TimestampType
>>>   StringType
>>>
>>>   * DecimalType only when it is bigger than LongType.
>>>
>>> I believe this is something we should definitely fix.
>>>
>>>
>>>
>>> *Proposal:*
>>>
>>> I propose the change - https://github.com/apache/spark/pull/19389
>>>
>>> Simply, it reuses the case 2 specified in
>>> https://github.com/apache/spark/blob/6412ea1759d39a2380c572ec24cfd8ae4f2d81f7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala#L40-L43
>>>
>>> Please refer to the chart I produced here -
>>> https://github.com/apache/spark/pull/19389/files#r150528361. The current
>>> proposal will bring the type coercion behaviour changes in the cases
>>> below:
>>>
>>>
>>> Input types | Old output type | New output type
>>> [NullType, DecimalType(38,0)] | StringType | DecimalType(38,0)
>>> [NullType, DateType] | StringType | DateType
>>> [NullType, TimestampType] | StringType | TimestampType
>>> [IntegerType, DecimalType(38,0)] | IntegerType | DecimalType(38,0)
>>> [IntegerType, DateType] | IntegerType | StringType
>>> [IntegerType, TimestampType] | IntegerType | StringType
>>> [LongType, DecimalType(38,0)] | LongType | DecimalType(38,0)
>>> [LongType, DateType] | LongType | StringType
>>> [LongType, TimestampType] | LongType | StringType
>>> [DoubleType, DateType] | DoubleType | StringType
>>> [DoubleType, TimestampType] | DoubleType | StringType
>>> [DecimalType(38,0), NullType] | StringType | DecimalType(38,0)
>>> [DecimalType(38,0), IntegerType] | IntegerType | DecimalType(38,0)
>>> [DecimalType(38,0), LongType] | LongType | DecimalType(38,0)
>>> [DecimalType(38,0), DateType] | DecimalType(38,0) | StringType
>>> [DecimalType(38,0), TimestampType] | DecimalType(38,0) | StringType
>>> [DateType, NullType] | StringType | DateType
>>> [DateType, IntegerType] | IntegerType | StringType
>>> [DateType, LongType] | LongType | StringType
>>> [DateType, DoubleType] | DoubleType | StringType
>>> [DateType, DecimalType(38,0)] | DateType | StringType
>>> [DateType, TimestampType] | DateType | TimestampType
>>> [TimestampType, NullType] | StringType | TimestampType
>>> [TimestampType, IntegerType] | IntegerType | StringType
>>> [TimestampType, LongType] | LongType | StringType
>>> [TimestampType, DoubleType] | DoubleType | StringType
>>> [TimestampType, DecimalType(38,0)] | TimestampType | StringType
>>>
>>>
>>>
>>> *Other possible suggestions**:*
>>>
>>> We could also consider simply falling back to StringType whenever there
>>> is any type conflict. Alternatively, there could be stricter rules. I
>>> would like more opinions from the community.
>>>
>>>
>>>
>>> *Questions:*
>>>
>>> - Does the *Proposal* above look good?
>>>
>>> - If not, what would be the alternative?
>>>
>>>
>>>
>>>
>>>
>>
>
