I agree it sucks. We started with a decision that might have made sense back 
in 2013 (let's use Hive as the default source and, on top of that, pick the 
slowest possible serde by default). We have been paying that debt ever since.
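
For what it's worth, a minimal sketch of that default (Spark SQL; the table 
names are illustrative, and a Hive-enabled session is assumed):

    -- Before SPARK-30098, omitting the provider fell back to the slow
    -- Hive text serde described above:
    CREATE TABLE t_legacy (c INT);

    -- Spelling out the provider opts into a native source instead:
    CREATE TABLE t_native (c INT) USING parquet;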

Thanks for bringing this thread up though. We don't have a clear solution yet, 
but at least it made a lot of people aware of the issues.

On Thu, Mar 19, 2020 at 8:56 PM, Dongjoon Hyun <dongjoon.h...@gmail.com> 
wrote:

> 
> Technically, I have been suffering with (1) `CREATE TABLE` due to its many
> differences for a long time (since 2017). So I had a wrong assumption about
> the implication of (2) "FYI: SPARK-30098 Use default datasource as provider
> for CREATE TABLE syntax", Reynold. I admit that. You may not feel the same
> way; however, it was a lot to me. Also, switching `convertMetastoreOrc` in
> 2.4 was a big change to me, although there is no difference for Parquet-only
> users.
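> 
> As a concrete reference, a minimal sketch of that 2.4 switch (the config
> name is real; the session-level SET below is only illustrative):
> 
>     -- Defaults to true since 2.4, so Hive ORC tables are read with
>     -- Spark's native ORC reader:
>     SET spark.sql.hive.convertMetastoreOrc=false;  -- restores the pre-2.4 behavior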
> 
> 
> Dongjoon.
> 
> 
> > References:
> > 1. "CHAR implementation?", 2017/09/15
> >    https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
> > 2. "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE syntax", 2019/12/06
> >    https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
> 
> On Thu, Mar 19, 2020 at 8:47 PM Reynold Xin <rxin@databricks.com> wrote:
> 
> 
>> You are joking when you said "informed widely and discussed in many ways
>> twice," right?
>> 
>> 
>> 
>> This thread doesn't even talk about char/varchar:
>> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>> 
>> 
>> 
>> (Yes, it talked about changing the default data source provider, but that's
>> just one of the ways we are exposing this char/varchar issue.)
>> 
>> On Thu, Mar 19, 2020 at 8:41 PM, Dongjoon Hyun <dongjoon.hyun@gmail.com>
>> wrote:
>> 
>>> +1 for Wenchen's suggestion.
>>> 
>>> I believe that the differences and effects were informed widely and
>>> discussed in many ways twice.
>>> 
>>> First, this was shared last December.
>>> 
>>>     "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE
>>> syntax", 2019/12/06
>>>     https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>>> 
>>> Second (this time, in this thread), it has been discussed according to the
>>> new community rubric.
>>> 
>>>     - https://spark.apache.org/versioning-policy.html (Section:
>>> "Considerations When Breaking APIs")
>>> 
>>> 
>>> Thank you all.
>>> 
>>> 
>>> Bests,
>>> Dongjoon.
>>> 
>>> On Tue, Mar 17, 2020 at 10:41 PM Wenchen Fan <cloud0fan@gmail.com> wrote:
>>> 
>>> 
>>>> OK let me put a proposal here:
>>>> 
>>>> 
>>>> 1. Permanently ban CHAR for native data source tables, and only keep it
>>>> for Hive compatibility.
>>>> It's OK to forget about padding, as Snowflake and MySQL have done, but it's
>>>> hard for Spark to require consistent CHAR behavior across all data sources.
>>>> Since the CHAR type is not that useful nowadays, it seems OK to just ban it.
>>>> Another option is to document that CHAR padding is data-source dependent,
>>>> but it's a bit weird to leave this inconsistency in Spark.
>>>> 
>>>> 
>>>> 2. Leave VARCHAR unchanged in 3.0
>>>> The VARCHAR type is so widely used in databases that it would be weird if
>>>> Spark didn't support it. VARCHAR behaves exactly like Spark's StringType as
>>>> long as the length limit is not hit, so I'm fine with temporarily leaving
>>>> this flaw in 3.0; users may see behavior changes when string values exceed
>>>> the VARCHAR length limit.
>>>> 
>>>> 
>>>> 3. Finalize the VARCHAR behavior in 3.1
>>>> For now I have two ideas (sketched below):
>>>> a) Make VARCHAR(x) a first-class data type. This means Spark data sources
>>>> should support VARCHAR, and CREATE TABLE should fail if a column is of
>>>> VARCHAR type and the underlying data source doesn't support it (e.g.
>>>> JSON/CSV). Type casts, type coercion, table insertion, etc. should be
>>>> updated as well.
>>>> b) Simply document that the underlying data source may or may not enforce
>>>> the length limitation of VARCHAR(x).
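>>>> 
>>>> To make the two options concrete, a hypothetical sketch (none of this is
>>>> implemented; the error message is purely illustrative):
>>>> 
>>>>     -- Under (a), VARCHAR is first-class: this succeeds because Parquet
>>>>     -- can carry the declared limit...
>>>>     CREATE TABLE t1 (c VARCHAR(10)) USING parquet;
>>>> 
>>>>     -- ...while this would fail, since CSV cannot enforce it:
>>>>     CREATE TABLE t2 (c VARCHAR(10)) USING csv;
>>>>     -- ERROR: data source 'csv' does not support VARCHAR(10)
>>>> 
>>>>     -- Under (b), both statements succeed, and whether an 11-character
>>>>     -- value is rejected on insert is data-source dependent.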
>>>> 
>>>> 
>>>> Please let me know if you have different ideas.
>>>> 
>>>> 
>>>> Thanks,
>>>> Wenchen
>>>> 
>>>> On Wed, Mar 18, 2020 at 1:08 AM Michael Armbrust <michael@databricks.com>
>>>> wrote:
>>>> 
>>>> 
>>>>> 
>>>>>> What I'd oppose is just banning char for the native data sources without
>>>>>> a plan to address this problem systematically.
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> +1
>>>>> 
>>>>>  
>>>>> 
>>>>>> Just forget about padding, as Snowflake and MySQL have done. Document
>>>>>> that char(x) is just an alias for string, and then move on. Almost no
>>>>>> work needs to be done...
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> +1 
