OK, let me put a proposal here:

1. Permanently ban CHAR for native data source tables, and only keep it for
Hive compatibility.
It would be OK to forget about padding, as Snowflake and MySQL have done, but
it's hard for Spark to require consistent CHAR behavior across all data
sources. Since the CHAR type is not that useful nowadays, it seems OK to just
ban it. Another option is to document that the padding of CHAR values is data
source dependent, but it's a bit weird to leave this inconsistency in Spark.
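To illustrate the inconsistency, here is roughly what a user can observe today
(the table names are made up, and the exact results depend on the Spark
version and whether Hive support is enabled):

  // Illustrative only: a Hive-compatible table vs. a native data source table.
  spark.sql("CREATE TABLE hive_tbl (c CHAR(5)) STORED AS PARQUET")
  spark.sql("CREATE TABLE native_tbl (c CHAR(5)) USING parquet")

  spark.sql("INSERT INTO hive_tbl VALUES ('ab')")
  spark.sql("INSERT INTO native_tbl VALUES ('ab')")

  // The Hive-compatible table may read the value back padded to length 5
  // ('ab   '), while the native table returns exactly what was written ('ab').
  spark.sql("SELECT c, length(c) FROM hive_tbl").show()
  spark.sql("SELECT c, length(c) FROM native_tbl").show()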

2. Leave VARCHAR unchanged in 3.0
The VARCHAR type is widely used in databases, and it would be strange if Spark
didn't support it. VARCHAR(x) is exactly the same as Spark's StringType as long
as the length limit is not hit. I'm fine with temporarily leaving this flaw in
3.0; users may hit behavior changes when string values exceed the VARCHAR
length limit.
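A rough sketch of what I mean (assuming a native Parquet table; the behavior
when the limit is exceeded is data source dependent in 3.0):

  spark.sql("CREATE TABLE t (c VARCHAR(3)) USING parquet")

  // Within the length limit: behaves exactly like StringType.
  spark.sql("INSERT INTO t VALUES ('abc')")

  // Over the limit: a native table may simply store the full string, while
  // other sources (e.g. a JDBC database) may truncate or fail.
  spark.sql("INSERT INTO t VALUES ('abcdef')")

  spark.sql("SELECT c, length(c) FROM t").show()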

3. Finalize the VARCHAR behavior in 3.1
For now I have two ideas:
a) Make VARCHAR(x) a first-class data type. This means Spark data sources
should support VARCHAR, and CREATE TABLE should fail if a column is of VARCHAR
type and the underlying data source doesn't support it (e.g. JSON/CSV). Type
cast, type coercion, table insertion, etc. should be updated as well (see the
rough sketch after this list).
b) Simply document that the underlying data source may or may not enforce the
length limitation of VARCHAR(x).
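A hypothetical sketch of what option a) could look like (nothing here is
implemented, and the error message and exact semantics are made up for
illustration):

  spark.sql("CREATE TABLE t1 (c VARCHAR(10)) USING parquet") // OK: source can support VARCHAR
  spark.sql("CREATE TABLE t2 (c VARCHAR(10)) USING csv")     // would fail under option a)
  // => AnalysisException: data source 'csv' does not support VARCHAR columns

  // Table insertion would enforce the limit (fail vs. truncate is to be decided):
  spark.sql("INSERT INTO t1 VALUES ('a string that is longer than ten characters')")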

Please let me know if you have different ideas.

Thanks,
Wenchen

On Wed, Mar 18, 2020 at 1:08 AM Michael Armbrust <mich...@databricks.com>
wrote:

> What I'd oppose is to just ban char for the native data sources, and do
>> not have a plan to address this problem systematically.
>>
>
> +1
>
>
>> Just forget about padding, like what Snowflake and MySQL have done.
>> Document that char(x) is just an alias for string. And then move on. Almost
>> no work needs to be done...
>>
>
> +1
>
>
