I have heard many complaints about the old table insertion behavior.
Blindly casting everything leaks user mistakes to a late stage of the
data pipeline and makes them very hard to debug. When a user writes
string values to an int column, it's probably a mistake, e.g. the columns
are misordered in the INSERT statement. We should fail the query earlier
and ask the user to fix the mistake.
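
A minimal sketch of that kind of mistake (table and column names are made
up for illustration; "spark" is the usual SparkSession in a shell):

    spark.sql("CREATE TABLE events (id INT, name STRING) USING parquet")

    // The user meant to write (id, name) but swapped the values. With the
    // old behavior, 'click' is blindly cast to INT and silently becomes
    // NULL, so the mistake only shows up much later in the pipeline.
    spark.sql("INSERT INTO events VALUES ('click', 42)")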

At the same time, I agree that the new table insertion behavior we
introduced for Data Source V2 is too strict: it may unexpectedly fail
queries that are clearly valid.
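
A rough example of a query the UpCast-only rule rejects, assuming
BIGINT -> INT is not treated as an up-cast (table name is made up):

    spark.sql("CREATE TABLE counts (n INT) USING parquet")

    // Rejected at analysis time under the UpCast-only rule, even though
    // the value obviously fits in an INT.
    spark.sql("INSERT INTO counts SELECT CAST(1 AS BIGINT)")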

In general, I support the direction of following the ANSI SQL standard, but
I'd like to do it in 2 steps:
1. only add a cast when the ANSI store assignment rule is satisfied. This
should be the default behavior, and we should provide a legacy config to
restore the old behavior.
2. fail the cast operation at runtime if an overflow happens. AFAIK Marco
Gaido is already working on it. This will be guarded by a config as well,
and by default we still return null (see the sketch after this list).
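
A rough sketch of the two steps, with made-up config names purely for
illustration (the real names are not decided in this thread):

    // Step 1: ANSI-style store assignment by default; a legacy flag
    // restores the old "cast anything" behavior (config name hypothetical).
    spark.conf.set("spark.sql.legacy.storeAssignment", "true")

    // Step 2: an overflowing cast returns NULL by default; an opt-in flag
    // (also hypothetical) makes the cast fail at runtime instead.
    spark.conf.set("spark.sql.failOnCastOverflow", "true")
    spark.sql("SELECT CAST(12345678901234 AS INT)")  // NULL by default, runtime error with the flag on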

After doing this, the default behavior will still differ slightly from the
SQL standard (a cast can return null), and users can turn on the ANSI mode
to fully follow the standard. This is much better than before and should
prevent a lot of user mistakes. Not throwing exceptions at runtime by
default also seems reasonable to me, as runtime failures are usually bad
for long-running jobs.

Thanks,
Wenchen

On Thu, Jul 25, 2019 at 11:37 PM Gengliang Wang <
gengliang.w...@databricks.com> wrote:

> Hi everyone,
>
> I would like to discuss the table insertion behavior of Spark. In the
> current data source V2, only UpCast is allowed for table insertion. I think
> following ANSI SQL is a better idea.
> For more information, please read the Discuss: Follow ANSI SQL on table
> insertion
> <https://docs.google.com/document/d/1b9nnWWbKVDRp7lpzhQS1buv1_lDzWIZY2ApFs5rBcGI/edit?usp=sharing>
> Please let me know if you have any thoughts on this.
>
> Regards,
> Gengliang
>
