I don’t think this is a good idea. Following the ANSI standard is usually
fine, but here it would *silently corrupt data*.

From your proposal doc, ANSI allows implicitly casting from long to int
(and, more generally, from any numeric type to any other numeric type) and
inserts NULL when a value overflows. That silently drops data values and is
not safe.
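
To make that concrete, here is a rough sketch of the failure mode; the
table names are made up, and the NULL-on-overflow result reflects the
proposal as described above, not output from a real build:

  // Sketch only: assumes a SparkSession `spark`; tables are hypothetical.
  spark.sql("CREATE TABLE src (id BIGINT) USING parquet")
  spark.sql("CREATE TABLE dst (id INT) USING parquet")
  spark.sql("INSERT INTO src VALUES (3000000000)")  // does not fit in an INT

  // Under the proposed ANSI-style implicit cast, this insert is accepted
  // and the overflowing value is written as NULL, so the data is silently lost.
  spark.sql("INSERT INTO dst SELECT id FROM src")
  spark.sql("SELECT * FROM dst").show()  // id: null, per the proposal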

Fixing the silent corruption by adding a runtime exception is not a good
option, either. That defers the problem until much of the job has
completed, instead of catching the error at analysis time. It is better to
fail during analysis than to run most of a job and then fail at the end.

In addition, part of the justification for using the ANSI standard is to
avoid breaking existing jobs. But the new behavior applies only to
DataSourceV2, so it won’t affect existing jobs until sources move to v2 and
break other behavior anyway.

I think that the correct solution is to go with the existing validation
rules that require explicit casts to truncate values.
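
To illustrate, reusing the hypothetical src/dst tables from the sketch
above (again just a sketch, not output from a real build): the implicit
long-to-int insert is rejected at analysis time, and if truncation is
really intended the user has to spell it out, so the data loss is visible
in the query itself:

  // Rejected under the existing rules: implicit long -> int assignment.
  // spark.sql("INSERT INTO dst SELECT id FROM src")

  // Accepted: the explicit cast makes the narrowing visible in the query.
  spark.sql("INSERT INTO dst SELECT CAST(id AS INT) FROM src")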

That still leaves the use case that motivated this proposal: floating
point literals are parsed as decimals and fail simple insert statements. We
already came up with two alternatives to fix that problem in the DSv2 sync,
and I think it is a better idea to go with one of those instead of
“fixing” Spark in a way that will corrupt data or cause runtime
failures.
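
For reference, a sketch of that motivating case (table and column names
are hypothetical; the failure mode is the strict UpCast-only check on a
decimal-to-double assignment, as described in the proposal doc):

  // The literal 1.1 is parsed as a decimal. Because only UpCast is allowed
  // for DSv2 table insertion, assigning it to a DOUBLE column is rejected
  // at analysis time, so even this trivial insert fails.
  spark.sql("CREATE TABLE metrics (value DOUBLE) USING parquet")
  spark.sql("INSERT INTO metrics VALUES (1.1)")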

On Thu, Jul 25, 2019 at 9:11 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> I have heard many complaints about the old table insertion behavior.
> Blindly casting everything leaks the user's mistake to a late stage of the
> data pipeline and makes it very hard to debug. When a user writes string
> values to an int column, it's probably a mistake, e.g. the columns are
> misordered in the INSERT statement. We should fail the query earlier and
> ask users to fix the mistake.
>
> At the same time, I agree that the new table insertion behavior we
> introduced for Data Source V2 is too strict. It may fail valid queries
> unexpectedly.
>
> In general, I support the direction of following the ANSI SQL standard.
> But I'd like to do it in two steps:
> 1. Only add a cast when the assignment rule is satisfied. This should be
> the default behavior, and we should provide a legacy config to restore the
> old behavior.
> 2. Fail the cast operation at runtime if overflow happens. AFAIK Marco
> Gaido is working on it already. This will have a config as well, and by
> default we still return null.
>
> After doing this, the default behavior will be slightly different from the
> SQL standard (cast can return null), and users can turn on the ANSI mode to
> fully follow the SQL standard. This is much better than before and should
> prevent a lot of user mistakes. Not throwing exceptions at runtime by
> default is also a reasonable choice to me, as runtime failures are usually
> bad for long-running jobs.
>
> Thanks,
> Wenchen
>
> On Thu, Jul 25, 2019 at 11:37 PM Gengliang Wang <
> gengliang.w...@databricks.com> wrote:
>
>> Hi everyone,
>>
>> I would like to discuss the table insertion behavior of Spark. In the
>> current data source V2, only UpCast is allowed for table insertion. I think
>> following ANSI SQL is a better idea.
>> For more information, please read the doc Discuss: Follow ANSI SQL on
>> table insertion
>> <https://docs.google.com/document/d/1b9nnWWbKVDRp7lpzhQS1buv1_lDzWIZY2ApFs5rBcGI/edit?usp=sharing>
>> Please let me know if you have any thoughts on this.
>>
>> Regards,
>> Gengliang
>>
>

-- 
Ryan Blue
Software Engineer
Netflix
