> I'm a big -1 on null values for invalid casts.

This is why we want to introduce the ANSI mode, so that invalid casts fail
at runtime. But we have to keep the null behavior for a while, for backward
compatibility. Spark has returned null for invalid casts since the first day
of Spark SQL; we can't just change that without a way to restore the old
behavior.
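
For illustration (assuming the ANSI flag ends up being something like
spark.sql.ansi.enabled; the name isn't final):

  -- legacy behavior, the default since the beginning: invalid cast returns NULL
  SELECT CAST('abc' AS INT);        -- NULL

  -- ANSI mode: the same cast fails at runtime
  SET spark.sql.ansi.enabled=true;
  SELECT CAST('abc' AS INT);        -- runtime error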

I'm OK with adding a strict mode for the upcast behavior in table
insertion, but I don't agree with making it the default. The default
behavior should be either the ANSI SQL behavior or the legacy Spark
behavior.

> other modes should be allowed only with a strict warning that the behavior
> will be determined by the underlying sink.

It seems there is some misunderstanding. The table insertion behavior is
fully controlled by Spark: Spark decides when to add casts, and Spark decides
whether an invalid cast should return null or fail. The sink is only
responsible for writing data, not for type coercion or casting.

On Sun, Jul 28, 2019 at 12:24 AM Russell Spitzer <russell.spit...@gmail.com>
wrote:

> I'm a big -1 on null values for invalid casts. This can lead to even more
> unexpected errors and runtime behavior, since null is:
>
> 1. Not allowed in all schemas (leading to a runtime error anyway)
> 2. The same as a delete in some systems (leading to data loss)
>
> And this would be dependent on the sink being used. Spark won't just be
> interacting with ANSI-compliant sinks, so I think it makes much more sense
> to be strict. I think Upcast mode is a sensible default, and other modes
> should be allowed only with a strict warning that the behavior will be
> determined by the underlying sink.
>
> On Sat, Jul 27, 2019 at 8:05 AM Takeshi Yamamuro <linguin....@gmail.com>
> wrote:
>
>> Hi, all
>>
>> +1 for implementing this new store cast mode.
>> From the viewpoint of DBMS users, this kind of cast is pretty common for
>> INSERTs, and I think this functionality could encourage migrations from
>> existing DBMSs to Spark.
>>
>> The most important thing for DBMS users is that they can optionally choose
>> this mode when inserting data. Therefore, I think it might be okay for the
>> two modes (the current upcast mode and the proposed store cast mode) to
>> co-exist for INSERTs. (There is room to discuss which mode is enabled by
>> default, though...)
>>
>> IMHO we'll provide the three behaviours below for INSERTs:
>>  - upcast mode
>>  - ANSI store cast mode, with runtime exceptions thrown for invalid values
>>  - ANSI store cast mode, with null filled in for invalid values
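>>
>> To make the three options concrete, something like (the config name and
>> values below are just placeholders):
>>
>>   SET spark.sql.insertion.castMode=upcast;     -- reject at analysis time
>>   SET spark.sql.insertion.castMode=ansi;       -- store cast, fail at runtime on invalid values
>>   SET spark.sql.insertion.castMode=ansi_null;  -- store cast, fill NULL for invalid values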
>>
>>
>> On Sat, Jul 27, 2019 at 8:03 PM Gengliang Wang <
>> gengliang.w...@databricks.com> wrote:
>>
>>> Hi Ryan,
>>>
>>> Thanks for the suggestions on the proposal and doc.
>>> Currently, there is no data type validation in V1 table insertion. We are
>>> on the same page that we should improve it, but using UpCast goes from one
>>> extreme to the other: it is possible that many queries would break after
>>> upgrading to Spark 3.0.
>>> The rules of UpCast are too strict. E.g., it doesn't allow assigning a
>>> Timestamp value to a Date column, because of the "precision loss". To me,
>>> this type coercion is reasonable and the "precision loss" is expected; it
>>> is very common in other SQL engines.
>>> As long as Spark follows the ANSI SQL store assignment rules, it is the
>>> users' responsibility to take care of type coercion when writing data. I
>>> think that's the right decision.
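>>>
>>> For example (illustrative table), an insert like this is rejected by
>>> UpCast at analysis time, while ANSI store assignment accepts it and simply
>>> drops the time part:
>>>
>>>   CREATE TABLE d (day DATE) USING parquet;
>>>   INSERT INTO d SELECT CAST('2019-07-27 10:00:00' AS TIMESTAMP);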
>>>
>>> > But the new behavior is only applied in DataSourceV2, so it won’t
>>> affect existing jobs until sources move to v2 and break other behavior
>>> anyway.
>>> Eventually, most sources are supposed to be migrated to Data Source V2. I
>>> think we can discuss and make a decision now.
>>>
>>> > Fixing the silent corruption by adding a runtime exception is not a
>>> good option, either.
>>> The new optional mode proposed in
>>> https://issues.apache.org/jira/browse/SPARK-28512 is disabled by
>>> default. This should be fine.
>>>
>>>
>>>
>>> On Sat, Jul 27, 2019 at 10:23 AM Wenchen Fan <cloud0...@gmail.com>
>>> wrote:
>>>
>>>> I don't agree with handling literal values specially. Although Postgres
>>>> does it, I can't find anything about it in the SQL standard, and it
>>>> introduces inconsistent behaviors that may surprise users:
>>>> * What about something like "INSERT INTO t SELECT float_col + 1.1"?
>>>> * The same insert with a decimal column as input will fail even when a
>>>> decimal literal would succeed.
>>>> * Similar insert queries with "literal" inputs can be constructed through
>>>> layers of indirection via views, inline views, CTEs, unions, etc. Would
>>>> those decimals be treated as columns and fail, or would we attempt to make
>>>> them succeed as well? Would users find this behavior surprising? (See the
>>>> sketch below.)
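>>>>
>>>> For instance (hypothetical table and view names):
>>>>
>>>>   CREATE VIEW v AS SELECT 1.1 AS x;   -- a decimal literal behind a view
>>>>   INSERT INTO t SELECT x FROM v;      -- is x a "literal" or a column here?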
>>>>
>>>> Silently corrupting data is bad, but this is the decision we made at the
>>>> beginning when we designed Spark's behavior: whenever an error occurs,
>>>> Spark attempts to return null instead of throwing a runtime exception.
>>>> Recently we provided configs to make Spark fail at runtime on overflow,
>>>> but that's another story. Silently corrupting data is bad, runtime
>>>> exceptions are bad, and forbidding all table insertions that may fail
>>>> (even with very little probability) is also bad. We have to make
>>>> trade-offs. The trade-offs we made in this proposal are (a short sketch
>>>> follows):
>>>> * forbid table insertions that are very likely to fail, at compile time
>>>> (things like writing string values to an int column).
>>>> * allow table insertions that are not that likely to fail; if the data is
>>>> wrong, don't fail, insert null.
>>>> * provide a config to fail the insertion at runtime if the data is wrong.
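>>>>
>>>> A quick sketch (illustrative table t with a single int column):
>>>>
>>>>   CREATE TABLE t (i INT) USING parquet;
>>>>   INSERT INTO t VALUES ('hello');     -- string -> int: rejected at compile time
>>>>   INSERT INTO t SELECT CAST(3000000000 AS BIGINT);
>>>>       -- long -> int: allowed; per this proposal an out-of-range value
>>>>       -- becomes null by default, or fails at runtime with the config on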
>>>>
>>>> >  But the new behavior is only applied in DataSourceV2, so it won’t
>>>> affect existing jobs until sources move to v2 and break other behavior
>>>> anyway.
>>>> When users write SQL queries, they don't care whether a table is backed
>>>> by Data Source V1 or V2. We should make sure the table insertion behavior
>>>> is consistent and reasonable. Furthermore, users may not even care whether
>>>> the SQL queries run in Spark or in another RDBMS, so it's better to follow
>>>> the SQL standard instead of introducing Spark-specific behavior.
>>>>
>>>> We are not talking about a small use case like allowing a decimal literal
>>>> to be written to a float column; we are talking about the bigger goal of
>>>> making Spark compliant with the SQL standard, w.r.t.
>>>> https://issues.apache.org/jira/browse/SPARK-26217 . This proposal is a
>>>> sub-task of it: make the table insertion behavior follow the SQL standard.
>>>>
>>>> On Sat, Jul 27, 2019 at 1:35 AM Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>>> I don’t think this is a good idea. Following the ANSI standard is
>>>>> usually fine, but here it would *silently corrupt data*.
>>>>>
>>>>> From your proposal doc, ANSI allows implicitly casting from long to
>>>>> int (any numeric type to any other numeric type) and inserts NULL
>>>>> when a value overflows. That would drop data values and is not safe.
>>>>>
>>>>> Fixing the silent corruption by adding a runtime exception is not a
>>>>> good option, either. That puts off the problem until much of the job has
>>>>> completed, instead of catching the error at analysis time. It is better to
>>>>> catch this earlier during analysis than to run most of a job and then 
>>>>> fail.
>>>>>
>>>>> In addition, part of the justification for using the ANSI standard is
>>>>> to avoid breaking existing jobs. But the new behavior is only applied in
>>>>> DataSourceV2, so it won’t affect existing jobs until sources move to v2 
>>>>> and
>>>>> break other behavior anyway.
>>>>>
>>>>> I think that the correct solution is to go with the existing
>>>>> validation rules that require explicit casts to truncate values.
>>>>>
>>>>> That still leaves the use case that motivated this proposal, which is
>>>>> that floating point literals are parsed as decimals and fail simple insert
>>>>> statements. We already came up with two alternatives to fix that problem 
>>>>> in
>>>>> the DSv2 sync and I think it is a better idea to go with one of those
>>>>> instead of “fixing” Spark in a way that will corrupt data or cause runtime
>>>>> failures.
>>>>>
>>>>> On Thu, Jul 25, 2019 at 9:11 AM Wenchen Fan <cloud0...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I have heard many complaints about the old table insertion behavior.
>>>>>> Blindly casting everything leaks user mistakes to a late stage of the
>>>>>> data pipeline and makes them very hard to debug. When a user writes
>>>>>> string values to an int column, it's probably a mistake: the columns are
>>>>>> likely misordered in the INSERT statement. We should fail the query
>>>>>> earlier and ask users to fix the mistake.
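>>>>>>
>>>>>> For example (illustrative table names), with a target table
>>>>>> events(id INT, name STRING):
>>>>>>
>>>>>>   INSERT INTO events SELECT name, id FROM staging;  -- columns swapped by accident
>>>>>>
>>>>>> With blind casting, the string column is silently cast to int (often
>>>>>> producing null) and the mistake only shows up far downstream; failing at
>>>>>> analysis time points the user straight at the bad column.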
>>>>>>
>>>>>> Meanwhile, I agree that the new table insertion behavior we introduced
>>>>>> for Data Source V2 is too strict. It may fail valid queries unexpectedly.
>>>>>>
>>>>>> In general, I support the direction of following the ANSI SQL standard,
>>>>>> but I'd like to do it in 2 steps:
>>>>>> 1. only add a cast when the assignment rule is satisfied. This should be
>>>>>> the default behavior, and we should provide a legacy config to restore
>>>>>> the old behavior.
>>>>>> 2. fail the cast operation at runtime if overflow happens. AFAIK Marco
>>>>>> Gaido is already working on this. This will have a config as well, and
>>>>>> by default we will still return null.
>>>>>>
>>>>>> After doing this, the default behavior will be slightly different from
>>>>>> the SQL standard (cast can return null), and users can turn on the ANSI
>>>>>> mode to fully follow the SQL standard. This is much better than before
>>>>>> and should prevent a lot of user mistakes. It also seems reasonable to
>>>>>> me not to throw exceptions at runtime by default, as that's usually bad
>>>>>> for long-running jobs.
>>>>>>
>>>>>> Thanks,
>>>>>> Wenchen
>>>>>>
>>>>>> On Thu, Jul 25, 2019 at 11:37 PM Gengliang Wang <
>>>>>> gengliang.w...@databricks.com> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> I would like to discuss the table insertion behavior of Spark. In
>>>>>>> the current data source V2, only UpCast is allowed for table insertion. 
>>>>>>> I
>>>>>>> think following ANSI SQL is a better idea.
>>>>>>> For more information, please read the Discuss: Follow ANSI SQL on
>>>>>>> table insertion
>>>>>>> <https://docs.google.com/document/d/1b9nnWWbKVDRp7lpzhQS1buv1_lDzWIZY2ApFs5rBcGI/edit?usp=sharing>
>>>>>>> Please let me know if you have any thoughts on this.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Gengliang
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
