Matt, what do you mean by maximizing standard 3 while still not throwing errors when operations overflow? Those two seem contradictory.

On Wed, Jul 31, 2019 at 9:55 AM, Matt Cheah < mch...@palantir.com > wrote:

> 
> I’m -1, simply from disagreeing with the premise that we can afford to not be maximal on standard 3. The correctness of the data is non-negotiable, and whatever solution we settle on cannot silently adjust the user’s data under any circumstances.
> 
> I think the existing behavior is fine, or perhaps the behavior can be flagged by the destination writer at write time.
> 
> -Matt Cheah
> 
> *From:* Hyukjin Kwon < gurwls...@gmail.com >
> *Date:* Monday, July 29, 2019 at 11:33 PM
> *To:* Wenchen Fan < cloud0...@gmail.com >
> *Cc:* Russell Spitzer < russell.spit...@gmail.com >, Takeshi Yamamuro < linguin....@gmail.com >, Gengliang Wang < gengliang.w...@databricks.com >, Ryan Blue < rb...@netflix.com >, Spark dev list < dev@spark.apache.org >
> *Subject:* Re: [Discuss] Follow ANSI SQL on table insertion
> 
> From my look, +1 on the proposal, considering ANSI and other DBMSes in general.
> 
> On Tue, Jul 30, 2019 at 3:21 PM Wenchen Fan < cloud0...@gmail.com > wrote:
> 
>> We can add a config for a certain behavior if it makes sense, but the most important thing we want to reach agreement on here is: what should be the default behavior?
>> 
>> Let's explore the solution space of table insertion behavior first:
>> 
>> At compile time,
>> 1. always add cast
>> 2. add cast following the ANSI SQL store assignment rule (e.g. string to int is forbidden but long to int is allowed)
>> 3. only add cast if it's 100% safe
>> 
>> At runtime,
>> 1. return null for invalid operations
>> 2. throw exceptions at runtime for invalid operations
>> 
>> The standards to evaluate a solution:
>> 1. How robust the query execution is. For example, users usually don't want to see the query fail midway.
>> 2. How tolerant it is of user queries. For example, a user may want to write long values to an int column because he knows all the long values are within int range.
>> 3. How clean the result is. For example, users usually don't want to see silently corrupted data (null values).
>> 
>> The current Spark behavior for Data Source V1 tables: always add cast and return null for invalid operations. This maximizes standards 1 and 2, but the result is least clean and users are very likely to see silently corrupted data (null values).
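
For concreteness, a minimal spark-shell sketch of the V1 behavior described above (table name and values are made up for illustration):

    // Legacy (V1) insertion: Spark adds an implicit cast, and values that
    // cannot be cast become NULL instead of raising an error.
    spark.sql("CREATE TABLE ints (i INT) USING parquet")
    spark.sql("INSERT INTO ints SELECT 'not a number'")  // string -> int cast is added silently
    spark.sql("SELECT i FROM ints").show()               // the inserted row is NULL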
>> 
>> The current Spark behavior for Data Source V2 tables (new in Spark 3.0): only add cast if it's 100% safe. This maximizes standards 1 and 3, but many queries may fail to compile, even if these queries can run on other SQL systems. Note that people can still see silently corrupted data, because cast is not the only operation that can return corrupted data. Simple operations like ADD can also return corrupted data if overflow happens, e.g. INSERT INTO t1 (intCol) SELECT anotherIntCol + 100 FROM t2
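
To illustrate that ADD overflow case, a small spark-shell sketch (names mirror the example; the wrapped result reflects the legacy, non-ANSI integer arithmetic):

    // Integer addition overflows silently under the legacy behavior, so the
    // inserted value wraps around rather than being rejected -- corrupted data, no error.
    spark.sql("CREATE TABLE t2 (anotherIntCol INT) USING parquet")
    spark.sql("INSERT INTO t2 VALUES (2147483647)")                // Int.MaxValue
    spark.sql("CREATE TABLE t1 (intCol INT) USING parquet")
    spark.sql("INSERT INTO t1 SELECT anotherIntCol + 100 FROM t2") // overflows
    spark.sql("SELECT intCol FROM t1").show()                      // shows -2147483549, not an error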
>> 
>> The proposal here: add cast following the ANSI SQL store assignment rule, and return null for invalid operations. This maximizes standard 1, and also fits standard 2 well: if a query can't compile in Spark, it usually can't compile in other mainstream databases either. I think that's tolerant enough. For standard 3, this proposal doesn't maximize it, but it already avoids many invalid operations.
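
A sketch of how the proposed default would treat a few representative inserts; the outcomes are taken from the proposal as described in this thread (not from any released Spark version), and the table/column names are hypothetical:

    // Target column t.c is INT; source s has stringCol and longCol (hypothetical schemas).
    spark.sql("INSERT INTO t SELECT stringCol FROM s")
    // Rejected at analysis time: string -> int violates the ANSI store assignment rule.
    spark.sql("INSERT INTO t SELECT longCol FROM s")
    // Allowed: long -> int satisfies the rule; per this proposal, a value outside
    // int range would become NULL at runtime rather than failing the query.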
>> 
>> Technically we can't make the result 100% clean at compile time; we have to handle things like overflow at runtime. I think the new proposal makes more sense as the default behavior.
>> 
>> On Mon, Jul 29, 2019 at 8:31 PM Russell Spitzer < russell.spit...@gmail.com > wrote:
>>> 
>>> I understand Spark is making the decisions; I'm saying that the actual final effect of the null decision would differ depending on the insertion target, if the target handles nulls differently.
>>> 
>>> On Mon, Jul 29, 2019 at 5:26 AM Wenchen Fan < cloud0...@gmail.com > wrote:
>>>> 
>>>> > I'm a big -1 on null values for invalid casts.
>>>> 
>>>> This is why we want to introduce the ANSI mode, so that an invalid cast fails at runtime. But we have to keep the null behavior for a while, for backward compatibility. Spark has returned null for invalid casts since the first day of Spark SQL; we can't just change that without a way to restore the old behavior.
>>>> 
>>>> I'm OK with adding a strict mode for the upcast behavior in table insertion, but I don't agree with making it the default. The default behavior should be either the ANSI SQL behavior or the legacy Spark behavior.
>>>> 
>>>> > other modes should be allowed only with a strict warning that the behavior will be determined by the underlying sink.
>>>> 
>>>> It seems there is some misunderstanding. The table insertion behavior is fully controlled by Spark. Spark decides when to add a cast, and Spark decides whether an invalid cast should return null or fail. The sink is only responsible for writing data, not for the type coercion/cast stuff.
>>>> 
>>>> On Sun, Jul 28, 2019 at 12:24 AM Russell Spitzer < russell.spit...@gmail.com > wrote:
>>>>> 
>>>>> I'm a big -1 on null values for invalid casts. This can lead to a lot of even more unexpected errors and runtime behavior, since null is:
>>>>> 
>>>>> 1. not allowed in all schemas (leading to a runtime error anyway)
>>>>> 2. the same as a delete in some systems (leading to data loss)
>>>>> 
>>>>> And this would be dependent on the sink being used. Spark won't just be interacting with ANSI-compliant sinks, so I think it makes much more sense to be strict. I think upcast mode is a sensible default, and other modes should be allowed only with a strict warning that the behavior will be determined by the underlying sink.
>>>>> 
>>>>> On Sat, Jul 27, 2019 at 8:05 AM Takeshi Yamamuro < linguin....@gmail.com > wrote:
>>>>>> 
>>>>>> Hi, all
>>>>>> 
>>>>>> +1 for implementing this new store cast mode.
>>>>>> From a viewpoint of DBMS users, this cast is pretty common for INSERTs, and I think this functionality could promote migrations from existing DBMSs to Spark.
>>>>>> 
>>>>>> The most important thing for DBMS users is that they could optionally choose this mode when inserting data. Therefore, I think it might be okay for the two modes (the current upcast mode and the proposed store cast mode) to co-exist for INSERTs. (There is room to discuss which mode is enabled by default, though...)
>>>>>> 
>>>>>> IMHO we'll provide the three behaviours below for INSERTs:
>>>>>>  - upcast mode
>>>>>>  - ANSI store cast mode with runtime exceptions thrown for invalid values
>>>>>>  - ANSI store cast mode with null filled for invalid values
>>>>>> 
>>>>>> On Sat, Jul 27, 2019 at 8:03 PM Gengliang Wang < gengliang.w...@databricks.com > wrote:
>>>>>>> 
>>>>>>> Hi Ryan,
>>>>>>> 
>>>>>>> Thanks for the suggestions on the proposal and doc.
>>>>>>> 
>>>>>>> Currently there is no data type validation in table insertion for V1. We are on the same page that we should improve it. But using UpCast goes from one extreme to the other: it is possible that many queries would break after upgrading to Spark 3.0.
>>>>>>> The rules of UpCast are too strict. E.g. it doesn't allow assigning Timestamp type to Date type, as there will be "precision loss". To me, the type coercion is reasonable and the "precision loss" is expected. This is very common in other SQL engines.
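
For illustration, the "precision loss" case mentioned above as a one-line spark-shell check (the literal is arbitrary):

    // Casting a timestamp to a date simply drops the time-of-day part.
    spark.sql("SELECT CAST(TIMESTAMP '2019-07-27 10:30:00' AS DATE)").show()
    // returns 2019-07-27; under the UpCast rule this assignment would be rejected
    // in table insertion, even though the truncation is what most DBMS users expect.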
>>>>>>> 
>>>>>>> As long as Spark follows the ANSI SQL store assignment rules, it is the users' responsibility to take good care of type coercion in data writing. I think it's the right decision.
>>>>>>> 
>>>>>>> > But the new behavior is only applied in DataSourceV2, so it won’t affect existing jobs until sources move to v2 and break other behavior anyway.
>>>>>>> 
>>>>>>> Eventually, most sources are supposed to be migrated to DataSourceV2. I think we can discuss and make a decision now.
>>>>>>> 
>>>>>>> > Fixing the silent corruption by adding a runtime exception is not a good option, either.
>>>>>>> 
>>>>>>> The new optional mode proposed in https://issues.apache.org/jira/browse/SPARK-28512 is disabled by default. This should be fine.
>>>>>>> 
>>>>>>> On Sat, Jul 27, 2019 at 10:23 AM Wenchen Fan < cloud0...@gmail.com > wrote:
>>>>>>>> 
>>>>>>>> I don't agree with handling literal values specially. Although Postgres does it, I can't find anything about it in the SQL standard. And it introduces inconsistent behaviors which may be strange to users:
>>>>>>>> * What about something like "INSERT INTO t SELECT float_col + 1.1"?
>>>>>>>> * The same insert with a decimal column as input will fail even when a decimal literal would succeed.
>>>>>>>> * Similar insert queries with "literal" inputs can be constructed through layers of indirection via views, inline views, CTEs, unions, etc. Would those decimals be treated as columns and fail, or would we attempt to make them succeed as well? Would users find this behavior surprising?
>>>>>>>> 
>>>>>>>> Silently corrupt data is bad, but this is the decision we made at the beginning when designing Spark behaviors. Whenever an error occurs, Spark attempts to return null instead of throwing a runtime exception. Recently we provided configs to make Spark fail at runtime for overflow, but that's another story. Silently corrupt data is bad, runtime exceptions are bad, and forbidding all the table insertions that may fail (even with very little possibility) is also bad. We have to make trade-offs. The trade-offs we made in this proposal are:
>>>>>>>> * forbid, at compile time, table insertions that are very likely to fail (things like writing string values to an int column)
>>>>>>>> * allow table insertions that are not that likely to fail. If the data is wrong, don't fail; insert null.
>>>>>>>> * provide a config to fail the insertion at runtime if the data is wrong.
>>>>>>>> 
>>>>>>>> > But the new behavior is only applied in DataSourceV2, so it won’t affect existing jobs until sources move to v2 and break other behavior anyway.
>>>>>>>> 
>>>>>>>> When users write SQL queries, they don't care if a table is backed by Data Source V1 or V2. We should make sure the table insertion behavior is consistent and reasonable. Furthermore, users may not even care whether their SQL queries run in Spark or in another RDBMS; it's better to follow the SQL standard instead of introducing a Spark-specific behavior.
>>>>>>>> 
>>>>>>>> We are not talking about a small use case like allowing writing a decimal literal to a float column; we are talking about the big goal of making Spark compliant with the SQL standard, w.r.t. https://issues.apache.org/jira/browse/SPARK-26217 . This proposal is a sub-task of it, to make the table insertion behavior follow the SQL standard.
>>>>>>>> 
>>>>>>>> On Sat, Jul 27, 2019 at 1:35 AM Ryan Blue < rb...@netflix.com > wrote:
>>>>>>>>> 
>>>>>>>>> I don’t think this is a good idea. Following the ANSI standard is usually fine, but here it would *silently corrupt data*.
>>>>>>>>> 
>>>>>>>>> From your proposal doc, ANSI allows implicitly casting from long to int (any numeric type to any other numeric type) and inserts NULL when a value overflows. That would drop data values and is not safe.
>>>>>>>>> 
>>>>>>>>> Fixing the silent corruption by adding a runtime exception is not a good option, either. That puts off the problem until much of the job has completed, instead of catching the error at analysis time. It is better to catch this earlier during analysis than to run most of a job and then fail.
>>>>>>>>> 
>>>>>>>>> In addition, part of the justification for using the ANSI standard is to avoid breaking existing jobs. But the new behavior is only applied in DataSourceV2, so it won’t affect existing jobs until sources move to v2 and break other behavior anyway.
>>>>>>>>> 
>>>>>>>>> I think that the correct solution is to go with the existing validation rules that require explicit casts to truncate values.
>>>>>>>>> 
>>>>>>>>> That still leaves the use case that motivated this proposal, which is that floating point literals are parsed as decimals and fail simple insert statements. We already came up with two alternatives to fix that problem in the DSv2 sync, and I think it is a better idea to go with one of those instead of “fixing” Spark in a way that will corrupt data or cause runtime failures.
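
A sketch of that motivating case, as I understand it from this thread (the table identifier is hypothetical, and typeof is just used to show how the literal is parsed):

    // A fractional literal is parsed as a decimal, not a float/double:
    spark.sql("SELECT typeof(1.1)").show()   // decimal(2,1)
    // So, for a table backed by a V2 source with a FLOAT column (hypothetical name below),
    // the strict upcast-only validation rejects a plain insert of that literal at
    // analysis time, because decimal -> float is not considered an upcast:
    spark.sql("INSERT INTO v2_catalog.db.floats VALUES (1.1)")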
>>>>>>>>> 
>>>>>>>>> On Thu, Jul 25, 2019 at 9:11 AM Wenchen Fan < cloud0...@gmail.com > wrote:
>>>>>>>>>> 
>>>>>>>>>> I have heard many complaints about the old table insertion behavior. Blindly casting everything leaks user mistakes to a late stage of the data pipeline and makes them very hard to debug. When a user writes string values to an int column, it's probably a mistake, with the columns misordered in the INSERT statement. We should fail the query earlier and ask users to fix the mistake.
>>>>>>>>>> 
>>>>>>>>>> In the meantime, I agree that the new table insertion behavior we introduced for Data Source V2 is too strict. It may fail valid queries unexpectedly.
>>>>>>>>>> 
>>>>>>>>>> In general, I support the direction of following the ANSI SQL standard, but I'd like to do it in 2 steps:
>>>>>>>>>> 1. Only add a cast when the assignment rule is satisfied. This should be the default behavior, and we should provide a legacy config to restore the old behavior.
>>>>>>>>>> 2. Fail the cast operation at runtime if overflow happens. AFAIK Marco Gaido is working on this already. This will have a config as well, and by default we still return null.
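
A rough sketch of how these two steps might surface to users. The config names below are my assumptions (they are not settled anywhere in this thread), loosely based on what later appeared in Spark 3.0:

    // Step 1 (assumed name): pick the assignment policy used when casts are
    // added for table insertion; a legacy value restores the old behavior.
    spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")    // or "LEGACY"
    // Step 2 (assumed name): opt in to failing at runtime on overflow instead
    // of returning null; off by default per the plan above.
    spark.conf.set("spark.sql.ansi.enabled", "true")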
>>>>>>>>>> 
>>>>>>>>>> After doing this, the default behavior will be slightly different from the SQL standard (cast can return null), and users can turn on the ANSI mode to fully follow the SQL standard. This is much better than before and should prevent a lot of user mistakes. It's also a reasonable choice to me to not throw exceptions at runtime by default, as it's usually bad for long-running jobs.
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Wenchen
>>>>>>>>>> 
>>>>>>>>>> On Thu, Jul 25, 2019 at 11:37 PM Gengliang Wang < gengliang.w...@databricks.com > wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi everyone,
>>>>>>>>>>> 
>>>>>>>>>>> I would like to discuss the table insertion behavior of Spark. In the current data source V2, only UpCast is allowed for table insertion. I think following ANSI SQL is a better idea.
>>>>>>>>>>> For more information, please read the design doc "Discuss: Follow ANSI SQL on table insertion" ( https://docs.google.com/document/d/1b9nnWWbKVDRp7lpzhQS1buv1_lDzWIZY2ApFs5rBcGI/edit?usp=sharing ).
>>>>>>>>>>> Please let me know if you have any thoughts on this.
>>>>>>>>>>> 
>>>>>>>>>>> Regards,
>>>>>>>>>>> Gengliang
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Netflix
>>>>>> 
>>>>>> --
>>>>>> ---
>>>>>> Takeshi Yamamuro