Casting my own +1 (non-binding).

Ángel, I echo what Wenchen said. Connectors and Spark interact via DSv2, so the
new constraint API has to live in that layer. It will be optional, but it will
make a ton of sense for many connectors, especially modern open table formats
that decouple table metadata from engines. All of the parsing/validation/
enforcement logic will be generic and can be reused beyond DSv2 if that is ever
needed in the future.

- Anton
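To make this concrete for anyone skimming the thread, below is a minimal sketch
of the end-to-end workflow the SPIP is about: a DSv2 table declares a CHECK
constraint, and Spark (rather than the connector) rejects writes that violate
it. The catalog name, the `USING iceberg` clause, and the exact DDL are
illustrative assumptions; what Spark will actually parse and enforce depends on
the final implementation of SPARK-51207.

```scala
import org.apache.spark.sql.SparkSession

object CheckConstraintSketch {
  def main(args: Array[String]): Unit = {
    // Assumes a Spark build that includes the DSv2 constraints work and a
    // catalog ("demo") backed by a format that stores constraint metadata.
    val spark = SparkSession.builder()
      .appName("check-constraint-sketch")
      .getOrCreate()

    // Standard-SQL-style CHECK constraint: the connector stores the
    // definition, and under the proposal Spark validates it on write by
    // default.
    spark.sql(
      """CREATE TABLE demo.sales (
        |  id BIGINT NOT NULL,
        |  amount DECIMAL(10, 2),
        |  CONSTRAINT positive_amount CHECK (amount > 0)
        |) USING iceberg""".stripMargin)

    // Expected to fail under the proposal: the second row violates
    // positive_amount, so Spark raises a runtime error instead of handing
    // bad data to the connector.
    spark.sql("INSERT INTO demo.sales VALUES (1, 10.00), (2, -5.00)")

    spark.stop()
  }
}
```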
On Thu, Mar 27, 2025 at 00:59, beliefer <belie...@163.com> wrote:

> +1
>
> On 2025-03-26 14:45:09, "Chao Sun" <sunc...@apache.org> wrote:
>
> > +1
> >
> > On Tue, Mar 25, 2025 at 10:22 PM Ángel Álvarez Pascua
> > <angel.alvarez.pas...@gmail.com> wrote:
>>
>> I meant ... a data validation API would be great, but why in DSv2? Isn't
>> data validation something more general? Do we have to use DSv2 to have our
>> data validated?
>>
>> On Wed, Mar 26, 2025, 6:15, Ángel Álvarez Pascua
>> <angel.alvarez.pas...@gmail.com> wrote:
>>>
>>> For me, data validation is one thing, and exporting that data to an
>>> external system is something entirely different. Should data validation be
>>> coupled with the external system? I don't think so. But since I'm the only
>>> one arguing against this proposal, does that mean I'm wrong?
>>>
>>> On Wed, Mar 26, 2025, 6:05, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>
>>>> +1
>>>>
>>>> As Gengliang explained, the API allows connectors to request that Spark
>>>> perform data validation, but connectors can also choose to do the
>>>> validation themselves. I think it's a reasonable design, as not all
>>>> connectors have the ability to validate data on their own, such as file
>>>> formats that do not have a backend service.
>>>>
>>>> On Wed, Mar 26, 2025 at 12:56 AM Gengliang Wang <ltn...@gmail.com> wrote:
>>>>>
>>>>> Hi Ángel,
>>>>>
>>>>> Thanks for the feedback. Besides the existing NOT NULL constraint, the
>>>>> proposal suggests enforcing only *check constraints* by default in
>>>>> Spark, as they're straightforward and practical to validate at the
>>>>> engine level. Additionally, the SPIP proposes allowing connectors (like
>>>>> JDBC) to handle constraint validation externally:
>>>>>
>>>>>> Some connectors, like JDBC, may skip validation in Spark and simply
>>>>>> pass the constraint through. These connectors must declare
>>>>>> ACCEPT_UNVALIDATED_CONSTRAINTS in their table capabilities, indicating
>>>>>> they would handle constraint enforcement themselves.
>>>>>
>>>>> This approach should help improve data accuracy and consistency by
>>>>> clearly defining responsibilities and enforcing constraints closer to
>>>>> where they're best managed.
>>>>>
>>>>> On Sat, Mar 22, 2025 at 12:32 AM Ángel Álvarez Pascua
>>>>> <angel.alvarez.pas...@gmail.com> wrote:
>>>>>>
>>>>>> One thing is enforcing the quality of the data Spark is producing, and
>>>>>> another thing entirely is defining an external data model from Spark.
>>>>>>
>>>>>> The proposal doesn't necessarily facilitate data accuracy and
>>>>>> consistency. Defining constraints does help with that, but the question
>>>>>> remains: is Spark truly responsible for enforcing those constraints on
>>>>>> an external system?
>>>>>>
>>>>>> On Fri, Mar 21, 2025 at 21:29, Anton Okolnychyi
>>>>>> (<aokolnyc...@gmail.com>) wrote:
>>>>>>>
>>>>>>>> -1 (non-binding): Breaks the Chain of Responsibility. Constraints
>>>>>>>> should be defined and enforced by the data sources themselves, not
>>>>>>>> Spark. Spark is a processing engine, and enforcing constraints at
>>>>>>>> this level blurs architectural boundaries, making Spark responsible
>>>>>>>> for something it does not control.
>>>>>>>
>>>>>>> I disagree that this breaks the chain of responsibility. It may be
>>>>>>> quite the opposite, in fact. Spark is already responsible for
>>>>>>> enforcing NOT NULL constraints by adding AssertNotNull for required
>>>>>>> columns today. Connectors like Iceberg and Delta store constraint
>>>>>>> definitions but rely on engines like Spark to enforce them during
>>>>>>> INSERT, DELETE, UPDATE, and MERGE operations. Without this API, each
>>>>>>> connector would need to reimplement the same logic, creating
>>>>>>> duplication.
>>>>>>>
>>>>>>> The proposal is aligned with the SQL standard and other relational
>>>>>>> databases. In my view, it simply makes Spark a better engine,
>>>>>>> facilitates data accuracy and consistency, and enables performance
>>>>>>> optimizations.
>>>>>>>
>>>>>>> - Anton
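Anton's point about NOT NULL is already visible from user code today: Spark
injects AssertNotNull whenever a nullable column feeds a non-nullable target.
A minimal, illustrative sketch follows; the Dataset-of-Int case is just one
place where this existing mechanism surfaces, and the exact error message
varies across Spark versions.

```scala
import org.apache.spark.sql.SparkSession

object NotNullEnforcementSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("not-null-enforcement-sketch")
      .getOrCreate()
    import spark.implicits._

    // A nullable integer column containing one null row.
    val df = Seq(Some(1), Some(2), None).toDF("value")

    // Mapping the nullable column to a non-nullable Int makes the analyzer
    // wrap it in AssertNotNull, so materializing the null row fails at
    // runtime with a "null in non-nullable field" style error.
    try {
      df.as[Int].collect()
    } catch {
      case e: Exception =>
        println(s"Spark rejected the null in a non-nullable field: ${e.getMessage}")
    }

    spark.stop()
  }
}
```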
>>>>>>> On Fri, Mar 21, 2025 at 12:59, Ángel Álvarez Pascua
>>>>>>> <angel.alvarez.pas...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> -1 (non-binding): Breaks the Chain of Responsibility. Constraints
>>>>>>>> should be defined and enforced by the data sources themselves, not
>>>>>>>> Spark. Spark is a processing engine, and enforcing constraints at
>>>>>>>> this level blurs architectural boundaries, making Spark responsible
>>>>>>>> for something it does not control.
>>>>>>>>
>>>>>>>> On Fri, Mar 21, 2025 at 20:18, L. C. Hsieh (<vii...@gmail.com>)
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> +1
>>>>>>>>>
>>>>>>>>> On Fri, Mar 21, 2025 at 12:13 PM huaxin gao
>>>>>>>>> <huaxin.ga...@gmail.com> wrote:
>>>>>>>>> >
>>>>>>>>> > +1
>>>>>>>>> >
>>>>>>>>> > On Fri, Mar 21, 2025 at 12:08 PM Denny Lee
>>>>>>>>> > <denny.g....@gmail.com> wrote:
>>>>>>>>> >>
>>>>>>>>> >> +1 (non-binding)
>>>>>>>>> >>
>>>>>>>>> >> On Fri, Mar 21, 2025 at 11:52 Gengliang Wang <ltn...@gmail.com>
>>>>>>>>> >> wrote:
>>>>>>>>> >>>
>>>>>>>>> >>> +1
>>>>>>>>> >>>
>>>>>>>>> >>> On Fri, Mar 21, 2025 at 11:46 AM Anton Okolnychyi
>>>>>>>>> >>> <aokolnyc...@gmail.com> wrote:
>>>>>>>>> >>>>
>>>>>>>>> >>>> Hi all,
>>>>>>>>> >>>>
>>>>>>>>> >>>> I would like to start a vote on adding support for constraints
>>>>>>>>> >>>> to DSv2.
>>>>>>>>> >>>>
>>>>>>>>> >>>> Discussion thread: https://lists.apache.org/thread/njqjcryq0lot9rkbf10mtvf7d1t602bj
>>>>>>>>> >>>> SPIP: https://docs.google.com/document/d/1EHjB4W1LjiXxsK_G7067j9pPX0y15LUF1Z5DlUPoPIo
>>>>>>>>> >>>> PR with the API changes: https://github.com/apache/spark/pull/50253
>>>>>>>>> >>>> JIRA: https://issues.apache.org/jira/browse/SPARK-51207
>>>>>>>>> >>>>
>>>>>>>>> >>>> Please vote on the SPIP for the next 72 hours:
>>>>>>>>> >>>>
>>>>>>>>> >>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>>>> >>>> [ ] +0
>>>>>>>>> >>>> [ ] -1: I don't think this is a good idea because …
>>>>>>>>> >>>>
>>>>>>>>> >>>> - Anton
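For connectors that want to pass constraints through rather than have Spark
validate them (Gengliang's JDBC example above), the declaration would live in
the DSv2 Table's capabilities. Below is a rough sketch under the assumption
that ACCEPT_UNVALIDATED_CONSTRAINTS lands as a TableCapability value as the
SPIP text describes; that value does not exist in Spark releases predating
SPARK-51207, and the class name here is purely illustrative.

```scala
import java.util

import org.apache.spark.sql.connector.catalog.{Table, TableCapability}
import org.apache.spark.sql.types.StructType

// Sketch of a connector table that pushes constraint enforcement down to the
// external system instead of letting Spark validate on write.
class PassThroughConstraintTable(tableName: String, tableSchema: StructType)
  extends Table {

  override def name(): String = tableName

  override def schema(): StructType = tableSchema

  override def capabilities(): util.Set[TableCapability] = {
    val caps = util.EnumSet.of(
      TableCapability.BATCH_READ,
      TableCapability.BATCH_WRITE)
    // Proposed in the SPIP (not in released Spark): tell Spark to skip
    // constraint validation because the backing system (e.g. an RDBMS
    // behind JDBC) enforces the constraints itself.
    caps.add(TableCapability.ACCEPT_UNVALIDATED_CONSTRAINTS)
    caps
  }
}
```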