+1

On Tue, Mar 25, 2025 at 10:22 PM Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:
> I meant ... a data validation API would be great, but why in DSv2? Isn't data validation something more general? Do we have to use DSv2 to have our data validated?

> On Wed, Mar 26, 2025 at 6:15 AM Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:

>> For me, data validation is one thing, and exporting that data to an external system is something entirely different. Should data validation be coupled with the external system? I don't think so. But since I'm the only one arguing against this proposal, does that mean I'm wrong?

>> On Wed, Mar 26, 2025 at 6:05 AM Wenchen Fan <cloud0...@gmail.com> wrote:

>>> +1

>>> As Gengliang explained, the API allows connectors to ask Spark to perform data validation, but connectors can also choose to do the validation themselves. I think it's a reasonable design, as not all connectors are able to validate data on their own, for example file formats that have no backend service.

>>> On Wed, Mar 26, 2025 at 12:56 AM Gengliang Wang <ltn...@gmail.com> wrote:

>>>> Hi Ángel,

>>>> Thanks for the feedback. Besides the existing NOT NULL constraint, the proposal suggests enforcing only *check constraints* by default in Spark, as they're straightforward and practical to validate at the engine level. Additionally, the SPIP proposes allowing connectors (like JDBC) to handle constraint validation externally:

>>>>> Some connectors, like JDBC, may skip validation in Spark and simply pass the constraint through. These connectors must declare ACCEPT_UNVALIDATED_CONSTRAINTS in their table capabilities, indicating they would handle constraint enforcement themselves.

>>>> This approach should help improve data accuracy and consistency by clearly defining responsibilities and enforcing constraints closer to where they're best managed.

>>>> On Sat, Mar 22, 2025 at 12:32 AM Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:

>>>>> One thing is enforcing the quality of the data Spark is producing; another thing entirely is defining an external data model from Spark.

>>>>> The proposal doesn't necessarily facilitate data accuracy and consistency. Defining constraints does help with that, but the question remains: is Spark truly responsible for enforcing those constraints on an external system?

>>>>> On Fri, Mar 21, 2025 at 9:29 PM Anton Okolnychyi <aokolnyc...@gmail.com> wrote:

>>>>>>> -1 (non-binding): Breaks the Chain of Responsibility. Constraints should be defined and enforced by the data sources themselves, not Spark. Spark is a processing engine, and enforcing constraints at this level blurs architectural boundaries, making Spark responsible for something it does not control.

>>>>>> I disagree that this breaks the chain of responsibility. It may be quite the opposite, in fact. Spark is already responsible for enforcing NOT NULL constraints by adding AssertNotNull for required columns today. Connectors like Iceberg and Delta store constraint definitions but rely on engines like Spark to enforce them during INSERT, DELETE, UPDATE, and MERGE operations. Without this API, each connector would need to reimplement the same logic, creating duplication.
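>>>>>> (Purely as an illustration of that existing behavior, using a made-up catalog and table name: assume `demo_cat.db.events` is a DSv2 table whose schema declares `id BIGINT NOT NULL`. From spark-shell, roughly:)

>>>>>>     // Spark resolves the write, sees the required `id` column, and wraps it
>>>>>>     // in AssertNotNull, so the query fails at runtime with a
>>>>>>     // "Null value appeared in non-nullable field" error before any row
>>>>>>     // reaches the connector.
>>>>>>     spark.sql(
>>>>>>       "INSERT INTO demo_cat.db.events " +
>>>>>>       "SELECT CAST(NULL AS BIGINT) AS id, 'click' AS kind")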
>>>>>> The proposal is aligned with the SQL standard and other relational databases. In my view, it simply makes Spark a better engine, facilitates data accuracy and consistency, and enables performance optimizations.

>>>>>> - Anton

>>>>>> On Fri, Mar 21, 2025 at 12:59 PM Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:

>>>>>>> -1 (non-binding): Breaks the Chain of Responsibility. Constraints should be defined and enforced by the data sources themselves, not Spark. Spark is a processing engine, and enforcing constraints at this level blurs architectural boundaries, making Spark responsible for something it does not control.

>>>>>>> On Fri, Mar 21, 2025 at 8:18 PM L. C. Hsieh <vii...@gmail.com> wrote:

>>>>>>>> +1

>>>>>>>> On Fri, Mar 21, 2025 at 12:13 PM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>>>>>>> >
>>>>>>>> > +1
>>>>>>>> >
>>>>>>>> > On Fri, Mar 21, 2025 at 12:08 PM Denny Lee <denny.g....@gmail.com> wrote:
>>>>>>>> >>
>>>>>>>> >> +1 (non-binding)
>>>>>>>> >>
>>>>>>>> >> On Fri, Mar 21, 2025 at 11:52 Gengliang Wang <ltn...@gmail.com> wrote:
>>>>>>>> >>>
>>>>>>>> >>> +1
>>>>>>>> >>>
>>>>>>>> >>> On Fri, Mar 21, 2025 at 11:46 AM Anton Okolnychyi <aokolnyc...@gmail.com> wrote:
>>>>>>>> >>>>
>>>>>>>> >>>> Hi all,
>>>>>>>> >>>>
>>>>>>>> >>>> I would like to start a vote on adding support for constraints to DSv2.
>>>>>>>> >>>>
>>>>>>>> >>>> Discussion thread: https://lists.apache.org/thread/njqjcryq0lot9rkbf10mtvf7d1t602bj
>>>>>>>> >>>> SPIP: https://docs.google.com/document/d/1EHjB4W1LjiXxsK_G7067j9pPX0y15LUF1Z5DlUPoPIo
>>>>>>>> >>>> PR with the API changes: https://github.com/apache/spark/pull/50253
>>>>>>>> >>>> JIRA: https://issues.apache.org/jira/browse/SPARK-51207
>>>>>>>> >>>>
>>>>>>>> >>>> Please vote on the SPIP for the next 72 hours:
>>>>>>>> >>>>
>>>>>>>> >>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>>> >>>> [ ] +0
>>>>>>>> >>>> [ ] -1: I don't think this is a good idea because …
>>>>>>>> >>>>
>>>>>>>> >>>> - Anton
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org