+1

As Gengliang explained, the API allows connectors to request that Spark perform data validation, but connectors can also choose to do the validation themselves. I think it's a reasonable design, since not all connectors are able to validate data on their own, such as file formats that have no backend service.
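For concreteness, under the proposed API a connector would opt out of Spark-side validation by declaring a capability on its DSv2 Table. A rough sketch in Scala follows; ACCEPT_UNVALIDATED_CONSTRAINTS is the name used in the SPIP and is not in any released Spark version, so treat the exact API surface as tentative:

    import java.util

    import org.apache.spark.sql.connector.catalog.{Table, TableCapability}
    import org.apache.spark.sql.types.StructType

    // Hypothetical JDBC-backed table that passes constraints through to the
    // underlying database instead of having Spark validate rows on write.
    // ACCEPT_UNVALIDATED_CONSTRAINTS is the capability proposed in the SPIP.
    class PassThroughJdbcTable(tableSchema: StructType) extends Table {
      override def name(): String = "jdbc_pass_through_table"
      override def schema(): StructType = tableSchema
      override def capabilities(): util.Set[TableCapability] =
        util.EnumSet.of(
          TableCapability.BATCH_WRITE,
          TableCapability.ACCEPT_UNVALIDATED_CONSTRAINTS)
    }

A file-format connector would simply not declare the capability, and Spark would run the validation itself before handing rows to the writer.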
On Wed, Mar 26, 2025 at 12:56 AM Gengliang Wang <ltn...@gmail.com> wrote:

> Hi Ángel,
>
> Thanks for the feedback. Besides the existing NOT NULL constraint, the
> proposal suggests enforcing only *check constraints* by default in Spark,
> as they're straightforward and practical to validate at the engine level.
> Additionally, the SPIP proposes allowing connectors (like JDBC) to handle
> constraint validation externally:
>
>> Some connectors, like JDBC, may skip validation in Spark and simply pass
>> the constraint through. These connectors must declare
>> ACCEPT_UNVALIDATED_CONSTRAINTS in their table capabilities, indicating
>> they would handle constraint enforcement themselves.
>
> This approach should help improve data accuracy and consistency by
> clearly defining responsibilities and enforcing constraints closer to
> where they're best managed.
>
> On Sat, Mar 22, 2025 at 12:32 AM Ángel Álvarez Pascua
> <angel.alvarez.pas...@gmail.com> wrote:
>
>> One thing is enforcing the quality of the data Spark is producing, and
>> another thing entirely is defining an external data model from Spark.
>>
>> The proposal doesn't necessarily facilitate data accuracy and
>> consistency. Defining constraints does help with that, but the question
>> remains: Is Spark truly responsible for enforcing those constraints on
>> an external system?
>>
>> On Fri, Mar 21, 2025 at 21:29, Anton Okolnychyi (<aokolnyc...@gmail.com>)
>> wrote:
>>
>>>> -1 (non-binding): Breaks the Chain of Responsibility. Constraints
>>>> should be defined and enforced by the data sources themselves, not
>>>> Spark. Spark is a processing engine, and enforcing constraints at
>>>> this level blurs architectural boundaries, making Spark responsible
>>>> for something it does not control.
>>>
>>> I disagree that this breaks the chain of responsibility. It may be
>>> quite the opposite, in fact. Spark is already responsible for
>>> enforcing NOT NULL constraints by adding AssertNotNull for required
>>> columns today. Connectors like Iceberg and Delta store constraint
>>> definitions but rely on engines like Spark to enforce them during
>>> INSERT, DELETE, UPDATE, and MERGE operations. Without this API, each
>>> connector would need to reimplement the same logic, creating
>>> duplication.
>>>
>>> The proposal is aligned with the SQL standard and other relational
>>> databases. In my view, it simply makes Spark a better engine,
>>> facilitates data accuracy and consistency, and enables performance
>>> optimizations.
>>>
>>> - Anton
>>>
>>> On Fri, Mar 21, 2025 at 12:59, Ángel Álvarez Pascua
>>> <angel.alvarez.pas...@gmail.com> wrote:
>>>
>>>> -1 (non-binding): Breaks the Chain of Responsibility. Constraints
>>>> should be defined and enforced by the data sources themselves, not
>>>> Spark. Spark is a processing engine, and enforcing constraints at
>>>> this level blurs architectural boundaries, making Spark responsible
>>>> for something it does not control.
>>>>
>>>> On Fri, Mar 21, 2025 at 20:18, L. C. Hsieh (<vii...@gmail.com>) wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Fri, Mar 21, 2025 at 12:13 PM huaxin gao <huaxin.ga...@gmail.com>
>>>>> wrote:
>>>>> >
>>>>> > +1
>>>>> >
>>>>> > On Fri, Mar 21, 2025 at 12:08 PM Denny Lee <denny.g....@gmail.com>
>>>>> > wrote:
>>>>> >>
>>>>> >> +1 (non-binding)
>>>>> >>
>>>>> >> On Fri, Mar 21, 2025 at 11:52 Gengliang Wang <ltn...@gmail.com>
>>>>> >> wrote:
>>>>> >>>
>>>>> >>> +1
>>>>> >>>
>>>>> >>> On Fri, Mar 21, 2025 at 11:46 AM Anton Okolnychyi
>>>>> >>> <aokolnyc...@gmail.com> wrote:
>>>>> >>>>
>>>>> >>>> Hi all,
>>>>> >>>>
>>>>> >>>> I would like to start a vote on adding support for constraints
>>>>> >>>> to DSv2.
>>>>> >>>>
>>>>> >>>> Discussion thread:
>>>>> >>>> https://lists.apache.org/thread/njqjcryq0lot9rkbf10mtvf7d1t602bj
>>>>> >>>> SPIP:
>>>>> >>>> https://docs.google.com/document/d/1EHjB4W1LjiXxsK_G7067j9pPX0y15LUF1Z5DlUPoPIo
>>>>> >>>> PR with the API changes:
>>>>> >>>> https://github.com/apache/spark/pull/50253
>>>>> >>>> JIRA: https://issues.apache.org/jira/browse/SPARK-51207
>>>>> >>>>
>>>>> >>>> Please vote on the SPIP for the next 72 hours:
>>>>> >>>>
>>>>> >>>> [ ] +1: Accept the proposal as an official SPIP
>>>>> >>>> [ ] +0
>>>>> >>>> [ ] -1: I don't think this is a good idea because …
>>>>> >>>>
>>>>> >>>> - Anton
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
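To make the AssertNotNull point in Anton's reply above concrete: Spark already injects this check wherever a non-nullable field could receive a null, and one place it is easy to observe today is the Dataset encoder path. A minimal sketch (the case class and local session are illustrative only, not part of the proposal):

    import org.apache.spark.sql.SparkSession

    case class User(id: Long, name: String) // primitive Long => non-nullable

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // The id column is nullable here because java.lang.Long is boxed.
    val df = Seq[(java.lang.Long, String)]((null, "a")).toDF("id", "name")

    // Spark wraps the non-nullable `id` field in AssertNotNull, so this
    // fails at runtime with "Null value appeared in non-nullable field"
    // rather than silently materializing a corrupt User.
    df.as[User].collect()

The SPIP extends the same idea from NOT NULL to check constraints, with validation generated by Spark unless the connector declares that it will enforce the constraints itself.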