+1

As Gengliang explained, the API allows connectors to request that Spark perform data validation, but connectors can also choose to do the validation themselves. I think it's a reasonable design, since not all connectors are able to validate data on their own, such as file formats that have no backend service.
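For concreteness, under the proposed API a connector would opt out of Spark-side validation by declaring a capability on its DSv2 Table. A rough sketch in Scala follows; ACCEPT_UNVALIDATED_CONSTRAINTS is the name used in the SPIP and is not in any released Spark version, so treat the exact API surface as tentative:

    import java.util

    import org.apache.spark.sql.connector.catalog.{Table, TableCapability}
    import org.apache.spark.sql.types.StructType

    // Hypothetical JDBC-backed table that passes constraints through to the
    // underlying database instead of having Spark validate rows on write.
    // ACCEPT_UNVALIDATED_CONSTRAINTS is the capability proposed in the SPIP.
    class PassThroughJdbcTable(tableSchema: StructType) extends Table {
      override def name(): String = "jdbc_pass_through_table"
      override def schema(): StructType = tableSchema
      override def capabilities(): util.Set[TableCapability] =
        util.EnumSet.of(
          TableCapability.BATCH_WRITE,
          TableCapability.ACCEPT_UNVALIDATED_CONSTRAINTS)
    }

A file-format connector would simply not declare the capability, and Spark would run the validation itself before handing rows to the writer.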
On Wed, Mar 26, 2025 at 12:56 AM Gengliang Wang <ltn...@gmail.com> wrote:

> Hi Ángel,
>
> Thanks for the feedback. Besides the existing NOT NULL constraint, the
> proposal suggests enforcing only *check constraints* by default in Spark,
> as they're straightforward and practical to validate at the engine level.
> Additionally, the SPIP proposes allowing connectors (like JDBC) to handle
> constraint validation externally:
>
>> Some connectors, like JDBC, may skip validation in Spark and simply pass
>> the constraint through. These connectors must declare
>> ACCEPT_UNVALIDATED_CONSTRAINTS in their table capabilities, indicating
>> they would handle constraint enforcement themselves.
>
> This approach should help improve data accuracy and consistency by
> clearly defining responsibilities and enforcing constraints closer to
> where they're best managed.
>
> On Sat, Mar 22, 2025 at 12:32 AM Ángel Álvarez Pascua
> <angel.alvarez.pas...@gmail.com> wrote:
>
>> One thing is enforcing the quality of the data Spark is producing, and
>> another thing entirely is defining an external data model from Spark.
>>
>> The proposal doesn't necessarily facilitate data accuracy and
>> consistency. Defining constraints does help with that, but the question
>> remains: Is Spark truly responsible for enforcing those constraints on
>> an external system?
>>
>> On Fri, Mar 21, 2025 at 21:29, Anton Okolnychyi (<aokolnyc...@gmail.com>)
>> wrote:
>>
>>>> -1 (non-binding): Breaks the Chain of Responsibility. Constraints
>>>> should be defined and enforced by the data sources themselves, not
>>>> Spark. Spark is a processing engine, and enforcing constraints at
>>>> this level blurs architectural boundaries, making Spark responsible
>>>> for something it does not control.
>>>
>>> I disagree that this breaks the chain of responsibility. It may be
>>> quite the opposite, in fact. Spark is already responsible for
>>> enforcing NOT NULL constraints by adding AssertNotNull for required
>>> columns today. Connectors like Iceberg and Delta store constraint
>>> definitions but rely on engines like Spark to enforce them during
>>> INSERT, DELETE, UPDATE, and MERGE operations. Without this API, each
>>> connector would need to reimplement the same logic, creating
>>> duplication.
>>>
>>> The proposal is aligned with the SQL standard and other relational
>>> databases. In my view, it simply makes Spark a better engine,
>>> facilitates data accuracy and consistency, and enables performance
>>> optimizations.
>>>
>>> - Anton
>>>
>>> On Fri, Mar 21, 2025 at 12:59, Ángel Álvarez Pascua
>>> <angel.alvarez.pas...@gmail.com> wrote:
>>>
>>>> -1 (non-binding): Breaks the Chain of Responsibility. Constraints
>>>> should be defined and enforced by the data sources themselves, not
>>>> Spark. Spark is a processing engine, and enforcing constraints at
>>>> this level blurs architectural boundaries, making Spark responsible
>>>> for something it does not control.
>>>>
>>>> On Fri, Mar 21, 2025 at 20:18, L. C. Hsieh (<vii...@gmail.com>) wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Fri, Mar 21, 2025 at 12:13 PM huaxin gao <huaxin.ga...@gmail.com>
>>>>> wrote:
>>>>> >
>>>>> > +1
>>>>> >
>>>>> > On Fri, Mar 21, 2025 at 12:08 PM Denny Lee <denny.g....@gmail.com>
>>>>> > wrote:
>>>>> >>
>>>>> >> +1 (non-binding)
>>>>> >>
>>>>> >> On Fri, Mar 21, 2025 at 11:52 Gengliang Wang <ltn...@gmail.com>
>>>>> >> wrote:
>>>>> >>>
>>>>> >>> +1
>>>>> >>>
>>>>> >>> On Fri, Mar 21, 2025 at 11:46 AM Anton Okolnychyi
>>>>> >>> <aokolnyc...@gmail.com> wrote:
>>>>> >>>>
>>>>> >>>> Hi all,
>>>>> >>>>
>>>>> >>>> I would like to start a vote on adding support for constraints
>>>>> >>>> to DSv2.
>>>>> >>>>
>>>>> >>>> Discussion thread:
>>>>> >>>> https://lists.apache.org/thread/njqjcryq0lot9rkbf10mtvf7d1t602bj
>>>>> >>>> SPIP:
>>>>> >>>> https://docs.google.com/document/d/1EHjB4W1LjiXxsK_G7067j9pPX0y15LUF1Z5DlUPoPIo
>>>>> >>>> PR with the API changes:
>>>>> >>>> https://github.com/apache/spark/pull/50253
>>>>> >>>> JIRA: https://issues.apache.org/jira/browse/SPARK-51207
>>>>> >>>>
>>>>> >>>> Please vote on the SPIP for the next 72 hours:
>>>>> >>>>
>>>>> >>>> [ ] +1: Accept the proposal as an official SPIP
>>>>> >>>> [ ] +0
>>>>> >>>> [ ] -1: I don't think this is a good idea because …
>>>>> >>>>
>>>>> >>>> - Anton
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
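To make the AssertNotNull point in Anton's reply above concrete: Spark already injects this check wherever a non-nullable field could receive a null, and one place it is easy to observe today is the Dataset encoder path. A minimal sketch (the case class and local session are illustrative only, not part of the proposal):

    import org.apache.spark.sql.SparkSession

    case class User(id: Long, name: String) // primitive Long => non-nullable

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // The id column is nullable here because java.lang.Long is boxed.
    val df = Seq[(java.lang.Long, String)]((null, "a")).toDF("id", "name")

    // Spark wraps the non-nullable `id` field in AssertNotNull, so this
    // fails at runtime with "Null value appeared in non-nullable field"
    // rather than silently materializing a corrupt User.
    df.as[User].collect()

The SPIP extends the same idea from NOT NULL to check constraints, with validation generated by Spark unless the connector declares that it will enforce the constraints itself.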