Casting my own +1 (non-binding).

Ángel, I echo what Wenchen said. Connectors and Spark interact via DSv2, so the
new constraint API has to live in that layer. It will be optional, but it will
make a ton of sense for many connectors, especially modern open table formats
that decouple table metadata from engines. All of the parsing/validation/
enforcement logic will be generic and can be reused beyond DSv2 if that is ever
needed in the future.

- Anton
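To make this concrete for anyone skimming the thread, below is a minimal sketch
of the end-to-end workflow the SPIP is about: a DSv2 table declares a CHECK
constraint, and Spark (rather than the connector) rejects writes that violate
it. The catalog name, the `USING iceberg` clause, and the exact DDL are
illustrative assumptions; what Spark will actually parse and enforce depends on
the final implementation of SPARK-51207.

```scala
import org.apache.spark.sql.SparkSession

object CheckConstraintSketch {
  def main(args: Array[String]): Unit = {
    // Assumes a Spark build that includes the DSv2 constraints work and a
    // catalog ("demo") backed by a format that stores constraint metadata.
    val spark = SparkSession.builder()
      .appName("check-constraint-sketch")
      .getOrCreate()

    // Standard-SQL-style CHECK constraint: the connector stores the
    // definition, and under the proposal Spark validates it on write by
    // default.
    spark.sql(
      """CREATE TABLE demo.sales (
        |  id BIGINT NOT NULL,
        |  amount DECIMAL(10, 2),
        |  CONSTRAINT positive_amount CHECK (amount > 0)
        |) USING iceberg""".stripMargin)

    // Expected to fail under the proposal: the second row violates
    // positive_amount, so Spark raises a runtime error instead of handing
    // bad data to the connector.
    spark.sql("INSERT INTO demo.sales VALUES (1, 10.00), (2, -5.00)")

    spark.stop()
  }
}
```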
On Thu, Mar 27, 2025 at 00:59, beliefer <belie...@163.com> wrote:

> +1
>
> On 2025-03-26 14:45:09, "Chao Sun" <sunc...@apache.org> wrote:
>
> > +1
> >
> > On Tue, Mar 25, 2025 at 10:22 PM Ángel Álvarez Pascua
> > <angel.alvarez.pas...@gmail.com> wrote:
>>
>> I meant ... a data validation API would be great, but why in DSv2? Isn't
>> data validation something more general? Do we have to use DSv2 to have our
>> data validated?
>>
>> On Wed, Mar 26, 2025, 6:15, Ángel Álvarez Pascua
>> <angel.alvarez.pas...@gmail.com> wrote:
>>>
>>> For me, data validation is one thing, and exporting that data to an
>>> external system is something entirely different. Should data validation be
>>> coupled with the external system? I don't think so. But since I'm the only
>>> one arguing against this proposal, does that mean I'm wrong?
>>>
>>> On Wed, Mar 26, 2025, 6:05, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>
>>>> +1
>>>>
>>>> As Gengliang explained, the API allows connectors to request that Spark
>>>> perform data validation, but connectors can also choose to do the
>>>> validation themselves. I think it's a reasonable design, as not all
>>>> connectors have the ability to validate data on their own, such as file
>>>> formats that do not have a backend service.
>>>>
>>>> On Wed, Mar 26, 2025 at 12:56 AM Gengliang Wang <ltn...@gmail.com> wrote:
>>>>>
>>>>> Hi Ángel,
>>>>>
>>>>> Thanks for the feedback. Besides the existing NOT NULL constraint, the
>>>>> proposal suggests enforcing only *check constraints* by default in
>>>>> Spark, as they're straightforward and practical to validate at the
>>>>> engine level. Additionally, the SPIP proposes allowing connectors (like
>>>>> JDBC) to handle constraint validation externally:
>>>>>
>>>>>> Some connectors, like JDBC, may skip validation in Spark and simply
>>>>>> pass the constraint through. These connectors must declare
>>>>>> ACCEPT_UNVALIDATED_CONSTRAINTS in their table capabilities, indicating
>>>>>> they would handle constraint enforcement themselves.
>>>>>
>>>>> This approach should help improve data accuracy and consistency by
>>>>> clearly defining responsibilities and enforcing constraints closer to
>>>>> where they're best managed.
>>>>>
>>>>> On Sat, Mar 22, 2025 at 12:32 AM Ángel Álvarez Pascua
>>>>> <angel.alvarez.pas...@gmail.com> wrote:
>>>>>>
>>>>>> One thing is enforcing the quality of the data Spark is producing, and
>>>>>> another thing entirely is defining an external data model from Spark.
>>>>>>
>>>>>> The proposal doesn't necessarily facilitate data accuracy and
>>>>>> consistency. Defining constraints does help with that, but the question
>>>>>> remains: is Spark truly responsible for enforcing those constraints on
>>>>>> an external system?
>>>>>>
>>>>>> On Fri, Mar 21, 2025 at 21:29, Anton Okolnychyi
>>>>>> (<aokolnyc...@gmail.com>) wrote:
>>>>>>>
>>>>>>>> -1 (non-binding): Breaks the Chain of Responsibility. Constraints
>>>>>>>> should be defined and enforced by the data sources themselves, not
>>>>>>>> Spark. Spark is a processing engine, and enforcing constraints at
>>>>>>>> this level blurs architectural boundaries, making Spark responsible
>>>>>>>> for something it does not control.
>>>>>>>
>>>>>>> I disagree that this breaks the chain of responsibility. It may be
>>>>>>> quite the opposite, in fact. Spark is already responsible for
>>>>>>> enforcing NOT NULL constraints by adding AssertNotNull for required
>>>>>>> columns today. Connectors like Iceberg and Delta store constraint
>>>>>>> definitions but rely on engines like Spark to enforce them during
>>>>>>> INSERT, DELETE, UPDATE, and MERGE operations. Without this API, each
>>>>>>> connector would need to reimplement the same logic, creating
>>>>>>> duplication.
>>>>>>>
>>>>>>> The proposal is aligned with the SQL standard and other relational
>>>>>>> databases. In my view, it simply makes Spark a better engine,
>>>>>>> facilitates data accuracy and consistency, and enables performance
>>>>>>> optimizations.
>>>>>>>
>>>>>>> - Anton
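Anton's point about NOT NULL is already visible from user code today: Spark
injects AssertNotNull whenever a nullable column feeds a non-nullable target.
A minimal, illustrative sketch follows; the Dataset-of-Int case is just one
place where this existing mechanism surfaces, and the exact error message
varies across Spark versions.

```scala
import org.apache.spark.sql.SparkSession

object NotNullEnforcementSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("not-null-enforcement-sketch")
      .getOrCreate()
    import spark.implicits._

    // A nullable integer column containing one null row.
    val df = Seq(Some(1), Some(2), None).toDF("value")

    // Mapping the nullable column to a non-nullable Int makes the analyzer
    // wrap it in AssertNotNull, so materializing the null row fails at
    // runtime with a "null in non-nullable field" style error.
    try {
      df.as[Int].collect()
    } catch {
      case e: Exception =>
        println(s"Spark rejected the null in a non-nullable field: ${e.getMessage}")
    }

    spark.stop()
  }
}
```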
>>>>>>> On Fri, Mar 21, 2025 at 12:59, Ángel Álvarez Pascua
>>>>>>> <angel.alvarez.pas...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> -1 (non-binding): Breaks the Chain of Responsibility. Constraints
>>>>>>>> should be defined and enforced by the data sources themselves, not
>>>>>>>> Spark. Spark is a processing engine, and enforcing constraints at
>>>>>>>> this level blurs architectural boundaries, making Spark responsible
>>>>>>>> for something it does not control.
>>>>>>>>
>>>>>>>> On Fri, Mar 21, 2025 at 20:18, L. C. Hsieh (<vii...@gmail.com>)
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> +1
>>>>>>>>>
>>>>>>>>> On Fri, Mar 21, 2025 at 12:13 PM huaxin gao
>>>>>>>>> <huaxin.ga...@gmail.com> wrote:
>>>>>>>>> >
>>>>>>>>> > +1
>>>>>>>>> >
>>>>>>>>> > On Fri, Mar 21, 2025 at 12:08 PM Denny Lee
>>>>>>>>> > <denny.g....@gmail.com> wrote:
>>>>>>>>> >>
>>>>>>>>> >> +1 (non-binding)
>>>>>>>>> >>
>>>>>>>>> >> On Fri, Mar 21, 2025 at 11:52 Gengliang Wang <ltn...@gmail.com>
>>>>>>>>> >> wrote:
>>>>>>>>> >>>
>>>>>>>>> >>> +1
>>>>>>>>> >>>
>>>>>>>>> >>> On Fri, Mar 21, 2025 at 11:46 AM Anton Okolnychyi
>>>>>>>>> >>> <aokolnyc...@gmail.com> wrote:
>>>>>>>>> >>>>
>>>>>>>>> >>>> Hi all,
>>>>>>>>> >>>>
>>>>>>>>> >>>> I would like to start a vote on adding support for constraints
>>>>>>>>> >>>> to DSv2.
>>>>>>>>> >>>>
>>>>>>>>> >>>> Discussion thread: https://lists.apache.org/thread/njqjcryq0lot9rkbf10mtvf7d1t602bj
>>>>>>>>> >>>> SPIP: https://docs.google.com/document/d/1EHjB4W1LjiXxsK_G7067j9pPX0y15LUF1Z5DlUPoPIo
>>>>>>>>> >>>> PR with the API changes: https://github.com/apache/spark/pull/50253
>>>>>>>>> >>>> JIRA: https://issues.apache.org/jira/browse/SPARK-51207
>>>>>>>>> >>>>
>>>>>>>>> >>>> Please vote on the SPIP for the next 72 hours:
>>>>>>>>> >>>>
>>>>>>>>> >>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>>>> >>>> [ ] +0
>>>>>>>>> >>>> [ ] -1: I don't think this is a good idea because …
>>>>>>>>> >>>>
>>>>>>>>> >>>> - Anton
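For connectors that want to pass constraints through rather than have Spark
validate them (Gengliang's JDBC example above), the declaration would live in
the DSv2 Table's capabilities. Below is a rough sketch under the assumption
that ACCEPT_UNVALIDATED_CONSTRAINTS lands as a TableCapability value as the
SPIP text describes; that value does not exist in Spark releases predating
SPARK-51207, and the class name here is purely illustrative.

```scala
import java.util

import org.apache.spark.sql.connector.catalog.{Table, TableCapability}
import org.apache.spark.sql.types.StructType

// Sketch of a connector table that pushes constraint enforcement down to the
// external system instead of letting Spark validate on write.
class PassThroughConstraintTable(tableName: String, tableSchema: StructType)
  extends Table {

  override def name(): String = tableName

  override def schema(): StructType = tableSchema

  override def capabilities(): util.Set[TableCapability] = {
    val caps = util.EnumSet.of(
      TableCapability.BATCH_READ,
      TableCapability.BATCH_WRITE)
    // Proposed in the SPIP (not in released Spark): tell Spark to skip
    // constraint validation because the backing system (e.g. an RDBMS
    // behind JDBC) enforces the constraints itself.
    caps.add(TableCapability.ACCEPT_UNVALIDATED_CONSTRAINTS)
    caps
  }
}
```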