+1

On Tue, Mar 25, 2025 at 10:22 PM Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:
> I meant ... a data validation API would be great, but why in DSv2? Isn't data validation something more general? Do we have to use DSv2 to have our data validated?

> On Wed, Mar 26, 2025 at 6:15 AM Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:

>> For me, data validation is one thing, and exporting that data to an external system is something entirely different. Should data validation be coupled with the external system? I don't think so. But since I'm the only one arguing against this proposal, does that mean I'm wrong?

>> On Wed, Mar 26, 2025 at 6:05 AM Wenchen Fan <cloud0...@gmail.com> wrote:

>>> +1

>>> As Gengliang explained, the API allows connectors to ask Spark to perform data validation, but connectors can also choose to do the validation themselves. I think it's a reasonable design, as not all connectors are able to validate data on their own, for example file formats that have no backend service.

>>> On Wed, Mar 26, 2025 at 12:56 AM Gengliang Wang <ltn...@gmail.com> wrote:

>>>> Hi Ángel,

>>>> Thanks for the feedback. Besides the existing NOT NULL constraint, the proposal suggests enforcing only *check constraints* by default in Spark, as they're straightforward and practical to validate at the engine level. Additionally, the SPIP proposes allowing connectors (like JDBC) to handle constraint validation externally:

>>>>> Some connectors, like JDBC, may skip validation in Spark and simply pass the constraint through. These connectors must declare ACCEPT_UNVALIDATED_CONSTRAINTS in their table capabilities, indicating they would handle constraint enforcement themselves.

>>>> This approach should help improve data accuracy and consistency by clearly defining responsibilities and enforcing constraints closer to where they're best managed.

>>>> On Sat, Mar 22, 2025 at 12:32 AM Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:

>>>>> One thing is enforcing the quality of the data Spark is producing; another thing entirely is defining an external data model from Spark.

>>>>> The proposal doesn't necessarily facilitate data accuracy and consistency. Defining constraints does help with that, but the question remains: is Spark truly responsible for enforcing those constraints on an external system?

>>>>> On Fri, Mar 21, 2025 at 9:29 PM Anton Okolnychyi <aokolnyc...@gmail.com> wrote:

>>>>>>> -1 (non-binding): Breaks the Chain of Responsibility. Constraints should be defined and enforced by the data sources themselves, not Spark. Spark is a processing engine, and enforcing constraints at this level blurs architectural boundaries, making Spark responsible for something it does not control.

>>>>>> I disagree that this breaks the chain of responsibility. It may be quite the opposite, in fact. Spark is already responsible for enforcing NOT NULL constraints by adding AssertNotNull for required columns today. Connectors like Iceberg and Delta store constraint definitions but rely on engines like Spark to enforce them during INSERT, DELETE, UPDATE, and MERGE operations. Without this API, each connector would need to reimplement the same logic, creating duplication.
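>>>>>> (Purely as an illustration of that existing behavior, using a made-up catalog and table name: assume `demo_cat.db.events` is a DSv2 table whose schema declares `id BIGINT NOT NULL`. From spark-shell, roughly:)

>>>>>>     // Spark resolves the write, sees the required `id` column, and wraps it
>>>>>>     // in AssertNotNull, so the query fails at runtime with a
>>>>>>     // "Null value appeared in non-nullable field" error before any row
>>>>>>     // reaches the connector.
>>>>>>     spark.sql(
>>>>>>       "INSERT INTO demo_cat.db.events " +
>>>>>>       "SELECT CAST(NULL AS BIGINT) AS id, 'click' AS kind")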
>>>>>> The proposal is aligned with the SQL standard and other relational databases. In my view, it simply makes Spark a better engine, facilitates data accuracy and consistency, and enables performance optimizations.

>>>>>> - Anton

>>>>>> On Fri, Mar 21, 2025 at 12:59 PM Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:

>>>>>>> -1 (non-binding): Breaks the Chain of Responsibility. Constraints should be defined and enforced by the data sources themselves, not Spark. Spark is a processing engine, and enforcing constraints at this level blurs architectural boundaries, making Spark responsible for something it does not control.

>>>>>>> On Fri, Mar 21, 2025 at 8:18 PM L. C. Hsieh <vii...@gmail.com> wrote:

>>>>>>>> +1

>>>>>>>> On Fri, Mar 21, 2025 at 12:13 PM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>>>>>>> >
>>>>>>>> > +1
>>>>>>>> >
>>>>>>>> > On Fri, Mar 21, 2025 at 12:08 PM Denny Lee <denny.g....@gmail.com> wrote:
>>>>>>>> >>
>>>>>>>> >> +1 (non-binding)
>>>>>>>> >>
>>>>>>>> >> On Fri, Mar 21, 2025 at 11:52 Gengliang Wang <ltn...@gmail.com> wrote:
>>>>>>>> >>>
>>>>>>>> >>> +1
>>>>>>>> >>>
>>>>>>>> >>> On Fri, Mar 21, 2025 at 11:46 AM Anton Okolnychyi <aokolnyc...@gmail.com> wrote:
>>>>>>>> >>>>
>>>>>>>> >>>> Hi all,
>>>>>>>> >>>>
>>>>>>>> >>>> I would like to start a vote on adding support for constraints to DSv2.
>>>>>>>> >>>>
>>>>>>>> >>>> Discussion thread: https://lists.apache.org/thread/njqjcryq0lot9rkbf10mtvf7d1t602bj
>>>>>>>> >>>> SPIP: https://docs.google.com/document/d/1EHjB4W1LjiXxsK_G7067j9pPX0y15LUF1Z5DlUPoPIo
>>>>>>>> >>>> PR with the API changes: https://github.com/apache/spark/pull/50253
>>>>>>>> >>>> JIRA: https://issues.apache.org/jira/browse/SPARK-51207
>>>>>>>> >>>>
>>>>>>>> >>>> Please vote on the SPIP for the next 72 hours:
>>>>>>>> >>>>
>>>>>>>> >>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>>> >>>> [ ] +0
>>>>>>>> >>>> [ ] -1: I don't think this is a good idea because …
>>>>>>>> >>>>
>>>>>>>> >>>> - Anton
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org