I think it might be a bit more complicated than this (but happy to be proved wrong).
I have a minimal working example at https://github.com/PhillHenry/SparkConstraints.git
that runs out of the box (mvn test) and demonstrates what I am trying to
achieve. A test persists a DataFrame that conforms to the contract, and
demonstrates that one that does not throws an Exception. I've had to
slightly modify three Spark files to add the data contract functionality.
If you can think of a more elegant solution, I'd be very grateful.

Regards,

Phillip

On Mon, Jun 19, 2023 at 9:37 AM Deepak Sharma <deepakmc...@gmail.com> wrote:

> It can be as simple as adding a function to the Spark session builder,
> specifically on the read, which can take the YAML file (the definition
> of the data contracts being in YAML) and apply it to the DataFrame. It
> can ignore the rows not matching the data contracts defined in the YAML.
>
> Thanks
> Deepak
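For illustration only, a minimal Scala sketch of what Deepak's suggestion
might look like, assuming the YAML contract has already been parsed into
SQL boolean expressions (the YAML parsing is elided, and all names here are
hypothetical rather than taken from the PoC above):

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.expr

    // Hypothetical sketch: assumes contract.yaml has already been parsed
    // into SQL boolean expressions, one per rule.
    object ContractReader {
      val rules: Seq[String] = Seq(
        "age BETWEEN 0 AND 100",  // example rules a contract might define
        "discharged >= admitted"
      )

      // Applies the contract on read, silently dropping non-conforming rows
      // (Deepak's "ignore the rows not matching" behaviour).
      def readWithContract(spark: SparkSession, path: String): DataFrame =
        rules.foldLeft(spark.read.parquet(path))((df, rule) => df.filter(expr(rule)))
    }

Whether non-conforming rows should be silently dropped, as here, or should
fail the job outright is itself a policy the contract would have to specify.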
> On Mon, 19 Jun 2023 at 1:49 PM, Phillip Henry <londonjava...@gmail.com> wrote:
>
>> For my part, I'm not too concerned about the mechanism used to implement
>> the validation as long as it's rich enough to express the constraints.
>>
>> I took a look at JSON Schemas (for which there are a number of JVM
>> implementations) but I don't think they can handle more complex data
>> types like dates. Maybe Elliot can comment on this?
>>
>> Ideally, *any* reasonable mechanism could be plugged in.
>>
>> But what struck me from trying to write a Proof of Concept was that it
>> was quite hard to inject my code into this particular area of the Spark
>> machinery. It could very well be due to my limited understanding of the
>> codebase, but it seemed the Spark code would need a bit of a refactor
>> before a component could be injected. Maybe people in this forum with
>> greater knowledge of this area could comment?
>>
>> BTW, it's interesting to see that Databricks' "Delta Live Tables" appears
>> to be an attempt to implement data contracts within their ecosystem.
>> Unfortunately, I think it's closed source and Python only.
>>
>> Regards,
>>
>> Phillip
>>
>> On Sat, Jun 17, 2023 at 11:06 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> It would be interesting to think about creating a contract validation
>>> library written in JSON format. This would give us a validation
>>> mechanism that relies on this library and could be shared among the
>>> relevant parties. Would that be a starting point?
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Lead Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>> London
>>> United Kingdom
>>>
>>> On Wed, 14 Jun 2023 at 11:13, Jean-Georges Perrin <j...@jgp.net> wrote:
>>>
>>>> Hi,
>>>>
>>>> While I was at PayPal, we open-sourced a data contract template; it is
>>>> here: https://github.com/paypal/data-contract-template. Companies like
>>>> GX (Great Expectations) are interested in using it.
>>>>
>>>> Spark could read some elements from it pretty easily, like the schema
>>>> validation and some of the rule validations. Spark could also generate
>>>> an embryo of data contracts…
>>>>
>>>> —jgp
>>>>
>>>> On Jun 13, 2023, at 07:25, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>> From my limited understanding of data contracts, there are two factors
>>>> that seem necessary:
>>>>
>>>> 1. procedural matters
>>>> 2. technical matters
>>>>
>>>> I mean, this is nothing new. Some tools, like Cloud Data Fusion, can
>>>> assist when the procedures are validated. Put simply: "The process of
>>>> integrating multiple data sources to produce more consistent, accurate,
>>>> and useful information than that provided by any individual data
>>>> source." In the old days, we had staging tables that were used to clean
>>>> and prune data from multiple sources. Nowadays we use the so-called
>>>> integration layer. If you use Spark as an ETL tool, then you have to
>>>> build this validation yourself. Case in point: how to map customer_id
>>>> from one source to customer_no from another. Legacy systems are full of
>>>> these anomalies. MDM can help, but it requires human intervention,
>>>> which is time-consuming. I am not sure of the role of Spark here,
>>>> except being able to read the mapping tables.
>>>>
>>>> HTH
>>>>
>>>> Mich Talebzadeh,
>>>> Lead Solutions Architect/Engineering Lead
>>>> Palantir Technologies Limited
>>>> London
>>>> United Kingdom
>>>>
>>>> On Tue, 13 Jun 2023 at 10:01, Phillip Henry <londonjava...@gmail.com> wrote:
>>>>
>>>>> Hi, Fokko and Deepak.
>>>>>
>>>>> The problem with dbt and Great Expectations (and Soda too, I believe)
>>>>> is that by the time they find the problem, the error is already in
>>>>> production - and fixing production can be a nightmare.
>>>>>
>>>>> What's more, we've found that nobody ever looks at the data quality
>>>>> reports we already generate.
>>>>>
>>>>> You can, of course, run dbt, GE, etc. as part of a CI/CD pipeline,
>>>>> but it's usually against synthetic or, at best, sampled data (laws
>>>>> like GDPR generally stop personal data from being anywhere but prod).
>>>>>
>>>>> What I'm proposing is something that stops production data from ever
>>>>> being tainted.
>>>>>
>>>>> Hi, Elliot.
>>>>>
>>>>> Nice to see you again (we worked together 20 years ago)!
>>>>>
>>>>> The problem here is that a schema by itself won't protect me (at
>>>>> least as I understand your argument). For instance, I have medical
>>>>> records that say some of my patients are 999 years old, which is
>>>>> clearly ridiculous, but their age correctly conforms to an integer
>>>>> data type. I have other patients who were discharged *before* they
>>>>> were admitted to hospital. I have 28 patients, out of literally
>>>>> millions, who recently attended hospital but were discharged on
>>>>> 1/1/1900. As you can imagine, this made the average length of stay
>>>>> (a key metric for acute hospitals) much lower than it should have
>>>>> been. It only came to light when some average lengths of stay were
>>>>> negative!
>>>>>
>>>>> In all these cases, the data faithfully adhered to the schema.
>>>>>
>>>>> Hi, Ryan.
>>>>>
>>>>> This is an interesting point. There *should* indeed be a human
>>>>> connection, but often there isn't. For instance, I have a friend who
>>>>> complained that his company's Zurich office made a breaking change
>>>>> and was not even aware that his London-based department existed,
>>>>> never mind depended on their data. In large organisations, this is
>>>>> pretty common.
>>>>>
>>>>> TBH, my proposal doesn't address this particular use case (maybe
>>>>> hooks and metastore listeners would...?). But my point remains that
>>>>> although these relationships should exist, in a sufficiently large
>>>>> organisation they generally don't. And maybe we can help fix that
>>>>> with code?
>>>>>
>>>>> Would love to hear further thoughts.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Phillip
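For illustration only, a minimal Scala sketch of the kind of row-level rules
Phillip describes, which a data type alone cannot express; the column names
(age, admitted, discharged) are hypothetical:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.expr

    // Hypothetical sketch: semantic rules that schema conformance cannot catch.
    object SemanticChecks {
      val rules: Seq[(String, String)] = Seq(
        ("plausible_age",             "age BETWEEN 0 AND 130"),
        ("discharge_after_admission", "discharged >= admitted"),
        ("no_sentinel_dates",         "discharged <> DATE '1900-01-01'")
      )

      // Fails fast, before the write, so production data is never tainted.
      def enforce(df: DataFrame): DataFrame = {
        rules.foreach { case (name, rule) =>
          val offenders = df.filter(!expr(rule)).count()
          if (offenders > 0)
            throw new IllegalStateException(s"Contract '$name' violated by $offenders rows")
        }
        df
      }
    }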
>>>>> On Tue, Jun 13, 2023 at 8:17 AM Fokko Driesprong <fo...@apache.org> wrote:
>>>>>
>>>>>> Hey Phillip,
>>>>>>
>>>>>> Thanks for raising this. I like the idea. The question is, should
>>>>>> this be implemented in Spark or in some other framework? I know that
>>>>>> dbt has a fairly extensive way of testing your data
>>>>>> <https://www.getdbt.com/product/data-testing/> and making sure that
>>>>>> you can enforce assumptions on the columns. The nice thing about dbt
>>>>>> is that it is built from a software engineering perspective, so all
>>>>>> the tests (or contracts) live in version control. Using pull
>>>>>> requests, you can collaborate on changing the contract and make sure
>>>>>> that the change has gotten enough attention before pushing it to
>>>>>> production. Hope this helps!
>>>>>>
>>>>>> Kind regards,
>>>>>> Fokko
>>>>>>
>>>>>> On Tue, 13 Jun 2023 at 04:31, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>>>>>
>>>>>>> Spark can be used with tools like Great Expectations as well to
>>>>>>> implement data contracts. I am not sure, though, whether Spark
>>>>>>> alone can do data contracts. I was reading a blog on data mesh and
>>>>>>> how to glue it together with data contracts; that's where I came
>>>>>>> across this mention of Spark and Great Expectations.
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> -Deepak
>>>>>>>
>>>>>>> On Tue, 13 Jun 2023 at 12:48 AM, Elliot West <tea...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Phillip,
>>>>>>>>
>>>>>>>> While not as fine-grained as your example, there do exist schema
>>>>>>>> systems, such as that in Avro, that can evaluate compatible and
>>>>>>>> incompatible changes to the schema from the perspective of the
>>>>>>>> reader, the writer, or both. This provides some potential degree
>>>>>>>> of enforcement, and a means to communicate a contract.
>>>>>>>> Interestingly, I believe this approach has been applied to both
>>>>>>>> JsonSchema and protobuf as part of the Confluent Schema Registry.
>>>>>>>>
>>>>>>>> Elliot.
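For illustration only, a Scala sketch of the compatibility check Elliot
describes, using Avro's SchemaCompatibility API (the wrapper object is
hypothetical):

    import org.apache.avro.Schema
    import org.apache.avro.SchemaCompatibility
    import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType

    // Asks Avro: can data written with the writer's schema still be read
    // under the (possibly evolved) reader's schema?
    object AvroCompatCheck {
      def canRead(readerJson: String, writerJson: String): Boolean = {
        val reader = new Schema.Parser().parse(readerJson)
        val writer = new Schema.Parser().parse(writerJson)
        SchemaCompatibility
          .checkReaderWriterCompatibility(reader, writer)
          .getType == SchemaCompatibilityType.COMPATIBLE
      }
    }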
>>>>>>>> On Mon, 12 Jun 2023 at 12:43, Phillip Henry <londonjava...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi, folks.
>>>>>>>>>
>>>>>>>>> There currently seems to be a buzz around "data contracts". From
>>>>>>>>> what I can tell, these mainly advocate a cultural solution. But
>>>>>>>>> could big data tools instead be used to enforce these contracts?
>>>>>>>>>
>>>>>>>>> My questions really are: are there any plans to implement data
>>>>>>>>> constraints in Spark (e.g., an integer must be between 0 and 100;
>>>>>>>>> the date in column X must be before that in column Y)? And if
>>>>>>>>> not, is there an appetite for them?
>>>>>>>>>
>>>>>>>>> Maybe we could associate constraints with schema metadata that
>>>>>>>>> are enforced in the implementation of a FileFormatDataWriter?
>>>>>>>>>
>>>>>>>>> Just throwing it out there and wondering what other people think.
>>>>>>>>> It's an area that interests me, as it seems that over half my
>>>>>>>>> problems at the day job are because of dodgy data.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Phillip
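For illustration only, a rough Scala sketch of that proposal: a constraint
carried in StructField metadata and enforced immediately before the write.
This only approximates what enforcement inside a FileFormatDataWriter would
do, and the metadata key and names are hypothetical:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.expr
    import org.apache.spark.sql.types.{IntegerType, MetadataBuilder, StructField}

    object MetadataConstraints {
      // A column whose contract travels with the schema itself.
      val ageField: StructField = StructField(
        "age", IntegerType, nullable = false,
        new MetadataBuilder().putString("constraint", "age BETWEEN 0 AND 100").build())

      // Collects every constraint found in the schema metadata and fails
      // the job before anything non-conforming is persisted.
      def save(df: DataFrame, path: String): Unit = {
        val constraints = df.schema.fields.collect {
          case f if f.metadata.contains("constraint") => f.metadata.getString("constraint")
        }
        constraints.foreach { c =>
          if (!df.filter(!expr(c)).isEmpty)
            throw new IllegalStateException(s"Constraint violated: $c")
        }
        df.write.parquet(path)
      }
    }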