No worries. Have you had a chance to look at it? Since this thread has gone dead, I assume there is no appetite for adding data contract functionality?
Regards,
Phillip

On Mon, 19 Jun 2023, 11:23 Deepak Sharma <deepakmc...@gmail.com> wrote:

Sorry for using "simple" in my last email. It's not going to be simple in any terms.
Thanks for sharing the Git repo, Phillip. Will definitely go through it.

Thanks
Deepak

On Mon, 19 Jun 2023 at 3:47 PM, Phillip Henry <londonjava...@gmail.com> wrote:

I think it might be a bit more complicated than this (but happy to be proved wrong).

I have a minimum working example at:

https://github.com/PhillHenry/SparkConstraints.git

that runs out of the box (mvn test) and demonstrates what I am trying to achieve.

A test persists a DataFrame that conforms to the contract, and demonstrates that one that does not throws an exception.

I've had to slightly modify three Spark files to add the data contract functionality. If you can think of a more elegant solution, I'd be very grateful.

Regards,

Phillip

On Mon, Jun 19, 2023 at 9:37 AM Deepak Sharma <deepakmc...@gmail.com> wrote:

It could be as simple as adding a function to the Spark session builder, specifically on the read, which takes a YAML file (the data contract definitions would be in YAML) and applies it to the DataFrame. It could ignore the rows not matching the data contracts defined in the YAML.

Thanks
Deepak

On Mon, 19 Jun 2023 at 1:49 PM, Phillip Henry <londonjava...@gmail.com> wrote:

For my part, I'm not too concerned about the mechanism used to implement the validation as long as it's rich enough to express the constraints.

I took a look at JSON Schema (for which there are a number of JVM implementations) but I don't think it can handle more complex data types like dates. Maybe Elliot can comment on this?

Ideally, *any* reasonable mechanism could be plugged in.

But what struck me from trying to write a proof of concept was that it was quite hard to inject my code into this particular area of the Spark machinery. It could very well be due to my limited understanding of the codebase, but it seemed the Spark code would need a bit of a refactor before a component could be injected. Maybe people in this forum with greater knowledge of this area could comment?

BTW, it's interesting to see that Databricks' "Delta Live Tables" appears to be attempting to implement data contracts within their ecosystem. Unfortunately, I think it's closed source and Python only.

Regards,

Phillip

On Sat, Jun 17, 2023 at 11:06 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

It would be interesting to think about creating a contract validation library written in JSON format. This would give us a validation mechanism that relies on this library and could be shared among the relevant parties. Would that be a starting point?

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom
On Wed, 14 Jun 2023 at 11:13, Jean-Georges Perrin <j...@jgp.net> wrote:

Hi,

While I was at PayPal, we open-sourced a template of a Data Contract; it is here: https://github.com/paypal/data-contract-template. Companies like GX (Great Expectations) are interested in using it.

Spark could read some elements from it pretty easily, like the schema validation and some rule validations. Spark could also generate an embryo of a data contract…

—jgp

On Jun 13, 2023, at 07:25, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

From my limited understanding of data contracts, there are two factors that seem necessary:

1. a procedural matter
2. a technical matter

I mean, this is nothing new. Some tools like Cloud Data Fusion can assist once the procedures are validated. It is simply "the process of integrating multiple data sources to produce more consistent, accurate, and useful information than that provided by any individual data source". In the old days, we had staging tables that were used to clean and prune data from multiple sources. Nowadays we use the so-called integration layer. If you use Spark as an ETL tool, then you have to build this validation yourself. Case in point: how to map customer_id from one source to customer_no from another. Legacy systems are full of these anomalies. MDM can help, but it requires human intervention, which is time-consuming. I am not sure of the role of Spark here except being able to read the mapping tables.

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom
On Tue, 13 Jun 2023 at 10:01, Phillip Henry <londonjava...@gmail.com> wrote:

Hi, Fokko and Deepak.

The problem with DBT and Great Expectations (and Soda too, I believe) is that by the time they find the problem, the error is already in production - and fixing production can be a nightmare.

What's more, we've found that nobody ever looks at the data quality reports we already generate.

You can, of course, run DBT, GE, etc. as part of a CI/CD pipeline, but it's usually against synthetic or, at best, sampled data (laws like GDPR generally stop personal data being anywhere but prod).

What I'm proposing is something that stops production data ever being tainted.

Hi, Elliot.

Nice to see you again (we worked together 20 years ago)!

The problem here is that a schema itself won't protect me (at least as I understand your argument). For instance, I have medical records that say some of my patients are 999 years old, which is clearly ridiculous, but their age correctly conforms to an integer data type. I have other patients who were discharged *before* they were admitted to hospital. I have 28 patients, out of literally millions, who recently attended hospital but were discharged on 1/1/1900. As you can imagine, this made the average length of stay (a key metric for acute hospitals) much lower than it should have been. It only came to light when some average lengths of stay were negative!

In all these cases, the data faithfully adhered to the schema.

Hi, Ryan.

This is an interesting point. There *should* indeed be a human connection, but often there isn't. For instance, I have a friend who complained that his company's Zurich office made a breaking change and was not even aware that his London-based department existed, never mind depended on their data. In large organisations, this is pretty common.

TBH, my proposal doesn't address this particular use case (maybe hooks and metastore listeners would...?). But my point remains that although these relationships should exist, in a sufficiently large organisation, they generally don't. And maybe we can help fix that with code?

Would love to hear further thoughts.

Regards,

Phillip

On Tue, Jun 13, 2023 at 8:17 AM Fokko Driesprong <fo...@apache.org> wrote:

Hey Phillip,

Thanks for raising this. I like the idea. The question is: should this be implemented in Spark or in some other framework? I know that dbt has a fairly extensive way of testing your data <https://www.getdbt.com/product/data-testing/> and of making sure that you can enforce assumptions on the columns. The nice thing about dbt is that it is built from a software engineering perspective, so all the tests (or contracts) live in version control. Using pull requests, you can collaborate on changing the contract and make sure that the change has had enough attention before pushing it to production. Hope this helps!

Kind regards,
Fokko

On Tue, 13 Jun 2023 at 04:31, Deepak Sharma <deepakmc...@gmail.com> wrote:

Spark can be used with tools like Great Expectations as well to implement data contracts. I am not sure, though, whether Spark alone can do data contracts. I was reading a blog on data mesh and how to glue it together with data contracts; that's where I came across this mention of Spark and Great Expectations.

HTH

-Deepak
On Tue, 13 Jun 2023 at 12:48 AM, Elliot West <tea...@gmail.com> wrote:

Hi Phillip,

While not as fine-grained as your example, there do exist schema systems, such as that in Avro, that can evaluate compatible and incompatible changes to the schema from the perspective of the reader, the writer, or both. This provides some potential degree of enforcement, and a means to communicate a contract. Interestingly, I believe this approach has been applied to both JSON Schema and Protobuf as part of the Confluent Schema Registry.

Elliot.

On Mon, 12 Jun 2023 at 12:43, Phillip Henry <londonjava...@gmail.com> wrote:

Hi, folks.

There currently seems to be a buzz around "data contracts". From what I can tell, these mainly advocate a cultural solution. But could big data tools instead be used to enforce these contracts?

My questions really are: are there any plans to implement data constraints in Spark (e.g., an integer must be between 0 and 100; the date in column X must be before that in column Y)? And if not, is there an appetite for them?

Maybe we could associate constraints with schema metadata that are enforced in the implementation of a FileFormatDataWriter?

Just throwing it out there and wondering what other people think. It's an area that interests me, as it seems that over half my problems at the day job are because of dodgy data.

Regards,

Phillip