Sorry for using "simple" in my last email. It's not going to be simple in any terms. Thanks for sharing the Git repo, Phillip. Will definitely go through it.
Thanks
Deepak

On Mon, 19 Jun 2023 at 3:47 PM, Phillip Henry <londonjava...@gmail.com> wrote:

> I think it might be a bit more complicated than this (but happy to be proved wrong).
>
> I have a minimum working example at:
>
> https://github.com/PhillHenry/SparkConstraints.git
>
> that runs out-of-the-box (mvn test) and demonstrates what I am trying to achieve.
>
> A test persists a DataFrame that conforms to the contract, and demonstrates that one that does not conform throws an exception.
>
> I've had to slightly modify 3 Spark files to add the data contract functionality. If you can think of a more elegant solution, I'd be very grateful.
>
> Regards,
>
> Phillip
>
> On Mon, Jun 19, 2023 at 9:37 AM Deepak Sharma <deepakmc...@gmail.com> wrote:
>
>> It can be as simple as adding a function to the Spark session builder, specifically on the read, which can take the yaml file (the definition of the data contracts would be in yaml) and apply it to the data frame.
>> It can ignore the rows not matching the data contracts defined in the yaml.
>>
>> Thanks
>> Deepak
>>
>> On Mon, 19 Jun 2023 at 1:49 PM, Phillip Henry <londonjava...@gmail.com> wrote:
>>
>>> For my part, I'm not too concerned about the mechanism used to implement the validation as long as it's rich enough to express the constraints.
>>>
>>> I took a look at JSON Schema (for which there are a number of JVM implementations) but I don't think it can handle more complex data types like dates. Maybe Elliot can comment on this?
>>>
>>> Ideally, *any* reasonable mechanism could be plugged in.
>>>
>>> But what struck me from trying to write a Proof of Concept was that it was quite hard to inject my code into this particular area of the Spark machinery. It could very well be due to my limited understanding of the codebase, but it seemed the Spark code would need a bit of a refactor before a component could be injected. Maybe people in this forum with greater knowledge of this area could comment?
>>>
>>> BTW, it's interesting to see that Databricks' "Delta Live Tables" appears to be an attempt to implement data contracts within their ecosystem. Unfortunately, I think it's closed source and Python only.
>>>
>>> Regards,
>>>
>>> Phillip
>>>
>>> On Sat, Jun 17, 2023 at 11:06 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> It would be interesting to think about creating a contract validation library written in JSON format. This would give us a validation mechanism that relies on this library and could be shared among the relevant parties (a rough sketch of what I mean is below my signature). Would that be a starting point?
>>>>
>>>> HTH
>>>>
>>>> Mich Talebzadeh,
>>>> Lead Solutions Architect/Engineering Lead
>>>> Palantir Technologies Limited
>>>> London
>>>> United Kingdom
>>>>
>>>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
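>>>>
>>>> PS: A rough sketch of the kind of JSON contract entry I have in mind, using the example constraints Phillip gave at the bottom of this thread (the field names are purely illustrative, not a proposed standard):
>>>>
>>>>   {
>>>>     "table": "my_table",
>>>>     "constraints": [
>>>>       { "column": "score",      "type": "integer", "min": 0, "max": 100 },
>>>>       { "column": "start_date", "type": "date",    "before": "end_date" }
>>>>     ]
>>>>   }
>>>>
>>>> Because it is plain JSON, producers and consumers in any language could validate against the same shared file.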
>>>>
>>>> On Wed, 14 Jun 2023 at 11:13, Jean-Georges Perrin <j...@jgp.net> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> While I was at PayPal, we open sourced a template of a Data Contract; it is here: https://github.com/paypal/data-contract-template. Companies like GX (Great Expectations) are interested in using it.
>>>>>
>>>>> Spark could read some elements from it pretty easily, like the schema validation and some of the rule validations. Spark could also generate an embryo of data contracts…
>>>>>
>>>>> —jgp
>>>>>
>>>>> On Jun 13, 2023, at 07:25, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>> From my limited understanding of data contracts, there are two factors that seem necessary:
>>>>>
>>>>> 1. procedural matters
>>>>> 2. technical matters
>>>>>
>>>>> I mean, this is nothing new. Some tools like Cloud Data Fusion can assist when the procedures are validated. Simply, "The process of integrating multiple data sources to produce more consistent, accurate, and useful information than that provided by any individual data source." In the old days, we had staging tables that were used to clean and prune data from multiple sources. Nowadays we use the so-called integration layer. If you use Spark as an ETL tool, then you have to build this validation yourself. Case in point: how to map customer_id from one source to customer_no from another. Legacy systems are full of these anomalies. MDM can help, but it requires human intervention, which is time consuming. I am not sure of the role of Spark here except being able to read the mapping tables.
>>>>>
>>>>> HTH
>>>>>
>>>>> Mich Talebzadeh,
>>>>> Lead Solutions Architect/Engineering Lead
>>>>> Palantir Technologies Limited
>>>>> London
>>>>> United Kingdom
>>>>>
>>>>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>>>
>>>>> On Tue, 13 Jun 2023 at 10:01, Phillip Henry <londonjava...@gmail.com> wrote:
>>>>>
>>>>>> Hi, Fokko and Deepak.
>>>>>>
>>>>>> The problem with dbt and Great Expectations (and Soda too, I believe) is that by the time they find the problem, the error is already in production - and fixing production can be a nightmare.
>>>>>>
>>>>>> What's more, we've found that nobody ever looks at the data quality reports we already generate.
>>>>>>
>>>>>> You can, of course, run dbt, GE, etc. as part of a CI/CD pipeline, but it's usually against synthetic or, at best, sampled data (laws like GDPR generally stop personal data being anywhere but prod).
>>>>>>
>>>>>> What I'm proposing is something that stops production data ever being tainted.
>>>>>>
>>>>>> Hi, Elliot.
>>>>>>
>>>>>> Nice to see you again (we worked together 20 years ago)!
>>>>>>
>>>>>> The problem here is that a schema itself won't protect me (at least as I understand your argument).
>>>>>> For instance, I have medical records that say some of my patients are 999 years old, which is clearly ridiculous, but their age correctly conforms to an integer data type. I have other patients who were discharged *before* they were admitted to hospital. I have 28 patients out of literally millions who recently attended hospital but were discharged on 1/1/1900. As you can imagine, this made the average length of stay (a key metric for acute hospitals) much lower than it should have been. It only came to light when some average lengths of stay were negative!
>>>>>>
>>>>>> In all these cases, the data faithfully adhered to the schema.
>>>>>>
>>>>>> Hi, Ryan.
>>>>>>
>>>>>> This is an interesting point. There *should* indeed be a human connection, but often there isn't. For instance, I have a friend who complained that his company's Zurich office made a breaking change and was not even aware that his London-based department existed, never mind depended on their data. In large organisations, this is pretty common.
>>>>>>
>>>>>> TBH, my proposal doesn't address this particular use case (maybe hooks and metastore listeners would...?). But my point remains that although these relationships should exist, in a sufficiently large organisation they generally don't. And maybe we can help fix that with code?
>>>>>>
>>>>>> Would love to hear further thoughts.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Phillip
>>>>>>
>>>>>> On Tue, Jun 13, 2023 at 8:17 AM Fokko Driesprong <fo...@apache.org> wrote:
>>>>>>
>>>>>>> Hey Phillip,
>>>>>>>
>>>>>>> Thanks for raising this. I like the idea. The question is: should this be implemented in Spark or in some other framework? I know that dbt has a fairly extensive way of testing your data <https://www.getdbt.com/product/data-testing/>, making sure that you can enforce assumptions on the columns. The nice thing about dbt is that it is built from a software engineering perspective, so all the tests (or contracts) live in version control. Using pull requests, you can collaborate on changing the contract and make sure that the change has had enough attention before pushing it to production. Hope this helps!
>>>>>>>
>>>>>>> Kind regards,
>>>>>>> Fokko
>>>>>>>
>>>>>>> On Tue, 13 Jun 2023 at 04:31, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Spark can be used with tools like Great Expectations as well to implement data contracts.
>>>>>>>> I am not sure, though, if Spark alone can do data contracts.
>>>>>>>> I was reading a blog on data mesh and how to glue it together with data contracts; that's where I came across this mention of Spark and Great Expectations.
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>> -Deepak
>>>>>>>>
>>>>>>>> On Tue, 13 Jun 2023 at 12:48 AM, Elliot West <tea...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Phillip,
>>>>>>>>>
>>>>>>>>> While not as fine-grained as your example, there do exist schema systems, such as that in Avro, that can evaluate compatible and incompatible changes to the schema from the perspective of the reader, the writer, or both. This provides some potential degree of enforcement, and a means to communicate a contract.
>>>>>>>>> Interestingly, I believe this approach has been applied to both JSON Schema and Protobuf as part of the Confluent Schema Registry.
>>>>>>>>>
>>>>>>>>> Elliot.
>>>>>>>>>
>>>>>>>>> On Mon, 12 Jun 2023 at 12:43, Phillip Henry <londonjava...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi, folks.
>>>>>>>>>>
>>>>>>>>>> There currently seems to be a buzz around "data contracts". From what I can tell, these mainly advocate a cultural solution. But could big data tools instead be used to enforce these contracts?
>>>>>>>>>>
>>>>>>>>>> My questions really are: are there any plans to implement data constraints in Spark (e.g., an integer must be between 0 and 100; the date in column X must be before that in column Y)? And if not, is there an appetite for them?
>>>>>>>>>>
>>>>>>>>>> Maybe we could associate constraints with schema metadata that are enforced in the implementation of a FileFormatDataWriter? (A rough sketch of the idea is below my signature.)
>>>>>>>>>>
>>>>>>>>>> Just throwing it out there and wondering what other people think. It's an area that interests me, as it seems that over half my problems at the day job are because of dodgy data.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> Phillip
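>>>>>>>>>>
>>>>>>>>>> PS: To make the idea concrete, here is a minimal Scala sketch. The "constraint" metadata key and the validate helper are my own invention, not existing Spark API, and the check here runs as an extra pass in user space; real enforcement would happen row by row inside the write path (e.g. FileFormatDataWriter):
>>>>>>>>>>
>>>>>>>>>>   import org.apache.spark.sql.{DataFrame, SparkSession}
>>>>>>>>>>   import org.apache.spark.sql.functions.expr
>>>>>>>>>>   import org.apache.spark.sql.types.MetadataBuilder
>>>>>>>>>>
>>>>>>>>>>   object ConstraintSketch {
>>>>>>>>>>     // Hypothetical convention: a SQL predicate stored in a column's metadata.
>>>>>>>>>>     val ConstraintKey = "constraint"
>>>>>>>>>>
>>>>>>>>>>     // Throw if any row violates a column's declared predicate.
>>>>>>>>>>     def validate(df: DataFrame): Unit =
>>>>>>>>>>       df.schema.fields.filter(_.metadata.contains(ConstraintKey)).foreach { field =>
>>>>>>>>>>         val predicate = field.metadata.getString(ConstraintKey)
>>>>>>>>>>         if (!df.filter(!expr(predicate)).isEmpty)
>>>>>>>>>>           throw new IllegalStateException(s"${field.name} violates: $predicate")
>>>>>>>>>>       }
>>>>>>>>>>
>>>>>>>>>>     def main(args: Array[String]): Unit = {
>>>>>>>>>>       val spark = SparkSession.builder().appName("constraint-sketch").master("local[*]").getOrCreate()
>>>>>>>>>>       import spark.implicits._
>>>>>>>>>>
>>>>>>>>>>       // Attach the constraint to the age column's metadata.
>>>>>>>>>>       val ageRule = new MetadataBuilder().putString(ConstraintKey, "age BETWEEN 0 AND 100").build()
>>>>>>>>>>       val df = Seq(("a", 35), ("b", 999)).toDF("id", "age")
>>>>>>>>>>         .withColumn("age", $"age".as("age", ageRule))
>>>>>>>>>>
>>>>>>>>>>       validate(df)                 // throws: 999 is outside [0, 100]
>>>>>>>>>>       df.write.parquet("/tmp/out") // never reached for bad data
>>>>>>>>>>     }
>>>>>>>>>>   }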