Sorry for using "simple" in my last email. It's not going to be simple in any terms. Thanks for sharing the Git repo, Phillip. Will definitely go through it.
Thanks
Deepak

On Mon, 19 Jun 2023 at 3:47 PM, Phillip Henry <londonjava...@gmail.com> wrote:

> I think it might be a bit more complicated than this (but happy to be proved wrong).
>
> I have a minimum working example at:
>
> https://github.com/PhillHenry/SparkConstraints.git
>
> that runs out-of-the-box (mvn test) and demonstrates what I am trying to achieve.
>
> A test persists a DataFrame that conforms to the contract, and demonstrates that one that does not conform throws an exception.
>
> I've had to slightly modify 3 Spark files to add the data contract functionality. If you can think of a more elegant solution, I'd be very grateful.
>
> Regards,
>
> Phillip
>
> On Mon, Jun 19, 2023 at 9:37 AM Deepak Sharma <deepakmc...@gmail.com> wrote:
>
>> It can be as simple as adding a function to the Spark session builder, specifically on the read, which can take the yaml file (the definition of the data contracts would be in yaml) and apply it to the data frame.
>> It can ignore the rows not matching the data contracts defined in the yaml.
>>
>> Thanks
>> Deepak
>>
>> On Mon, 19 Jun 2023 at 1:49 PM, Phillip Henry <londonjava...@gmail.com> wrote:
>>
>>> For my part, I'm not too concerned about the mechanism used to implement the validation as long as it's rich enough to express the constraints.
>>>
>>> I took a look at JSON Schema (for which there are a number of JVM implementations) but I don't think it can handle more complex data types like dates. Maybe Elliot can comment on this?
>>>
>>> Ideally, *any* reasonable mechanism could be plugged in.
>>>
>>> But what struck me from trying to write a Proof of Concept was that it was quite hard to inject my code into this particular area of the Spark machinery. It could very well be due to my limited understanding of the codebase, but it seemed the Spark code would need a bit of a refactor before a component could be injected. Maybe people in this forum with greater knowledge of this area could comment?
>>>
>>> BTW, it's interesting to see that Databricks' "Delta Live Tables" appears to be an attempt to implement data contracts within their ecosystem. Unfortunately, I think it's closed source and Python only.
>>>
>>> Regards,
>>>
>>> Phillip
>>>
>>> On Sat, Jun 17, 2023 at 11:06 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> It would be interesting to think about creating a contract validation library written in JSON format. This would give us a validation mechanism that relies on this library and could be shared among the relevant parties (a rough sketch of what I mean is below my signature). Would that be a starting point?
>>>>
>>>> HTH
>>>>
>>>> Mich Talebzadeh,
>>>> Lead Solutions Architect/Engineering Lead
>>>> Palantir Technologies Limited
>>>> London
>>>> United Kingdom
>>>>
>>>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
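>>>>
>>>> PS: A rough sketch of the kind of JSON contract entry I have in mind, using the example constraints Phillip gave at the bottom of this thread (the field names are purely illustrative, not a proposed standard):
>>>>
>>>>   {
>>>>     "table": "my_table",
>>>>     "constraints": [
>>>>       { "column": "score",      "type": "integer", "min": 0, "max": 100 },
>>>>       { "column": "start_date", "type": "date",    "before": "end_date" }
>>>>     ]
>>>>   }
>>>>
>>>> Because it is plain JSON, producers and consumers in any language could validate against the same shared file.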
>>>>
>>>> On Wed, 14 Jun 2023 at 11:13, Jean-Georges Perrin <j...@jgp.net> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> While I was at PayPal, we open sourced a template of a Data Contract; it is here: https://github.com/paypal/data-contract-template. Companies like GX (Great Expectations) are interested in using it.
>>>>>
>>>>> Spark could read some elements from it pretty easily, like the schema validation and some of the rule validations. Spark could also generate an embryo of data contracts…
>>>>>
>>>>> —jgp
>>>>>
>>>>> On Jun 13, 2023, at 07:25, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>> From my limited understanding of data contracts, there are two factors that seem necessary:
>>>>>
>>>>> 1. procedural matters
>>>>> 2. technical matters
>>>>>
>>>>> I mean, this is nothing new. Some tools like Cloud Data Fusion can assist when the procedures are validated. Simply, "The process of integrating multiple data sources to produce more consistent, accurate, and useful information than that provided by any individual data source." In the old days, we had staging tables that were used to clean and prune data from multiple sources. Nowadays we use the so-called integration layer. If you use Spark as an ETL tool, then you have to build this validation yourself. Case in point: how to map customer_id from one source to customer_no from another. Legacy systems are full of these anomalies. MDM can help, but it requires human intervention, which is time consuming. I am not sure of the role of Spark here except being able to read the mapping tables.
>>>>>
>>>>> HTH
>>>>>
>>>>> Mich Talebzadeh,
>>>>> Lead Solutions Architect/Engineering Lead
>>>>> Palantir Technologies Limited
>>>>> London
>>>>> United Kingdom
>>>>>
>>>>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>>>
>>>>> On Tue, 13 Jun 2023 at 10:01, Phillip Henry <londonjava...@gmail.com> wrote:
>>>>>
>>>>>> Hi, Fokko and Deepak.
>>>>>>
>>>>>> The problem with dbt and Great Expectations (and Soda too, I believe) is that by the time they find the problem, the error is already in production - and fixing production can be a nightmare.
>>>>>>
>>>>>> What's more, we've found that nobody ever looks at the data quality reports we already generate.
>>>>>>
>>>>>> You can, of course, run dbt, GE, etc. as part of a CI/CD pipeline, but it's usually against synthetic or, at best, sampled data (laws like GDPR generally stop personal data being anywhere but prod).
>>>>>>
>>>>>> What I'm proposing is something that stops production data ever being tainted.
>>>>>>
>>>>>> Hi, Elliot.
>>>>>>
>>>>>> Nice to see you again (we worked together 20 years ago)!
>>>>>>
>>>>>> The problem here is that a schema itself won't protect me (at least as I understand your argument).
>>>>>> For instance, I have medical records that say some of my patients are 999 years old, which is clearly ridiculous, but their age correctly conforms to an integer data type. I have other patients who were discharged *before* they were admitted to hospital. I have 28 patients out of literally millions who recently attended hospital but were discharged on 1/1/1900. As you can imagine, this made the average length of stay (a key metric for acute hospitals) much lower than it should have been. It only came to light when some average lengths of stay were negative!
>>>>>>
>>>>>> In all these cases, the data faithfully adhered to the schema.
>>>>>>
>>>>>> Hi, Ryan.
>>>>>>
>>>>>> This is an interesting point. There *should* indeed be a human connection, but often there isn't. For instance, I have a friend who complained that his company's Zurich office made a breaking change and was not even aware that his London-based department existed, never mind depended on their data. In large organisations, this is pretty common.
>>>>>>
>>>>>> TBH, my proposal doesn't address this particular use case (maybe hooks and metastore listeners would...?). But my point remains that although these relationships should exist, in a sufficiently large organisation they generally don't. And maybe we can help fix that with code?
>>>>>>
>>>>>> Would love to hear further thoughts.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Phillip
>>>>>>
>>>>>> On Tue, Jun 13, 2023 at 8:17 AM Fokko Driesprong <fo...@apache.org> wrote:
>>>>>>
>>>>>>> Hey Phillip,
>>>>>>>
>>>>>>> Thanks for raising this. I like the idea. The question is: should this be implemented in Spark or in some other framework? I know that dbt has a fairly extensive way of testing your data <https://www.getdbt.com/product/data-testing/>, making sure that you can enforce assumptions on the columns. The nice thing about dbt is that it is built from a software engineering perspective, so all the tests (or contracts) live in version control. Using pull requests, you can collaborate on changing the contract and make sure that the change has had enough attention before pushing it to production. Hope this helps!
>>>>>>>
>>>>>>> Kind regards,
>>>>>>> Fokko
>>>>>>>
>>>>>>> On Tue, 13 Jun 2023 at 04:31, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Spark can be used with tools like Great Expectations as well to implement data contracts.
>>>>>>>> I am not sure, though, if Spark alone can do data contracts.
>>>>>>>> I was reading a blog on data mesh and how to glue it together with data contracts; that's where I came across this mention of Spark and Great Expectations.
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>> -Deepak
>>>>>>>>
>>>>>>>> On Tue, 13 Jun 2023 at 12:48 AM, Elliot West <tea...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Phillip,
>>>>>>>>>
>>>>>>>>> While not as fine-grained as your example, there do exist schema systems, such as that in Avro, that can evaluate compatible and incompatible changes to the schema from the perspective of the reader, the writer, or both. This provides some potential degree of enforcement, and a means to communicate a contract.
>>>>>>>>> Interestingly, I believe this approach has been applied to both JSON Schema and Protobuf as part of the Confluent Schema Registry.
>>>>>>>>>
>>>>>>>>> Elliot.
>>>>>>>>>
>>>>>>>>> On Mon, 12 Jun 2023 at 12:43, Phillip Henry <londonjava...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi, folks.
>>>>>>>>>>
>>>>>>>>>> There currently seems to be a buzz around "data contracts". From what I can tell, these mainly advocate a cultural solution. But could big data tools instead be used to enforce these contracts?
>>>>>>>>>>
>>>>>>>>>> My questions really are: are there any plans to implement data constraints in Spark (e.g., an integer must be between 0 and 100; the date in column X must be before that in column Y)? And if not, is there an appetite for them?
>>>>>>>>>>
>>>>>>>>>> Maybe we could associate constraints with schema metadata that are enforced in the implementation of a FileFormatDataWriter? (A rough sketch of the idea is below my signature.)
>>>>>>>>>>
>>>>>>>>>> Just throwing it out there and wondering what other people think. It's an area that interests me, as it seems that over half my problems at the day job are because of dodgy data.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> Phillip
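>>>>>>>>>>
>>>>>>>>>> PS: To make the idea concrete, here is a minimal Scala sketch. The "constraint" metadata key and the validate helper are my own invention, not existing Spark API, and the check here runs as an extra pass in user space; real enforcement would happen row by row inside the write path (e.g. FileFormatDataWriter):
>>>>>>>>>>
>>>>>>>>>>   import org.apache.spark.sql.{DataFrame, SparkSession}
>>>>>>>>>>   import org.apache.spark.sql.functions.expr
>>>>>>>>>>   import org.apache.spark.sql.types.MetadataBuilder
>>>>>>>>>>
>>>>>>>>>>   object ConstraintSketch {
>>>>>>>>>>     // Hypothetical convention: a SQL predicate stored in a column's metadata.
>>>>>>>>>>     val ConstraintKey = "constraint"
>>>>>>>>>>
>>>>>>>>>>     // Throw if any row violates a column's declared predicate.
>>>>>>>>>>     def validate(df: DataFrame): Unit =
>>>>>>>>>>       df.schema.fields.filter(_.metadata.contains(ConstraintKey)).foreach { field =>
>>>>>>>>>>         val predicate = field.metadata.getString(ConstraintKey)
>>>>>>>>>>         if (!df.filter(!expr(predicate)).isEmpty)
>>>>>>>>>>           throw new IllegalStateException(s"${field.name} violates: $predicate")
>>>>>>>>>>       }
>>>>>>>>>>
>>>>>>>>>>     def main(args: Array[String]): Unit = {
>>>>>>>>>>       val spark = SparkSession.builder().appName("constraint-sketch").master("local[*]").getOrCreate()
>>>>>>>>>>       import spark.implicits._
>>>>>>>>>>
>>>>>>>>>>       // Attach the constraint to the age column's metadata.
>>>>>>>>>>       val ageRule = new MetadataBuilder().putString(ConstraintKey, "age BETWEEN 0 AND 100").build()
>>>>>>>>>>       val df = Seq(("a", 35), ("b", 999)).toDF("id", "age")
>>>>>>>>>>         .withColumn("age", $"age".as("age", ageRule))
>>>>>>>>>>
>>>>>>>>>>       validate(df)                 // throws: 999 is outside [0, 100]
>>>>>>>>>>       df.write.parquet("/tmp/out") // never reached for bad data
>>>>>>>>>>     }
>>>>>>>>>>   }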