I think it might be a bit more complicated than this (but happy to be
proved wrong).

I have a minimum working example at:

https://github.com/PhillHenry/SparkConstraints.git

that runs out-of-the-box (mvn test) and demonstrates what I am trying to
achieve.

A test persists a DataFrame that conforms to the contract and demonstrates
that persisting one that does not conform throws an exception.

I've had to slightly modify 3 Spark files to add the data contract
functionality. If you can think of a more elegant solution, I'd be very
grateful.
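
To give a flavour of what "conforms" means here, below is a rough,
self-contained sketch of the same idea using only public APIs. It is purely
illustrative (the Constraint class and the enforce helper are made up for this
email, and the repo modifies Spark itself rather than layering a check on top
like this):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.expr

// A contract here is just a set of named SQL predicates; enforcement means
// refusing to persist a DataFrame that violates any of them.
final case class Constraint(name: String, predicate: String)

def enforce(df: DataFrame, contract: Seq[Constraint]): DataFrame = {
  contract.foreach { c =>
    val violations = df.filter(!expr(c.predicate)).count()
    if (violations > 0)
      throw new IllegalStateException(
        s"Constraint '${c.name}' violated by $violations row(s)")
  }
  df
}

// enforce(goodDf, contract).write.parquet(path)  // persists as normal
// enforce(badDf, contract).write.parquet(path)   // throws before anything is written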

Regards,

Phillip




On Mon, Jun 19, 2023 at 9:37 AM Deepak Sharma <deepakmc...@gmail.com> wrote:

> It can be as simple as adding a function to the Spark session builder,
> specifically on the read, which can take the YAML file (the data contracts
> being defined in YAML) and apply it to the DataFrame.
> It can ignore the rows not matching the data contracts defined in the YAML.
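>
> Something like the sketch below is what I have in mind (illustrative only;
> readWithContract is not an existing Spark API, and I assume the YAML has
> already been parsed into SQL predicates):
>
> import org.apache.spark.sql.{DataFrame, SparkSession}
> import org.apache.spark.sql.functions.expr
>
> // Predicates come from the YAML contract, e.g.
> //   Seq("amount >= 0", "customer_id IS NOT NULL")
> // and rows failing any of them are simply dropped on read.
> def readWithContract(spark: SparkSession, path: String,
>                      predicates: Seq[String]): DataFrame =
>   predicates.foldLeft(spark.read.parquet(path))((df, p) => df.filter(expr(p)))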
>
> Thanks
> Deepak
>
> On Mon, 19 Jun 2023 at 1:49 PM, Phillip Henry <londonjava...@gmail.com>
> wrote:
>
>> For my part, I'm not too concerned about the mechanism used to implement
>> the validation as long as it's rich enough to express the constraints.
>>
>> I took a look at JSON Schema (for which there are a number of JVM
>> implementations) but I don't think it can handle more complex data types
>> like dates. Maybe Elliot can comment on this?
>>
>> Ideally, *any* reasonable mechanism could be plugged in.
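>>
>> By "plugged in" I mean something as small as the trait below (purely
>> illustrative; it is not an existing Spark interface):
>>
>> import org.apache.spark.sql.DataFrame
>>
>> // Any mechanism (JSON Schema, Avro, hand-rolled SQL predicates, ...) would
>> // just need to implement this and report the violations it finds.
>> trait ContractValidator {
>>   def validate(df: DataFrame): Seq[String]  // empty means the contract holds
>> }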
>>
>> But what struck me from trying to write a Proof of Concept was that it
>> was quite hard to inject my code into this particular area of the Spark
>> machinery. It could very well be due to my limited understanding of the
>> codebase, but it seemed the Spark code would need a bit of a refactor
>> before a component could be injected. Maybe people in this forum with
>> greater knowledge in this area could comment?
>>
>> BTW, it's interesting to see that Databricks' "Delta Live Tables" appears
>> to be attempting to implement data contracts within their ecosystem.
>> Unfortunately, I think it's closed source and Python only.
>>
>> Regards,
>>
>> Phillip
>>
>> On Sat, Jun 17, 2023 at 11:06 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> It would be interesting to think about creating a contract validation
>>> library written in JSON format. This would give us a validation mechanism
>>> that relies on this library and could be shared among the relevant parties.
>>> Would that be a starting point?
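>>>
>>> For example, one entry in such a JSON library might look something like
>>> this (the shape and field names are entirely illustrative), together with
>>> the classes a validator could deserialise it into:
>>>
>>> // A possible shape for one contract in the shared JSON library.
>>> val exampleContract: String =
>>>   """{
>>>     |  "dataset": "orders",
>>>     |  "constraints": [
>>>     |    { "name": "non-negative quantity", "predicate": "quantity >= 0" }
>>>     |  ]
>>>     |}""".stripMargin
>>>
>>> final case class JsonConstraint(name: String, predicate: String)
>>> final case class JsonContract(dataset: String, constraints: Seq[JsonConstraint])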
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Lead Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>> London
>>> United Kingdom
>>>
>>>
>>> On Wed, 14 Jun 2023 at 11:13, Jean-Georges Perrin <j...@jgp.net> wrote:
>>>
>>>> Hi,
>>>>
>>>> While I was at PayPal, we open-sourced a Data Contract template; it is
>>>> here: https://github.com/paypal/data-contract-template. Companies like GX
>>>> (Great Expectations) are interested in using it.
>>>>
>>>> Spark could read some elements from it pretty easily, like schema
>>>> validation and some rule validations. Spark could also generate an embryo
>>>> of a data contract…
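>>>>
>>>> For instance, the "easy" elements might start out no more ambitious than
>>>> the sketch below (the assumption that the contract supplies the schema as
>>>> a DDL string is mine, not the template's):
>>>>
>>>> import org.apache.spark.sql.DataFrame
>>>> import org.apache.spark.sql.types.StructType
>>>>
>>>> // Does the DataFrame carry every field the contract declares, with the
>>>> // declared type? (declaredDdl e.g. "id INT, name STRING")
>>>> def schemaConforms(df: DataFrame, declaredDdl: String): Boolean =
>>>>   StructType.fromDDL(declaredDdl).fields.forall { f =>
>>>>     df.schema.exists(a => a.name == f.name && a.dataType == f.dataType)
>>>>   }
>>>>
>>>> // ...and the "embryo" of a generated contract could start as df.schema.toDDL.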
>>>>
>>>> —jgp
>>>>
>>>>
>>>> On Jun 13, 2023, at 07:25, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>> wrote:
>>>>
>>>> From my limited understanding of data contracts, there are two factors
>>>> that seem necessary:
>>>>
>>>>    1. procedural matters
>>>>    2. technical matters
>>>>
>>>> I mean this is nothing new. Some tools, like Cloud Data Fusion, can assist
>>>> when the procedures are validated. Put simply: "The process of integrating
>>>> multiple data sources to produce more consistent, accurate, and useful
>>>> information than that provided by any individual data source." In the old
>>>> days, we had staging tables that were used to clean and prune data from
>>>> multiple sources. Nowadays we use the so-called integration layer. If you
>>>> use Spark as an ETL tool, then you have to build this validation yourself.
>>>> Case in point: how to map customer_id from one source to customer_no from
>>>> another. Legacy systems are full of these anomalies. MDM can help, but it
>>>> requires human intervention, which is time-consuming. I am not sure what
>>>> the role of Spark is here, except being able to read the mapping tables.
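>>>>
>>>> If Spark's role really is just reading the mapping tables, the
>>>> reconciliation it enables is essentially a join (table and column names
>>>> below are made up for illustration):
>>>>
>>>> import org.apache.spark.sql.DataFrame
>>>>
>>>> // sourceA identifies customers by customer_id, sourceB by customer_no,
>>>> // and the mapping table carries both columns.
>>>> def reconcile(sourceA: DataFrame, sourceB: DataFrame, mapping: DataFrame): DataFrame =
>>>>   sourceA.join(mapping, Seq("customer_id")).join(sourceB, Seq("customer_no"))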
>>>>
>>>> HTH
>>>>
>>>> Mich Talebzadeh,
>>>> Lead Solutions Architect/Engineering Lead
>>>> Palantir Technologies Limited
>>>> London
>>>> United Kingdom
>>>>
>>>>
>>>> On Tue, 13 Jun 2023 at 10:01, Phillip Henry <londonjava...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi, Fokko and Deepak.
>>>>>
>>>>> The problem with DBT and Great Expectations (and Soda too, I believe)
>>>>> is that by the time they find the problem, the error is already in
>>>>> production - and fixing production can be a nightmare.
>>>>>
>>>>> What's more, we've found that nobody ever looks at the data quality
>>>>> reports we already generate.
>>>>>
>>>>> You can, of course, run DBT, GE, etc. as part of a CI/CD pipeline, but
>>>>> it's usually against synthetic or, at best, sampled data (laws like GDPR
>>>>> generally stop personal data being anywhere but prod).
>>>>>
>>>>> What I'm proposing is something that stops production data ever being
>>>>> tainted.
>>>>>
>>>>> Hi, Elliot.
>>>>>
>>>>> Nice to see you again (we worked together 20 years ago)!
>>>>>
>>>>> The problem here is that a schema itself won't protect me (at least as
>>>>> I understand your argument). For instance, I have medical records that say
>>>>> some of my patients are 999 years old, which is clearly ridiculous, but
>>>>> their age correctly conforms to an integer data type. I have other
>>>>> patients who were discharged *before* they were admitted to hospital. I
>>>>> have 28 patients out of literally millions who recently attended hospital
>>>>> but were discharged on 1/1/1900. As you can imagine, this made the average
>>>>> length of stay (a key metric for acute hospitals) much lower than it
>>>>> should have been. It only came to light when some average lengths of stay
>>>>> were negative!
>>>>>
>>>>> In all these cases, the data faithfully adhered to the schema.
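>>>>>
>>>>> For concreteness, the checks those examples call for are easy to write
>>>>> down as SQL predicates (column names assumed); what's missing is somewhere
>>>>> in the contract to declare them and something in the write path to enforce
>>>>> them:
>>>>>
>>>>> // What the contract needs to say over and above the schema:
>>>>> val patientConstraints = Seq(
>>>>>   "age BETWEEN 0 AND 120",              // no 999-year-old patients
>>>>>   "discharge_date >= admission_date",   // no negative lengths of stay
>>>>>   "discharge_date > DATE '1900-01-01'"  // no sentinel 1/1/1900 discharges
>>>>> )
>>>>> // Each can be checked with df.filter(not(expr(p))).count() == 0.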
>>>>>
>>>>> Hi, Ryan.
>>>>>
>>>>> This is an interesting point. There *should* indeed be a human
>>>>> connection but often there isn't. For instance, I have a friend who
>>>>> complained that his company's Zurich office made a breaking change and was
>>>>> not even aware that his London-based department existed, never mind that
>>>>> it depended on their data. In large organisations, this is pretty common.
>>>>>
>>>>> TBH, my proposal doesn't address this particular use case (maybe hooks
>>>>> and metastore listeners would...?). But my point remains that although
>>>>> these relationships should exist, in a sufficiently large organisation,
>>>>> they generally don't. And maybe we can help fix that with code?
>>>>>
>>>>> Would love to hear further thoughts.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Phillip
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Jun 13, 2023 at 8:17 AM Fokko Driesprong <fo...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hey Phillip,
>>>>>>
>>>>>> Thanks for raising this. I like the idea. The question is, should
>>>>>> this be implemented in Spark or some other framework? I know that dbt 
>>>>>> has a fairly
>>>>>> extensive way of testing your data
>>>>>> <https://www.getdbt.com/product/data-testing/>, and making sure that
>>>>>> you can enforce assumptions on the columns. The nice thing about dbt is
>>>>>> that it is built from a software engineering perspective, so all the 
>>>>>> tests
>>>>>> (or contracts) are living in version control. Using pull requests you 
>>>>>> could
>>>>>> collaborate on changing the contract and making sure that the change has
>>>>>> gotten enough attention before pushing it to production. Hope this helps!
>>>>>>
>>>>>> Kind regards,
>>>>>> Fokko
>>>>>>
>>>>>> Op di 13 jun 2023 om 04:31 schreef Deepak Sharma <
>>>>>> deepakmc...@gmail.com>:
>>>>>>
>>>>>>> Spark can be used with tools like Great Expectations as well to
>>>>>>> implement the data contracts.
>>>>>>> I am not sure, though, if Spark alone can do the data contracts.
>>>>>>> I was reading a blog on data mesh and how to glue it together with
>>>>>>> data contracts; that's where I came across this mention of Spark and
>>>>>>> Great Expectations.
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> -Deepak
>>>>>>>
>>>>>>> On Tue, 13 Jun 2023 at 12:48 AM, Elliot West <tea...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Phillip,
>>>>>>>>
>>>>>>>> While not as fine-grained as your example, there do exist schema
>>>>>>>> systems, such as that in Avro, that can evaluate compatible and
>>>>>>>> incompatible changes to the schema from the perspective of the reader,
>>>>>>>> writer, or both. This provides some potential degree of enforcement, and
>>>>>>>> a means to communicate a contract. Interestingly, I believe this approach
>>>>>>>> has been applied to both JSON Schema and protobuf as part of the Confluent
>>>>>>>> Schema Registry.
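>>>>>>>>
>>>>>>>> For anyone who hasn't seen it, the Avro check looks roughly like this
>>>>>>>> (toy schemas, but the API is the real one from the Avro Java library):
>>>>>>>>
>>>>>>>> import org.apache.avro.{Schema, SchemaCompatibility}
>>>>>>>>
>>>>>>>> val writer = new Schema.Parser().parse(
>>>>>>>>   """{"type":"record","name":"Patient","fields":[{"name":"age","type":"int"}]}""")
>>>>>>>> val reader = new Schema.Parser().parse(
>>>>>>>>   """{"type":"record","name":"Patient","fields":[{"name":"age","type":"long"}]}""")
>>>>>>>>
>>>>>>>> // COMPATIBLE here: an int written by the producer can be read as a long.
>>>>>>>> val result = SchemaCompatibility.checkReaderWriterCompatibility(reader, writer)
>>>>>>>> println(result.getType)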
>>>>>>>>
>>>>>>>> Elliot.
>>>>>>>>
>>>>>>>> On Mon, 12 Jun 2023 at 12:43, Phillip Henry <
>>>>>>>> londonjava...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi, folks.
>>>>>>>>>
>>>>>>>>> There currently seems to be a buzz around "data contracts". From
>>>>>>>>> what I can tell, these mainly advocate a cultural solution. But 
>>>>>>>>> instead,
>>>>>>>>> could big data tools be used to enforce these contracts?
>>>>>>>>>
>>>>>>>>> My questions really are: are there any plans to implement data
>>>>>>>>> constraints in Spark (e.g., an integer must be between 0 and 100; the date in
>>>>>>>>> date in
>>>>>>>>> column X must be before that in column Y)? And if not, is there an 
>>>>>>>>> appetite
>>>>>>>>> for them?
>>>>>>>>>
>>>>>>>>> Maybe we could associate constraints with schema metadata that are
>>>>>>>>> enforced in the implementation of a FileFormatDataWriter?
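>>>>>>>>>
>>>>>>>>> Something like the following is what I have in mind; the metadata
>>>>>>>>> already survives in the schema today, so it is only the enforcement on
>>>>>>>>> write that is missing (the "constraint" key is, of course, made up):
>>>>>>>>>
>>>>>>>>> import org.apache.spark.sql.types._
>>>>>>>>>
>>>>>>>>> // Constraints riding along in the schema's per-field metadata.
>>>>>>>>> val schema = StructType(Seq(
>>>>>>>>>   StructField("percentage", IntegerType, nullable = false,
>>>>>>>>>     new MetadataBuilder().putString("constraint", "percentage BETWEEN 0 AND 100").build()),
>>>>>>>>>   StructField("x", DateType, nullable = false,
>>>>>>>>>     new MetadataBuilder().putString("constraint", "x < y").build()),
>>>>>>>>>   StructField("y", DateType, nullable = false)
>>>>>>>>> ))
>>>>>>>>>
>>>>>>>>> // A FileFormatDataWriter (or anything else on the write path) could then
>>>>>>>>> // read schema("x").metadata.getString("constraint") and reject bad rows.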
>>>>>>>>>
>>>>>>>>> Just throwing it out there and wondering what other people think.
>>>>>>>>> It's an area that interests me as it seems that over half my problems 
>>>>>>>>> at
>>>>>>>>> the day job are because of dodgy data.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Phillip
>>>>>>>>>
>>>>>>>>>
>>>>
