No worries. Have you had a chance to look at it?

Since this thread has gone dead, I assume there is no appetite for adding
data contract functionality...?

Regards,

Phillip


On Mon, 19 Jun 2023, 11:23 Deepak Sharma, <deepakmc...@gmail.com> wrote:

> Sorry for using "simple" in my last email.
> It's not going to be simple in any terms.
> Thanks for sharing the Git repo, Phillip.
> Will definitely go through it.
>
> Thanks
> Deepak
>
> On Mon, 19 Jun 2023 at 3:47 PM, Phillip Henry <londonjava...@gmail.com>
> wrote:
>
>> I think it might be a bit more complicated than this (but happy to be
>> proved wrong).
>>
>> I have a minimum working example at:
>>
>> https://github.com/PhillHenry/SparkConstraints.git
>>
>> that runs out-of-the-box (mvn test) and demonstrates what I am trying to
>> achieve.
>>
>> A test persists a DataFrame that conforms to the contract, and
>> demonstrates that one that does not conform throws an Exception.
>>
>> I've had to slightly modify 3 Spark files to add the data contract
>> functionality. If you can think of a more elegant solution, I'd be very
>> grateful.
>>
>> Regards,
>>
>> Phillip
>>
>>
>>
>>
>> On Mon, Jun 19, 2023 at 9:37 AM Deepak Sharma <deepakmc...@gmail.com>
>> wrote:
>>
>>> It can be as simple as adding a function to the Spark session builder,
>>> specifically on the read, which can take the YAML file (the definition
>>> of the data contracts would be in YAML) and apply it to the DataFrame.
>>> It can ignore the rows not matching the data contracts defined in the
>>> YAML.
>>>
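>>> A rough sketch of that shape (illustrative only: the YAML parsing is
>>> elided, and the constraint strings, column names and path are made up)
>>> could keep the rules as Spark SQL predicates and drop non-conforming
>>> rows right after the read:
>>>
>>>   import org.apache.spark.sql.{DataFrame, SparkSession}
>>>   import org.apache.spark.sql.functions.expr
>>>
>>>   // Rules that would come from the YAML contract file (hard-coded here).
>>>   val constraints = Seq("age BETWEEN 0 AND 100",
>>>                         "discharge_date >= admission_date")
>>>
>>>   // Drop every row that violates at least one rule.
>>>   def applyContract(df: DataFrame): DataFrame =
>>>     constraints.foldLeft(df)((d, rule) => d.filter(expr(rule)))
>>>
>>>   val spark = SparkSession.builder().getOrCreate()
>>>   val validated = applyContract(spark.read.parquet("/path/to/input"))
>>>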
>>> Thanks
>>> Deepak
>>>
>>> On Mon, 19 Jun 2023 at 1:49 PM, Phillip Henry <londonjava...@gmail.com>
>>> wrote:
>>>
>>>> For my part, I'm not too concerned about the mechanism used to
>>>> implement the validation as long as it's rich enough to express the
>>>> constraints.
>>>>
>>>> I took a look at JSON Schema (for which there are a number of JVM
>>>> implementations), but I don't think it can handle more complex data types
>>>> like dates. Maybe Elliot can comment on this?
>>>>
>>>> Ideally, *any* reasonable mechanism could be plugged in.
>>>>
>>>> But what struck me from trying to write a Proof of Concept was that it
>>>> was quite hard to inject my code into this particular area of the Spark
>>>> machinery. It could very well be due to my limited understanding of the
>>>> codebase, but it seemed the Spark code would need a bit of a refactor
>>>> before a component could be injected. Maybe people in this forum with
>>>> greater knowledge in this area could comment?
>>>>
>>>> BTW, it's interesting to see that Databricks' "Delta Live Tables"
>>>> appears to be attempting to implement data contracts within their ecosystem.
>>>> Unfortunately, I think it's closed source and Python only.
>>>>
>>>> Regards,
>>>>
>>>> Phillip
>>>>
>>>> On Sat, Jun 17, 2023 at 11:06 AM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> It would be interesting to think about creating a contract validation
>>>>> library written in JSON format. This would provide a validation
>>>>> mechanism that relies on this library and could be shared among the
>>>>> relevant parties. Would that be a starting point?
>>>>>
>>>>> HTH
>>>>>
>>>>> Mich Talebzadeh,
>>>>> Lead Solutions Architect/Engineering Lead
>>>>> Palantir Technologies Limited
>>>>> London
>>>>> United Kingdom
>>>>>
>>>>>
>>>>>    view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, 14 Jun 2023 at 11:13, Jean-Georges Perrin <j...@jgp.net> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> While I was at PayPal, we open-sourced a template for data contracts;
>>>>>> it is here: https://github.com/paypal/data-contract-template.
>>>>>> Companies like GX (Great Expectations) are interested in using it.
>>>>>>
>>>>>> Spark could read some elements from it pretty easily, like schema
>>>>>> validation and some rule validations. Spark could also generate an
>>>>>> embryo of data contracts…
>>>>>>
>>>>>> —jgp
>>>>>>
>>>>>>
>>>>>> On Jun 13, 2023, at 07:25, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> From my limited understanding of data contracts, there are two factors
>>>>>> that are deemed necessary:
>>>>>>
>>>>>>    1. procedural matters
>>>>>>    2. technical matters
>>>>>>
>>>>>> I mean, this is nothing new. Some tools like Cloud Data Fusion can
>>>>>> assist when the procedures are validated. Simply: "The process of
>>>>>> integrating multiple data sources to produce more consistent, accurate,
>>>>>> and useful information than that provided by any individual data
>>>>>> source." In the old days, we had staging tables that were used to clean
>>>>>> and prune data from multiple sources. Nowadays we use the so-called
>>>>>> integration layer. If you use Spark as an ETL tool, then you have to
>>>>>> build this validation yourself. Case in point: how to map customer_id
>>>>>> from one source to customer_no from another. Legacy systems are full of
>>>>>> these anomalies. MDM can help, but it requires human intervention,
>>>>>> which is time-consuming. I am not sure what the role of Spark is here,
>>>>>> other than being able to read the mapping tables.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Mich Talebzadeh,
>>>>>> Lead Solutions Architect/Engineering Lead
>>>>>> Palantir Technologies Limited
>>>>>> London
>>>>>> United Kingdom
>>>>>>
>>>>>>    view my Linkedin profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>
>>>>>>
>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>
>>>>>>
>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>> may
>>>>>> arise from relying on this email's technical content is explicitly
>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>> arising from such loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, 13 Jun 2023 at 10:01, Phillip Henry <londonjava...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi, Fokko and Deepak.
>>>>>>>
>>>>>>> The problem with DBT and Great Expectations (and Soda too, I believe)
>>>>>>> is that by the time they find the problem, the error is already in
>>>>>>> production - and fixing production can be a nightmare.
>>>>>>>
>>>>>>> What's more, we've found that nobody ever looks at the data quality
>>>>>>> reports we already generate.
>>>>>>>
>>>>>>> You can, of course, run DBT, GE, etc. as part of a CI/CD pipeline, but
>>>>>>> it's usually against synthetic or at best sampled data (laws like GDPR
>>>>>>> generally stop personal data being anywhere but prod).
>>>>>>>
>>>>>>> What I'm proposing is something that stops production data ever
>>>>>>> being tainted.
>>>>>>>
>>>>>>> Hi, Elliot.
>>>>>>>
>>>>>>> Nice to see you again (we worked together 20 years ago)!
>>>>>>>
>>>>>>> The problem here is that a schema itself won't protect me (at least as
>>>>>>> I understand your argument). For instance, I have medical records that
>>>>>>> say some of my patients are 999 years old, which is clearly ridiculous,
>>>>>>> but their ages correctly conform to an integer data type. I have other
>>>>>>> patients who were discharged *before* they were admitted to hospital. I
>>>>>>> have 28 patients out of literally millions who recently attended
>>>>>>> hospital but were discharged on 1/1/1900. As you can imagine, this made
>>>>>>> the average length of stay (a key metric for acute hospitals) much
>>>>>>> lower than it should have been. It only came to light when some average
>>>>>>> lengths of stay were negative!
>>>>>>>
>>>>>>> In all these cases, the data faithfully adhered to the schema.
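>>>>>>>
>>>>>>> (To make that concrete with a quick sketch rather than production
>>>>>>> code: each of those rules is a row-level predicate that a schema alone
>>>>>>> cannot carry; in Spark, with invented column names, something like
>>>>>>>
>>>>>>>   import org.apache.spark.sql.DataFrame
>>>>>>>   import org.apache.spark.sql.functions.expr
>>>>>>>
>>>>>>>   val rules = Seq(
>>>>>>>     "age BETWEEN 0 AND 120",              // no 999-year-old patients
>>>>>>>     "discharge_date >= admission_date",   // no discharge before admission
>>>>>>>     "discharge_date > DATE '1900-01-01'"  // no 1/1/1900 sentinel values
>>>>>>>   )
>>>>>>>
>>>>>>>   // Count the rows that break any rule; a contract would refuse to
>>>>>>>   // persist them.
>>>>>>>   def violations(df: DataFrame): Long =
>>>>>>>     rules.map(rule => df.filter(expr(s"NOT ($rule)")).count()).sum
>>>>>>>
>>>>>>> would count the offending rows, and it is exactly these checks that a
>>>>>>> schema happily waves through.)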
>>>>>>>
>>>>>>> Hi, Ryan.
>>>>>>>
>>>>>>> This is an interesting point. There *should* indeed be a human
>>>>>>> connection, but often there isn't. For instance, I have a friend who
>>>>>>> complained that his company's Zurich office made a breaking change and
>>>>>>> was not even aware that his London-based department existed, never
>>>>>>> mind that it depended on their data. In large organisations, this is
>>>>>>> pretty common.
>>>>>>>
>>>>>>> TBH, my proposal doesn't address this particular use case (maybe hooks
>>>>>>> and metastore listeners would...?) But my point remains that although
>>>>>>> these relationships should exist, in a sufficiently large organisation,
>>>>>>> they generally don't. And maybe we can help fix that with code?
>>>>>>>
>>>>>>> Would love to hear further thoughts.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Phillip
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jun 13, 2023 at 8:17 AM Fokko Driesprong <fo...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey Phillip,
>>>>>>>>
>>>>>>>> Thanks for raising this. I like the idea. The question is, should
>>>>>>>> this be implemented in Spark or some other framework? I know that dbt
>>>>>>>> has a fairly extensive way of testing your data
>>>>>>>> <https://www.getdbt.com/product/data-testing/>, and making sure that
>>>>>>>> you can enforce assumptions on the columns. The nice thing about dbt
>>>>>>>> is that it is built from a software engineering perspective, so all
>>>>>>>> the tests (or contracts) are living in version control. Using pull
>>>>>>>> requests you could collaborate on changing the contract and making
>>>>>>>> sure that the change has gotten enough attention before pushing it to
>>>>>>>> production. Hope this helps!
>>>>>>>>
>>>>>>>> Kind regards,
>>>>>>>> Fokko
>>>>>>>>
>>>>>>>> On Tue, 13 Jun 2023 at 04:31, Deepak Sharma <
>>>>>>>> deepakmc...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Spark can be used with tools like Great Expectations as well to
>>>>>>>>> implement the data contracts.
>>>>>>>>> I am not sure, though, if Spark alone can do the data contracts.
>>>>>>>>> I was reading a blog on data mesh and how to glue it together with
>>>>>>>>> data contracts; that's where I came across this mention of Spark and
>>>>>>>>> Great Expectations.
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>> -Deepak
>>>>>>>>>
>>>>>>>>> On Tue, 13 Jun 2023 at 12:48 AM, Elliot West <tea...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Phillip,
>>>>>>>>>>
>>>>>>>>>> While not as fine-grained as your example, there do exist schema
>>>>>>>>>> systems, such as that in Avro, that can evaluate compatible and
>>>>>>>>>> incompatible changes to the schema from the perspective of the
>>>>>>>>>> reader, the writer, or both. This provides some potential degree of
>>>>>>>>>> enforcement, and a means to communicate a contract. Interestingly,
>>>>>>>>>> I believe this approach has been applied to both JSON Schema and
>>>>>>>>>> Protobuf as part of the Confluent Schema Registry.
>>>>>>>>>>
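>>>>>>>>>> As a small illustration of that check (the record schemas below are
>>>>>>>>>> made up), Avro's Java API exposes it directly, e.g. from Scala:
>>>>>>>>>>
>>>>>>>>>>   import org.apache.avro.{Schema, SchemaCompatibility}
>>>>>>>>>>
>>>>>>>>>>   val writer = new Schema.Parser().parse(
>>>>>>>>>>     """{"type":"record","name":"Patient","fields":[{"name":"age","type":"int"}]}""")
>>>>>>>>>>   val reader = new Schema.Parser().parse(
>>>>>>>>>>     """{"type":"record","name":"Patient","fields":[{"name":"age","type":"long"}]}""")
>>>>>>>>>>
>>>>>>>>>>   // A reader expecting long can still read data written as int
>>>>>>>>>>   // (int promotes to long), so this pair is compatible.
>>>>>>>>>>   val result = SchemaCompatibility.checkReaderWriterCompatibility(reader, writer)
>>>>>>>>>>   println(result.getType)  // COMPATIBLE
>>>>>>>>>>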
>>>>>>>>>> Elliot.
>>>>>>>>>>
>>>>>>>>>> On Mon, 12 Jun 2023 at 12:43, Phillip Henry <
>>>>>>>>>> londonjava...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi, folks.
>>>>>>>>>>>
>>>>>>>>>>> There currently seems to be a buzz around "data contracts". From
>>>>>>>>>>> what I can tell, these mainly advocate a cultural solution. But
>>>>>>>>>>> instead, could big data tools be used to enforce these contracts?
>>>>>>>>>>>
>>>>>>>>>>> My questions really are: are there any plans to implement data
>>>>>>>>>>> constraints in Spark (e.g., an integer must be between 0 and 100;
>>>>>>>>>>> the date in column X must be before that in column Y)? And if not,
>>>>>>>>>>> is there an appetite for them?
>>>>>>>>>>>
>>>>>>>>>>> Maybe we could associate constraints with schema metadata that
>>>>>>>>>>> are enforced in the implementation of a FileFormatDataWriter?
>>>>>>>>>>>
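>>>>>>>>>>> A very rough sketch of the metadata half of that idea (the column
>>>>>>>>>>> names and constraint strings are invented; enforcing this inside
>>>>>>>>>>> the writer itself is the part that would need changes to Spark):
>>>>>>>>>>>
>>>>>>>>>>>   import org.apache.spark.sql.DataFrame
>>>>>>>>>>>   import org.apache.spark.sql.functions.expr
>>>>>>>>>>>   import org.apache.spark.sql.types._
>>>>>>>>>>>
>>>>>>>>>>>   // Constraints carried alongside the schema as column metadata.
>>>>>>>>>>>   // A DataFrame created or read with this schema keeps the metadata.
>>>>>>>>>>>   val schema = StructType(Seq(
>>>>>>>>>>>     StructField("score", IntegerType, nullable = false,
>>>>>>>>>>>       new MetadataBuilder().putString("constraint", "score BETWEEN 0 AND 100").build()),
>>>>>>>>>>>     StructField("x", DateType, nullable = false,
>>>>>>>>>>>       new MetadataBuilder().putString("constraint", "x < y").build()),
>>>>>>>>>>>     StructField("y", DateType, nullable = false)))
>>>>>>>>>>>
>>>>>>>>>>>   // A check run just before the write pulls the constraints back
>>>>>>>>>>>   // out of the DataFrame's schema and refuses to persist bad data.
>>>>>>>>>>>   def enforce(df: DataFrame): Unit =
>>>>>>>>>>>     df.schema.fields
>>>>>>>>>>>       .filter(_.metadata.contains("constraint"))
>>>>>>>>>>>       .map(_.metadata.getString("constraint"))
>>>>>>>>>>>       .foreach(c => require(df.filter(expr(s"NOT ($c)")).isEmpty,
>>>>>>>>>>>                             s"contract violated: $c"))
>>>>>>>>>>>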
>>>>>>>>>>> Just throwing it out there and wondering what other people
>>>>>>>>>>> think. It's an area that interests me as it seems that over half my
>>>>>>>>>>> problems at the day job are because of dodgy data.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>>
>>>>>>>>>>> Phillip
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>
