Hi, Fokko and Deepak.

The problem with dbt and Great Expectations (and Soda too, I believe) is
that by the time they find a problem, the bad data is already in
production - and fixing production can be a nightmare.

What's more, we've found that nobody ever looks at the data quality reports
we already generate.

You can, of course, run dbt, Great Expectations, etc. as part of a CI/CD
pipeline, but it's usually against synthetic or, at best, sampled data
(laws like GDPR generally stop personal data from being anywhere but prod).

What I'm proposing is something that stops production data ever being
tainted.
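
To make this concrete, here's a rough, untested sketch of the kind of
gate I have in mind (plain DataFrame API, nothing that doesn't already
exist in Spark; the column names and constraint strings are made-up
examples): the whole production batch is checked at write time, and the
write is refused if anything violates the contract.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.expr

    object ContractGate {
      // Hypothetical contract: SQL predicates every row must satisfy.
      val constraints: Seq[String] = Seq(
        "percentage BETWEEN 0 AND 100",
        "start_date <= end_date"
      )

      // Write df only if every constraint holds; otherwise fail the job
      // so that bad data never lands in the production table.
      def validatedWrite(df: DataFrame, path: String): Unit = {
        constraints.foreach { c =>
          val violations = df.filter(!expr(c)).count()
          if (violations > 0)
            throw new IllegalStateException(
              s"$violations row(s) violate contract: $c")
        }
        df.write.mode("append").parquet(path)
      }
    }

The point is that the check runs against the real data, inside the write
path, rather than in a separate report that nobody reads.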

Hi, Elliot.

Nice to see you again (we worked together 20 years ago)!

The problem here is that a schema itself won't protect me (at least as I
understand your argument). For instance, I have medical records that say
some of my patients are 999 years old, which is clearly ridiculous, but
their age correctly conforms to an integer data type. I have other patients who
were discharged *before* they were admitted to hospital. I have 28 patients
out of literally millions who recently attended hospital but were
discharged on 1/1/1900. As you can imagine, this made the average length of
stay (a key metric for acute hospitals) much lower than it should have
been. It only came to light when some average lengths of stay were
negative!

In all these cases, the data faithfully adhered to the schema.
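
For what it's worth, here's a rough sketch of how such rules could
travel with the schema itself, using Spark's existing column metadata
(the "check" key and its enforcement are my own invention, not an
existing feature):

    import org.apache.spark.sql.types._

    // Hypothetical: attach a "check" expression to a column's metadata so
    // the rule lives with the schema rather than in a separate test suite.
    val schema = StructType(Seq(
      StructField("age", IntegerType, nullable = false,
        new MetadataBuilder()
          .putString("check", "age BETWEEN 0 AND 120").build()),
      StructField("admission_date", DateType, nullable = false),
      StructField("discharge_date", DateType, nullable = true,
        new MetadataBuilder()
          .putString("check", "discharge_date >= admission_date").build())
    ))

    // A writer (or a wrapper around it) could then collect and enforce
    // every "check" before any row is persisted.
    val checks = schema.fields.collect {
      case f if f.metadata.contains("check") => f.metadata.getString("check")
    }

The type system says 999 is a perfectly good integer; it's the "check"
that says it isn't a plausible age.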

Hi, Ryan.

This is an interesting point. There *should* indeed be a human connection
but often there isn't. For instance, I have a friend who complained that
his company's Zurich office made a breaking change without even being
aware that his London-based department existed, never mind that it
depended on their data. In large organisations, this is pretty common.

TBH, my proposal doesn't address this particular use case (maybe hooks and
metastore listeners would...?) But my point remains that although these
relationships should exist, in a sufficiently large organisation, they
generally don't. And maybe we can help fix that with code?
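
On the hooks/listeners point, here's a very rough, untested sketch of
what I mean, using Hive's MetaStoreEventListener (registered via
hive.metastore.event.listeners; the notifyConsumers part is purely
hypothetical):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.hive.metastore.MetaStoreEventListener
    import org.apache.hadoop.hive.metastore.events.AlterTableEvent

    // Hypothetical: flag any column change on ALTER TABLE so downstream
    // teams (the "London office") at least hear about it before it lands.
    class ContractChangeListener(conf: Configuration)
        extends MetaStoreEventListener(conf) {

      override def onAlterTable(event: AlterTableEvent): Unit = {
        val before = event.getOldTable.getSd.getCols
        val after  = event.getNewTable.getSd.getCols
        if (!before.equals(after))
          notifyConsumers(event.getNewTable.getTableName)
      }

      // Assumed hook: in practice this might post to a topic, open a
      // ticket, or even veto the change under a strict policy.
      private def notifyConsumers(table: String): Unit =
        println(s"Schema change detected on table $table")
    }

Even then, it only tells people about the change after the fact; it
doesn't create the relationship that should have existed in the first
place.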

Would love to hear further thoughts.

Regards,

Phillip





On Tue, Jun 13, 2023 at 8:17 AM Fokko Driesprong <fo...@apache.org> wrote:

> Hey Phillip,
>
> Thanks for raising this. I like the idea. The question is, should this be
> implemented in Spark or some other framework? I know that dbt has a fairly
> extensive way of testing your data
> <https://www.getdbt.com/product/data-testing/>, and making sure that you
> can enforce assumptions on the columns. The nice thing about dbt is that it
> is built from a software engineering perspective, so all the tests (or
> contracts) are living in version control. Using pull requests you could
> collaborate on changing the contract and making sure that the change has
> gotten enough attention before pushing it to production. Hope this helps!
>
> Kind regards,
> Fokko
>
> Op di 13 jun 2023 om 04:31 schreef Deepak Sharma <deepakmc...@gmail.com>:
>
>> Spark can be used with tools like great expectations as well to implement
>> the data contracts .
>> I am not sure though if spark alone can do the data contracts .
>> I was reading a blog on data mesh and how to glue it together with data
>> contracts , that’s where I came across this spark and great expectations
>> mention .
>>
>> HTH
>>
>> -Deepak
>>
>> On Tue, 13 Jun 2023 at 12:48 AM, Elliot West <tea...@gmail.com> wrote:
>>
>>> Hi Phillip,
>>>
>>> While not as fine-grained as your example, there do exist schema systems
>>> such as that in Avro that can evaluate compatible and incompatible
>>> changes to the schema, from the perspective of the reader, writer, or both.
>>> This provides some potential degree of enforcement, and means to
>>> communicate a contract. Interestingly I believe this approach has been
>>> applied to both JsonSchema and protobuf as part of the Confluent Schema
>>> registry.
>>>
>>> Elliot.
>>>
>>> On Mon, 12 Jun 2023 at 12:43, Phillip Henry <londonjava...@gmail.com>
>>> wrote:
>>>
>>>> Hi, folks.
>>>>
>>>> There currently seems to be a buzz around "data contracts". From what I
>>>> can tell, these mainly advocate a cultural solution. But instead, could big
>>>> data tools be used to enforce these contracts?
>>>>
>>>> My questions really are: are there any plans to implement data
>>>> constraints in Spark (eg, an integer must be between 0 and 100; the date in
>>>> column X must be before that in column Y)? And if not, is there an appetite
>>>> for them?
>>>>
>>>> Maybe we could associate constraints with schema metadata that are
>>>> enforced in the implementation of a FileFormatDataWriter?
>>>>
>>>> Just throwing it out there and wondering what other people think. It's
>>>> an area that interests me as it seems that over half my problems at the day
>>>> job are because of dodgy data.
>>>>
>>>> Regards,
>>>>
>>>> Phillip
>>>>
>>>>
