Re: Adding Custom finalize method to RDDs.

2019-06-13 Thread Phillip Henry
If you control the codebase, you control when an RDD goes out of scope. Or
am I missing something?

(Note that finalize will not necessarily be executed when an object goes out
of scope, but only when the GC runs at some indeterminate point in the future.
Please avoid using finalize for the kind of task you're trying to do. It's
not what it was designed for. Better to pay more attention to
house-keeping in your own code.)
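
A minimal sketch in Scala of the house-keeping I mean; the clean-up callback
stands in for whatever your datasource needs to do (e.g. deleting its
temporary files):

    import org.apache.spark.rdd.RDD

    // Run `use` against the RDD and always run the clean-up afterwards,
    // whether or not `use` throws.
    def withTempData[T](rdd: RDD[T])(cleanUp: () => Unit)(use: RDD[T] => Unit): Unit =
      try {
        use(rdd)
      } finally {
        rdd.unpersist(blocking = false) // only relevant if it was cached
        cleanUp()                       // e.g. delete the datasource's temporary files
      }

The point being that the caller, not the garbage collector, decides when the
data is no longer needed.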



On Wed, Jun 12, 2019 at 9:11 PM Nasrulla Khan Haris <
nasrulla.k...@microsoft.com> wrote:

> We cannot control when an RDD goes out of scope, as that is handled by the
> JVM. Thus I am not sure try/finally will help.
>
> Thus I wanted some mechanism to clean up temporary data created by the RDD
> as soon as it goes out of scope.
>
>
>
> Any ideas ?
>
>
>
> Thanks,
>
> Nasrulla
>
>
>
> *From:* Phillip Henry 
> *Sent:* Tuesday, June 11, 2019 11:28 PM
> *To:* Nasrulla Khan Haris 
> *Cc:* Vinoo Ganesh ; dev@spark.apache.org
> *Subject:* Re: Adding Custom finalize method to RDDs.
>
>
>
> That's not the kind of thing a finalize method was ever supposed to do.
>
>
>
> Use a try/finally block instead.
>
>
>
> Phillip
>
>
>
>
>
> On Wed, 12 Jun 2019, 00:01 Nasrulla Khan Haris, <
> nasrulla.k...@microsoft.com.invalid> wrote:
>
> I want to delete some files which I created in my datasource API, as soon
> as the RDD is cleaned up.
>
>
>
> Thanks,
>
> Nasrulla
>
>
>
> *From:* Vinoo Ganesh 
> *Sent:* Monday, June 10, 2019 1:32 PM
> *To:* Nasrulla Khan Haris ;
> dev@spark.apache.org
> *Subject:* Re: Adding Custom finalize method to RDDs.
>
>
>
> Generally overriding the finalize() method is an antipattern (it was in
> fact deprecated in Java 11:
> https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/Object.html#finalize()).
> What's the use case here?
>
>
>
> *From: *Nasrulla Khan Haris 
> *Date: *Monday, June 10, 2019 at 15:44
> *To: *"dev@spark.apache.org" 
> *Subject: *RE: Adding Custom finalize method to RDDs.
>
>
>
> Hello Everyone,
>
> Is there a way to do it from user-code?
>
>
>
> Thanks,
>
> Nasrulla
>
>
>
> *From:* Nasrulla Khan Haris 
> *Sent:* Sunday, June 9, 2019 5:30 PM
> *To:* dev@spark.apache.org
> *Subject:* Adding Custom finalize method to RDDs.
>
>
>
> Hi All,
>
>
>
> Is there a way to add a custom finalize method to RDD objects, to add custom
> logic when RDDs are destroyed by the JVM?
>
>
>
> Thanks,
>
> Nasrulla
>
>
>
>


Re: Adding Custom finalize method to RDDs.

2019-06-12 Thread Phillip Henry
That's not the kind of thing a finalize method was ever supposed to do.

Use a try/finally block instead.

Phillip


On Wed, 12 Jun 2019, 00:01 Nasrulla Khan Haris,
 wrote:

> I want to delete some files which I created in my datasource API, as soon
> as the RDD is cleaned up.
>
>
>
> Thanks,
>
> Nasrulla
>
>
>
> *From:* Vinoo Ganesh 
> *Sent:* Monday, June 10, 2019 1:32 PM
> *To:* Nasrulla Khan Haris ;
> dev@spark.apache.org
> *Subject:* Re: Adding Custom finalize method to RDDs.
>
>
>
> Generally overriding the finalize() method is an antipattern (it was in
> fact deprecated in Java 11:
> https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/Object.html#finalize()).
> What’s the use case here?
>
>
>
> *From: *Nasrulla Khan Haris 
> *Date: *Monday, June 10, 2019 at 15:44
> *To: *"dev@spark.apache.org" 
> *Subject: *RE: Adding Custom finalize method to RDDs.
>
>
>
> Hello Everyone,
>
> Is there a way to do it from user-code?
>
>
>
> Thanks,
>
> Nasrulla
>
>
>
> *From:* Nasrulla Khan Haris 
> *Sent:* Sunday, June 9, 2019 5:30 PM
> *To:* dev@spark.apache.org
> *Subject:* Adding Custom finalize method to RDDs.
>
>
>
> Hi All,
>
>
>
> Is there a way to add a custom finalize method to RDD objects, to add custom
> logic when RDDs are destroyed by the JVM?
>
>
>
> Thanks,
>
> Nasrulla
>
>
>


Re: Hyperparameter Optimization via Randomization

2021-01-30 Thread Phillip Henry
Hi, Sean.

Perhaps I don't understand. As I see it, ParamGridBuilder builds an
Array[ParamMap]. What I am proposing is a new class that also builds an
Array[ParamMap] via its build() method, so there would be no "change in the
APIs". This new class would, of course, have methods that defined the
search space (log, linear, etc) over which random values were chosen.
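
To make it concrete, here is a rough sketch of the kind of thing I have in
mind (the names are illustrative only, not an existing API):

    import scala.util.Random
    import org.apache.spark.ml.param.{DoubleParam, ParamMap}

    // Illustrative only: builds n randomly sampled ParamMaps so the result can
    // be fed to CrossValidator/TrainValidationSplit exactly like the output of
    // ParamGridBuilder.build().
    class RandomParamBuilder(n: Int, seed: Long = 42L) {
      private val rng      = new Random(seed)
      private var samplers = Seq.empty[ParamMap => ParamMap]

      // uniform in linear space
      def addLinear(p: DoubleParam, lo: Double, hi: Double): this.type = {
        samplers :+= ((m: ParamMap) => m.put(p, lo + rng.nextDouble() * (hi - lo)))
        this
      }

      // uniform in logarithmic space, e.g. 1e-7 to 1e-4
      def addLog10(p: DoubleParam, lo: Double, hi: Double): this.type = {
        val (l, h) = (math.log10(lo), math.log10(hi))
        samplers :+= ((m: ParamMap) => m.put(p, math.pow(10, l + rng.nextDouble() * (h - l))))
        this
      }

      def build(): Array[ParamMap] =
        Array.fill(n)(samplers.foldLeft(ParamMap.empty)((m, f) => f(m)))
    }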

Now, if this is too trivial to warrant the work and people prefer Hyperopt,
then so be it. It might be useful for people not using Python but they can
just roll-their-own, I guess.

Anyway, looking forward to hearing what you think.

Regards,

Phillip



On Fri, Jan 29, 2021 at 4:18 PM Sean Owen  wrote:

> I think that's a bit orthogonal - right now you can't specify continuous
> spaces. The straightforward thing is to allow random sampling from a big
> grid. You can create a geometric series of values to try, of course -
> 0.001, 0.01, 0.1, etc.
> Yes I get that if you're randomly choosing, you can randomly choose from a
> continuous space of many kinds. I don't know if it helps a lot vs the
> change in APIs (and continuous spaces don't make as much sense for grid
> search)
> Of course it helps a lot if you're doing a smarter search over the space,
> like what hyperopt does. For that, I mean, one can just use hyperopt +
> Spark ML already if desired.
>
> On Fri, Jan 29, 2021 at 9:01 AM Phillip Henry 
> wrote:
>
>> Thanks, Sean! I hope to offer a PR next week.
>>
>> Not sure about a dependency on the grid search, though - but happy to
>> hear your thoughts. I mean, you might want to explore logarithmic space
>> evenly. For example,  something like "please search 1e-7 to 1e-4" leads to
>> a reasonably random sample being {3e-7, 2e-6, 9e-5}. These are (roughly)
>> evenly spaced in logarithmic space but not in linear space. So, saying what
>> fraction of a grid search to sample wouldn't make sense (unless the grid
>> was warped, of course).
>>
>> Does that make sense? It might be better for me to just write the code as
>> I don't think it would be very complicated.
>>
>> Happy to hear your thoughts.
>>
>> Phillip
>>
>>
>>
>> On Fri, Jan 29, 2021 at 1:47 PM Sean Owen  wrote:
>>
>>> I don't know of anyone working on that. Yes I think it could be useful.
>>> I think it might be easiest to implement by simply having some parameter to
>>> the grid search process that says what fraction of all possible
>>> combinations you want to randomly test.
>>>
>>> On Fri, Jan 29, 2021 at 5:52 AM Phillip Henry 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have no work at the moment so I was wondering if anybody would be
>>>> interested in me contributing code that generates an Array[ParamMap] for
>>>> random hyperparameters?
>>>>
>>>> Apparently, this technique can find a hyperparameter in the top 5% of
>>>> parameter space in fewer than 60 iterations with 95% confidence [1].
>>>>
>>>> I notice that the Spark code base has only the brute force
>>>> ParamGridBuilder unless I am missing something.
>>>>
>>>> Hyperparameter optimization is an area of interest to me but I don't
>>>> want to re-invent the wheel. So, if this work is already underway or there
>>>> are libraries out there to do it please let me know and I'll shut up :)
>>>>
>>>> Regards,
>>>>
>>>> Phillip
>>>>
>>>> [1]
>>>> https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
>>>>
>>>


Hyperparameter Optimization via Randomization

2021-01-29 Thread Phillip Henry
Hi,

I have no work at the moment so I was wondering if anybody would be
interested in me contributing code that generates an Array[ParamMap] for
random hyperparameters?

Apparently, this technique can find a hyperparameter in the top 5% of
parameter space in fewer than 60 iterations with 95% confidence [1].
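
(For what it's worth, the arithmetic behind that claim, assuming independent
draws: each random sample misses the top 5% of the space with probability
0.95, so the probability that at least one of 60 samples lands in it is
1 - 0.95^60 ≈ 0.95.)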

I notice that the Spark code base has only the brute force ParamGridBuilder
unless I am missing something.

Hyperparameter optimization is an area of interest to me but I don't want
to re-invent the wheel. So, if this work is already underway or there are
libraries out there to do it please let me know and I'll shut up :)

Regards,

Phillip

[1]
https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html


Re: Hyperparameter Optimization via Randomization

2021-02-08 Thread Phillip Henry
Hi, Sean.

I don't think sampling from a grid is a good idea as the min/max may lie
between grid points. Unconstrained random sampling avoids this problem. To
this end, I have an implementation at:

https://github.com/apache/spark/compare/master...PhillHenry:master

It is unit tested and does not change any already existing code.

Totally get what you mean about Hyperopt but this is a pure JVM solution
that's fairly straightforward.
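
On the "no change to existing code" point: whatever builds the Array[ParamMap],
it slots straight into the current tuning API. A sketch (with a placeholder
array rather than the code at the link above):

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.param.ParamMap
    import org.apache.spark.ml.tuning.CrossValidator

    // randomMaps stands in for the randomly generated Array[ParamMap]
    val randomMaps: Array[ParamMap] = Array(ParamMap.empty)

    val cv = new CrossValidator()
      .setEstimator(new LogisticRegression())
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(randomMaps) // same slot ParamGridBuilder.build() fills today
      .setNumFolds(3)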

Is it worth contributing?

Thanks,

Phillip





On Sat, Jan 30, 2021 at 2:00 PM Sean Owen  wrote:

> I was thinking ParamGridBuilder would have to change to accommodate a
> continuous range of values, and that's not hard, though other code wouldn't
> understand that type of value, like the existing simple grid builder.
> It's all possible just wondering if simply randomly sampling the grid is
> enough. That would be a simpler change, just a new method or argument.
>
> Yes part of it is that if you really want to search continuous spaces,
> hyperopt is probably even better, so how much do you want to put into
> Pyspark - something really simple sure.
> Not out of the question to do something more complex if it turns out to
> also be pretty simple.
>
> On Sat, Jan 30, 2021 at 4:42 AM Phillip Henry 
> wrote:
>
>> Hi, Sean.
>>
>> Perhaps I don't understand. As I see it, ParamGridBuilder builds an
>> Array[ParamMap]. What I am proposing is a new class that also builds an
>> Array[ParamMap] via its build() method, so there would be no "change in the
>> APIs". This new class would, of course, have methods that defined the
>> search space (log, linear, etc) over which random values were chosen.
>>
>> Now, if this is too trivial to warrant the work and people prefer
>> Hyperopt, then so be it. It might be useful for people not using Python but
>> they can just roll-their-own, I guess.
>>
>> Anyway, looking forward to hearing what you think.
>>
>> Regards,
>>
>> Phillip
>>
>>
>>
>> On Fri, Jan 29, 2021 at 4:18 PM Sean Owen  wrote:
>>
>>> I think that's a bit orthogonal - right now you can't specify continuous
>>> spaces. The straightforward thing is to allow random sampling from a big
>>> grid. You can create a geometric series of values to try, of course -
>>> 0.001, 0.01, 0.1, etc.
>>> Yes I get that if you're randomly choosing, you can randomly choose from
>>> a continuous space of many kinds. I don't know if it helps a lot vs the
>>> change in APIs (and continuous spaces don't make as much sense for grid
>>> search)
>>> Of course it helps a lot if you're doing a smarter search over the
>>> space, like what hyperopt does. For that, I mean, one can just use
>>> hyperopt + Spark ML already if desired.
>>>
>>> On Fri, Jan 29, 2021 at 9:01 AM Phillip Henry 
>>> wrote:
>>>
>>>> Thanks, Sean! I hope to offer a PR next week.
>>>>
>>>> Not sure about a dependency on the grid search, though - but happy to
>>>> hear your thoughts. I mean, you might want to explore logarithmic space
>>>> evenly. For example,  something like "please search 1e-7 to 1e-4" leads to
>>>> a reasonably random sample being {3e-7, 2e-6, 9e-5}. These are (roughly)
>>>> evenly spaced in logarithmic space but not in linear space. So, saying what
>>>> fraction of a grid search to sample wouldn't make sense (unless the grid
>>>> was warped, of course).
>>>>
>>>> Does that make sense? It might be better for me to just write the code
>>>> as I don't think it would be very complicated.
>>>>
>>>> Happy to hear your thoughts.
>>>>
>>>> Phillip
>>>>
>>>>
>>>>
>>>> On Fri, Jan 29, 2021 at 1:47 PM Sean Owen  wrote:
>>>>
>>>>> I don't know of anyone working on that. Yes I think it could be
>>>>> useful. I think it might be easiest to implement by simply having some
>>>>> parameter to the grid search process that says what fraction of all
>>>>> possible combinations you want to randomly test.
>>>>>
>>>>> On Fri, Jan 29, 2021 at 5:52 AM Phillip Henry 
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have no work at the moment so I was wondering if anybody would be
>>>>>> interested in me contributing code that generates an Array[ParamMap] for
>>>>>> random hyperparameters?
>>>>>>
>>>>>> Apparently, this technique can find a hyperparameter in the top 5% of
>>>>>> parameter space in fewer than 60 iterations with 95% confidence [1].
>>>>>>
>>>>>> I notice that the Spark code base has only the brute force
>>>>>> ParamGridBuilder unless I am missing something.
>>>>>>
>>>>>> Hyperparameter optimization is an area of interest to me but I don't
>>>>>> want to re-invent the wheel. So, if this work is already underway or 
>>>>>> there
>>>>>> are libraries out there to do it please let me know and I'll shut up :)
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Phillip
>>>>>>
>>>>>> [1]
>>>>>> https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
>>>>>>
>>>>>


Re: Hyperparameter Optimization via Randomization

2021-02-09 Thread Phillip Henry
Hi, Sean.

I've added a comment in the new class to suggest a look at Hyperopt etc if
the user is using Python.

Anyway I've created a pull request:

https://github.com/apache/spark/pull/31535

and all tests, style checks etc pass. Wish me luck :)

And thanks for the support :)

Phillip



On Mon, Feb 8, 2021 at 4:12 PM Sean Owen  wrote:

> It seems pretty reasonable to me. If it's a pull request we can code
> review it.
> My only question is just, would it be better to tell people to use
> hyperopt, and how much better is this than implementing randomization on
> the grid.
> But the API change isn't significant so maybe just fine.
>
> On Mon, Feb 8, 2021 at 3:49 AM Phillip Henry 
> wrote:
>
>> Hi, Sean.
>>
>> I don't think sampling from a grid is a good idea as the min/max may lie
>> between grid points. Unconstrained random sampling avoids this problem. To
>> this end, I have an implementation at:
>>
>> https://github.com/apache/spark/compare/master...PhillHenry:master
>>
>> It is unit tested and does not change any already existing code.
>>
>> Totally get what you mean about Hyperopt but this is a pure JVM solution
>> that's fairly straightforward.
>>
>> Is it worth contributing?
>>
>> Thanks,
>>
>> Phillip
>>
>>
>>
>>
>>
>> On Sat, Jan 30, 2021 at 2:00 PM Sean Owen  wrote:
>>
>>> I was thinking ParamGridBuilder would have to change to accommodate a
>>> continuous range of values, and that's not hard, though other code wouldn't
>>> understand that type of value, like the existing simple grid builder.
>>> It's all possible just wondering if simply randomly sampling the grid is
>>> enough. That would be a simpler change, just a new method or argument.
>>>
>>> Yes part of it is that if you really want to search continuous spaces,
>>> hyperopt is probably even better, so how much do you want to put into
>>> Pyspark - something really simple sure.
>>> Not out of the question to do something more complex if it turns out to
>>> also be pretty simple.
>>>
>>> On Sat, Jan 30, 2021 at 4:42 AM Phillip Henry 
>>> wrote:
>>>
>>>> Hi, Sean.
>>>>
>>>> Perhaps I don't understand. As I see it, ParamGridBuilder builds an
>>>> Array[ParamMap]. What I am proposing is a new class that also builds an
>>>> Array[ParamMap] via its build() method, so there would be no "change in the
>>>> APIs". This new class would, of course, have methods that defined the
>>>> search space (log, linear, etc) over which random values were chosen.
>>>>
>>>> Now, if this is too trivial to warrant the work and people prefer
>>>> Hyperopt, then so be it. It might be useful for people not using Python but
>>>> they can just roll-their-own, I guess.
>>>>
>>>> Anyway, looking forward to hearing what you think.
>>>>
>>>> Regards,
>>>>
>>>> Phillip
>>>>
>>>>
>>>>
>>>> On Fri, Jan 29, 2021 at 4:18 PM Sean Owen  wrote:
>>>>
>>>>> I think that's a bit orthogonal - right now you can't specify
>>>>> continuous spaces. The straightforward thing is to allow random sampling
>>>>> from a big grid. You can create a geometric series of values to try, of
>>>>> course - 0.001, 0.01, 0.1, etc.
>>>>> Yes I get that if you're randomly choosing, you can randomly choose
>>>>> from a continuous space of many kinds. I don't know if it helps a lot vs
>>>>> the change in APIs (and continuous spaces don't make as much sense for 
>>>>> grid
>>>>> search)
>>>>> Of course it helps a lot if you're doing a smarter search over the
>>>>> space, like what hyperopt does. For that, I mean, one can just use
>>>>> hyperopt + Spark ML already if desired.
>>>>>
>>>>> On Fri, Jan 29, 2021 at 9:01 AM Phillip Henry 
>>>>> wrote:
>>>>>
>>>>>> Thanks, Sean! I hope to offer a PR next week.
>>>>>>
>>>>>> Not sure about a dependency on the grid search, though - but happy to
>>>>>> hear your thoughts. I mean, you might want to explore logarithmic space
>>>>>> evenly. For example,  something like "please search 1e-7 to 1e-4" leads 
>>>>>> to
>>>>>> a reasonably random sample being {3e-7, 2e-6, 9e-5}. These are (roughly)

Re: Hyperparameter Optimization via Randomization

2021-01-29 Thread Phillip Henry
Thanks, Sean! I hope to offer a PR next week.

Not sure about a dependency on the grid search, though - but happy to hear
your thoughts. I mean, you might want to explore logarithmic space evenly.
For example,  something like "please search 1e-7 to 1e-4" leads to a
reasonably random sample being {3e-7, 2e-6, 9e-5}. These are (roughly)
evenly spaced in logarithmic space but not in linear space. So, saying what
fraction of a grid search to sample wouldn't make sense (unless the grid
was warped, of course).
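
Concretely, the log-space sampling amounts to something like this sketch:

    // uniform on a log10 scale between 1e-7 and 1e-4
    val rng    = new scala.util.Random()
    val sample = math.pow(10, -7 + rng.nextDouble() * 3)

so each decade (1e-7 to 1e-6, 1e-6 to 1e-5, ...) is equally likely, which is
what "evenly spaced in logarithmic space" means here, and why a
fraction-of-the-grid parameter wouldn't map onto it cleanly.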

Does that make sense? It might be better for me to just write the code as I
don't think it would be very complicated.

Happy to hear your thoughts.

Phillip



On Fri, Jan 29, 2021 at 1:47 PM Sean Owen  wrote:

> I don't know of anyone working on that. Yes I think it could be useful. I
> think it might be easiest to implement by simply having some parameter to
> the grid search process that says what fraction of all possible
> combinations you want to randomly test.
>
> On Fri, Jan 29, 2021 at 5:52 AM Phillip Henry 
> wrote:
>
>> Hi,
>>
>> I have no work at the moment so I was wondering if anybody would be
>> interested in me contributing code that generates an Array[ParamMap] for
>> random hyperparameters?
>>
>> Apparently, this technique can find a hyperparameter in the top 5% of
>> parameter space in fewer than 60 iterations with 95% confidence [1].
>>
>> I notice that the Spark code base has only the brute force
>> ParamGridBuilder unless I am missing something.
>>
>> Hyperparameter optimization is an area of interest to me but I don't want
>> to re-invent the wheel. So, if this work is already underway or there are
>> libraries out there to do it please let me know and I'll shut up :)
>>
>> Regards,
>>
>> Phillip
>>
>> [1]
>> https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
>>
>


K8s integration test failure ("credentials Jenkins is using is probably wrong...")

2021-02-23 Thread Phillip Henry
Hi,

Silly question: the Jenkins build for my PR is failing but it seems outside
of my control. What must I do to remedy this?

I've submitted

https://github.com/apache/spark/pull/31535

but Spark QA is telling me "Kubernetes integration test status failure".

The Jenkins job says "SUCCESS" but also barfs with:

FileNotFoundException means that the credentials Jenkins is using is
probably wrong. Or the user account does not have write access to the
repo.


See
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39934/consoleFull

Can anybody please advise?

Thanks in advance.

Phillip


Log likelihood in GeneralizedLinearRegression

2022-01-22 Thread Phillip Henry
Hi,

As far as I know, there is no function to generate the log likelihood from
a GeneralizedLinearRegression model. Are there any plans to implement one?

I've coded my own in PySpark and in testing it agrees with the values we
get from the Python library StatsModels to one part in a million. It's
kinda yucky code as it relies on some inefficient UDFs but I could port it
to Scala.
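
(For the Gaussian family, for example, no UDFs are needed at all. A minimal
Scala sketch, not the PySpark code mentioned above, assuming a predictions
DataFrame with "label" and "prediction" columns and the MLE variance estimate:)

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, pow, sum}

    // Maximised log-likelihood for a Gaussian GLM, taking the MLE of the
    // variance, sigma^2 = RSS / n:
    //   logL = -n/2 * (ln(2 * pi) + ln(RSS / n) + 1)
    def gaussianLogLikelihood(predictions: DataFrame): Double = {
      val n = predictions.count().toDouble
      val rss = predictions
        .select(sum(pow(col("label") - col("prediction"), 2)))
        .head().getDouble(0)
      -n / 2.0 * (math.log(2 * math.Pi) + math.log(rss / n) + 1.0)
    }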

Would anybody be interested in me raising a PR and coding an efficient
Scala implementation that can be called from PySpark?

Regards,

Phillip


SPARK-24156: Kafka messages left behind in Spark Structured Streaming

2023-10-19 Thread Phillip Henry
Hi, folks,

A few years ago, I asked about SSS not processing the final batch left on a
Kafka topic when using groupBy, OutputMode.Append and withWatermark.

At the time, Jungtaek Lim kindly pointed out (27/7/20) that this was
expected behaviour, that (if I have this correct) a message needs to arrive
to trigger Spark to write the lingering batch. The solution was "to add a
dummy record to move [the] watermark forward."
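
(For context, a minimal sketch of the kind of query affected; the broker and
topic names are placeholders:)

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, window}

    val spark = SparkSession.builder.getOrCreate()

    val counts = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
      .option("subscribe", "events")                       // placeholder topic
      .load()
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window(col("timestamp"), "5 minutes"))
      .count()

    // In Append mode a window is only emitted once the watermark passes its
    // end, and the watermark only advances when a later record arrives -
    // hence the lingering final batch.
    counts.writeStream
      .outputMode("append")
      .format("console")
      .start()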

Looking at the comments in SPARK-24156, it seems people still find this
unintuitive. Would there be an appetite to address this, to ensure no
messages are left behind? Or is it a Sisyphean task whose complexity I
don't appreciate?

Regards,

Phillip


Data Contracts

2023-06-12 Thread Phillip Henry
Hi, folks.

There currently seems to be a buzz around "data contracts". From what I can
tell, these mainly advocate a cultural solution. But instead, could big
data tools be used to enforce these contracts?

My questions really are: are there any plans to implement data constraints
in Spark (eg, an integer must be between 0 and 100; the date in column X
must be before that in column Y)? And if not, is there an appetite for them?

Maybe we could associate constraints with schema metadata that are enforced
in the implementation of a FileFormatDataWriter?
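
To illustrate the two example constraints, this is the kind of check I mean,
expressed today as a pre-write step rather than inside the writer (the column
names are made up):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col

    // Throw rather than write tainted data. "score", "date_x" and "date_y"
    // are placeholder column names.
    def enforceContract(df: DataFrame): DataFrame = {
      val violations = df.filter(
        !(col("score").between(0, 100) && col("date_x") < col("date_y"))
      )
      if (!violations.isEmpty)
        throw new IllegalStateException(
          s"Data contract violated by ${violations.count()} row(s)")
      df
    }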

Just throwing it out there and wondering what other people think. It's an
area that interests me as it seems that over half my problems at the day
job are because of dodgy data.

Regards,

Phillip


Re: Data Contracts

2023-06-13 Thread Phillip Henry
Hi, Fokko and Deepak.

The problem with DBT and Great Expectations (and Soda too, I believe) is
that by the time they find the problem, the error is already in production
- and fixing production can be a nightmare.

What's more, we've found that nobody ever looks at the data quality reports
we already generate.

You can, of course, run DBT, GT etc as part of a CI/CD pipeline but it's
usually against synthetic or at best sampled data (laws like GDPR generally
stop personal information data being anywhere but prod).

What I'm proposing is something that stops production data ever being
tainted.

Hi, Elliot.

Nice to see you again (we worked together 20 years ago)!

The problem here is that a schema itself won't protect me (at least as I
understand your argument). For instance, I have medical records that say
some of my patients are 999 years old which is clearly ridiculous but their
age correctly conforms to an integer data type. I have other patients who
were discharged *before* they were admitted to hospital. I have 28 patients
out of literally millions who recently attended hospital but were
discharged on 1/1/1900. As you can imagine, this made the average length of
stay (a key metric for acute hospitals) much lower than it should have
been. It only came to light when some average lengths of stay were
negative!

In all these cases, the data faithfully adhered to the schema.

Hi, Ryan.

This is an interesting point. There *should* indeed be a human connection
but often there isn't. For instance, I have a friend who complained that
his company's Zurich office made a breaking change and was not even aware
that his London based department existed, never mind depended on their
data. In large organisations, this is pretty common.

TBH, my proposal doesn't address this particular use case (maybe hooks and
metastore listeners would...?) But my point remains that although these
relationships should exist, in a sufficiently large organisation, they
generally don't. And maybe we can help fix that with code?

Would love to hear further thoughts.

Regards,

Phillip





On Tue, Jun 13, 2023 at 8:17 AM Fokko Driesprong  wrote:

> Hey Phillip,
>
> Thanks for raising this. I like the idea. The question is, should this be
> implemented in Spark or some other framework? I know that dbt has a fairly
> extensive way of testing your data
> <https://www.getdbt.com/product/data-testing/>, and making sure that you
> can enforce assumptions on the columns. The nice thing about dbt is that it
> is built from a software engineering perspective, so all the tests (or
> contracts) are living in version control. Using pull requests you could
> collaborate on changing the contract and making sure that the change has
> gotten enough attention before pushing it to production. Hope this helps!
>
> Kind regards,
> Fokko
>
> On Tue, 13 Jun 2023 at 04:31, Deepak Sharma  wrote:
>
>> Spark can be used with tools like Great Expectations as well to implement
>> the data contracts.
>> I am not sure though if Spark alone can do the data contracts.
>> I was reading a blog on data mesh and how to glue it together with data
>> contracts; that’s where I came across this Spark and Great Expectations
>> mention.
>>
>> HTH
>>
>> -Deepak
>>
>> On Tue, 13 Jun 2023 at 12:48 AM, Elliot West  wrote:
>>
>>> Hi Phillip,
>>>
>>> While not as fine-grained as your example, there do exist schema systems
>>> such as that in Avro that can evaluate compatible and incompatible
>>> changes to the schema, from the perspective of the reader, writer, or both.
>>> This provides some potential degree of enforcement, and means to
>>> communicate a contract. Interestingly I believe this approach has been
>>> applied to both JsonSchema and protobuf as part of the Confluent Schema
>>> registry.
>>>
>>> Elliot.
>>>
>>> On Mon, 12 Jun 2023 at 12:43, Phillip Henry 
>>> wrote:
>>>
>>>> Hi, folks.
>>>>
>>>> There currently seems to be a buzz around "data contracts". From what I
>>>> can tell, these mainly advocate a cultural solution. But instead, could big
>>>> data tools be used to enforce these contracts?
>>>>
>>>> My questions really are: are there any plans to implement data
>>>> constraints in Spark (eg, an integer must be between 0 and 100; the date in
>>>> column X must be before that in column Y)? And if not, is there an appetite
>>>> for them?
>>>>
>>>> Maybe we could associate constraints with schema metadata that are
>>>> enforced in the implementation of a FileFormatDataWriter?
>>>>
>>>> Just throwing it out there and wondering what other people think. It's
>>>> an area that interests me as it seems that over half my problems at the day
>>>> job are because of dodgy data.
>>>>
>>>> Regards,
>>>>
>>>> Phillip
>>>>
>>>>


Re: Data Contracts

2023-07-16 Thread Phillip Henry
No worries. Have you had a chance to look at it?

Since this thread has gone dead, I assume there is no appetite for adding
data contract functionality...?

Regards,

Phillip


On Mon, 19 Jun 2023, 11:23 Deepak Sharma,  wrote:

> Sorry for using "simple" in my last email.
> It’s not gonna be simple in any terms.
> Thanks for sharing the Git link, Phillip.
> Will definitely go through it.
>
> Thanks
> Deepak
>
> On Mon, 19 Jun 2023 at 3:47 PM, Phillip Henry 
> wrote:
>
>> I think it might be a bit more complicated than this (but happy to be
>> proved wrong).
>>
>> I have a minimum working example at:
>>
>> https://github.com/PhillHenry/SparkConstraints.git
>>
>> that runs out-of-the-box (mvn test) and demonstrates what I am trying to
>> achieve.
>>
>> A test persists a DataFrame that conforms to the contract and
>> demonstrates that one that does not, throws an Exception.
>>
>> I've had to slightly modify 3 Spark files to add the data contract
>> functionality. If you can think of a more elegant solution, I'd be very
>> grateful.
>>
>> Regards,
>>
>> Phillip
>>
>>
>>
>>
>> On Mon, Jun 19, 2023 at 9:37 AM Deepak Sharma 
>> wrote:
>>
>>> It can be as simple as adding a function to the Spark session builder,
>>> specifically on the read, which can take the YAML file (a definition of the
>>> data contracts in YAML) and apply it to the data frame.
>>> It can ignore the rows not matching the data contracts defined in the
>>> YAML.
>>>
>>> Thanks
>>> Deepak
>>>
>>> On Mon, 19 Jun 2023 at 1:49 PM, Phillip Henry 
>>> wrote:
>>>
>>>> For my part, I'm not too concerned about the mechanism used to
>>>> implement the validation as long as it's rich enough to express the
>>>> constraints.
>>>>
>>>> I took a look at JSON Schemas (for which there are a number of JVM
>>>> implementations) but I don't think it can handle more complex data types
>>>> like dates. Maybe Elliot can comment on this?
>>>>
>>>> Ideally, *any* reasonable mechanism could be plugged in.
>>>>
>>>> But what struck me from trying to write a Proof of Concept was that it
>>>> was quite hard to inject my code into this particular area of the Spark
>>>> machinery. It could very well be due to my limited understanding of the
>>>> codebase, but it seemed the Spark code would need a bit of a refactor
>>>> before a component could be injected. Maybe people in this forum with
>>>> greater knowledge in this area could comment?
>>>>
>>>> BTW, it's interesting to see that Databricks' "Delta Live Tables"
>>>> appears to be attempting to implement data contracts within their ecosystem.
>>>> Unfortunately, I think it's closed source and Python only.
>>>>
>>>> Regards,
>>>>
>>>> Phillip
>>>>
>>>> On Sat, Jun 17, 2023 at 11:06 AM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> It would be interesting if we think about creating a contract
>>>>> validation library written in JSON format. This would ensure a validation
>>>>> mechanism that will rely on this library and could be shared among 
>>>>> relevant
>>>>> parties. Will that be a starting point?
>>>>>
>>>>> HTH
>>>>>
>>>>> Mich Talebzadeh,
>>>>> Lead Solutions Architect/Engineering Lead
>>>>> Palantir Technologies Limited
>>>>> London
>>>>> United Kingdom
>>>>>
>>>>>
>>>>>view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, 14 Jun 2023 at 11:13, Jean-

Re: Data Contracts

2023-06-19 Thread Phillip Henry
I think it might be a bit more complicated than this (but happy to be
proved wrong).

I have a minimum working example at:

https://github.com/PhillHenry/SparkConstraints.git

that runs out-of-the-box (mvn test) and demonstrates what I am trying to
achieve.

A test persists a DataFrame that conforms to the contract and demonstrates
that one that does not, throws an Exception.

I've had to slightly modify 3 Spark files to add the data contract
functionality. If you can think of a more elegant solution, I'd be very
grateful.

Regards,

Phillip




On Mon, Jun 19, 2023 at 9:37 AM Deepak Sharma  wrote:

> It can be as simple as adding a function to the Spark session builder,
> specifically on the read, which can take the YAML file (a definition of the
> data contracts in YAML) and apply it to the data frame.
> It can ignore the rows not matching the data contracts defined in the
> YAML.
>
> Thanks
> Deepak
>
> On Mon, 19 Jun 2023 at 1:49 PM, Phillip Henry 
> wrote:
>
>> For my part, I'm not too concerned about the mechanism used to implement
>> the validation as long as it's rich enough to express the constraints.
>>
>> I took a look at JSON Schemas (for which there are a number of JVM
>> implementations) but I don't think it can handle more complex data types
>> like dates. Maybe Elliot can comment on this?
>>
>> Ideally, *any* reasonable mechanism could be plugged in.
>>
>> But what struck me from trying to write a Proof of Concept was that it
>> was quite hard to inject my code into this particular area of the Spark
>> machinery. It could very well be due to my limited understanding of the
>> codebase, but it seemed the Spark code would need a bit of a refactor
>> before a component could be injected. Maybe people in this forum with
>> greater knowledge in this area could comment?
>>
>> BTW, it's interesting to see that Databricks' "Delta Live Tables" appears
>> to be attempting to implement data contracts within their ecosystem.
>> Unfortunately, I think it's closed source and Python only.
>>
>> Regards,
>>
>> Phillip
>>
>> On Sat, Jun 17, 2023 at 11:06 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> It would be interesting if we think about creating a contract validation
>>> library written in JSON format. This would ensure a validation mechanism
>>> that will rely on this library and could be shared among relevant parties.
>>> Will that be a starting point?
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Lead Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Wed, 14 Jun 2023 at 11:13, Jean-Georges Perrin  wrote:
>>>
>>>> Hi,
>>>>
>>>> While I was at PayPal, we open sourced a template of Data Contract, it
>>>> is here: https://github.com/paypal/data-contract-template. Companies
>>>> like GX (Great Expectations) are interested in using it.
>>>>
>>>> Spark could read some elements from it pretty easily, like schema
>>>> validation, some rules validations. Spark could also generate an embryo of
>>>> data contracts…
>>>>
>>>> —jgp
>>>>
>>>>
>>>> On Jun 13, 2023, at 07:25, Mich Talebzadeh 
>>>> wrote:
>>>>
>>>> From my limited understanding of data contracts, there are two factors
>>>> that are deemed necessary.
>>>>
>>>>
>>>>1. procedure matter
>>>>2. technical matter
>>>>
>>>> I mean this is nothing new. Some tools like Cloud data fusion can
>>>> assist when the procedures are validated. Simply "The process of
>>>> integrating multiple data sources to produce more consistent, accurate, and
>>>> useful information than that provided by any individual data source."

Re: Data Contracts

2023-06-19 Thread Phillip Henry
For my part, I'm not too concerned about the mechanism used to implement
the validation as long as it's rich enough to express the constraints.

I took a look at JSON Schemas (for which there are a number of JVM
implementations) but I don't think it can handle more complex data types
like dates. Maybe Elliot can comment on this?

Ideally, *any* reasonable mechanism could be plugged in.

But what struck me from trying to write a Proof of Concept was that it was
quite hard to inject my code into this particular area of the Spark
machinery. It could very well be due to my limited understanding of the
codebase, but it seemed the Spark code would need a bit of a refactor
before a component could be injected. Maybe people in this forum with
greater knowledge in this area could comment?

BTW, it's interesting to see that Databricks' "Delta Live Tables" appears to
be attempting to implement data contracts within their ecosystem.
Unfortunately, I think it's closed source and Python only.

Regards,

Phillip

On Sat, Jun 17, 2023 at 11:06 AM Mich Talebzadeh 
wrote:

> It would be interesting if we think about creating a contract validation
> library written in JSON format. This would ensure a validation mechanism
> that will rely on this library and could be shared among relevant parties.
> Will that be a starting point?
>
> HTH
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 14 Jun 2023 at 11:13, Jean-Georges Perrin  wrote:
>
>> Hi,
>>
>> While I was at PayPal, we open sourced a template of Data Contract, it is
>> here: https://github.com/paypal/data-contract-template. Companies like
>> GX (Great Expectations) are interested in using it.
>>
>> Spark could read some elements from it pretty easily, like schema
>> validation, some rules validations. Spark could also generate an embryo of
>> data contracts…
>>
>> —jgp
>>
>>
>> On Jun 13, 2023, at 07:25, Mich Talebzadeh 
>> wrote:
>>
>> From my limited understanding of data contracts, there are two factors
>> that are deemed necessary.
>>
>>
>>1. procedure matter
>>2. technical matter
>>
>> I mean this is nothing new. Some tools like Cloud data fusion can assist
>> when the procedures are validated. Simply "The process of integrating
>> multiple data sources to produce more consistent, accurate, and useful
>> information than that provided by any individual data source.". In the old
>> time, we had staging tables that were used to clean and prune data from
>> multiple sources. Nowadays we use the so-called Integration layer. If you
>> use Spark as an ETL tool, then you have to build this validation yourself.
>> Case in point, how to map customer_id from one source to customer_no from
>> another. Legacy systems are full of these anomalies. MDM can help but
>> requires human intervention which is time consuming. I am not sure the role
>> of Spark here except being able to read the mapping tables.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>> London
>> United Kingdom
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 13 Jun 2023 at 10:01, Phillip Henry 
>> wrote:
>>
>>> Hi, Fokko and Deepak.
>>>
>>> The problem with DBT and Great Expectations (and Soda too, I believe) is
>>> that by the time they find the problem, the error is already in production
>>> - and fixing production can be a nightmare.
>>>
>>>