There are many projects in the Spark ecosystem — like Deequ and Great 
Expectations — that are focused on expressing and enforcing data quality checks.

In the more complex cases, these checks go beyond what a typical data source 
can enforce with built-in constraints (e.g. PK, FK, CHECK), so these projects 
have designed their own APIs for expressing various constraints, plus their own 
analyzers for figuring out how to run all the checks with as little repeated 
scanning or computation as possible.
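
For example, a rule like “no more than 1% of events may have a null user_id” 
has no PK/FK/CHECK equivalent; today you’d typically express it as a standalone 
validation query along these lines (table and column names made up for 
illustration):

from pyspark.sql import functions as F

# Illustrative names only; an aggregate quality rule that no
# PK/FK/CHECK constraint can express.
null_fraction = (
    spark.table("events")
    .select(F.avg(F.col("user_id").isNull().cast("double")).alias("null_frac"))
    .collect()[0]["null_frac"]
)
assert null_fraction <= 0.01, f"Too many null user_ids: {null_fraction:.2%}"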

It’s a bit forward-looking since this is still a proposal, but I wonder if the 
Pipelines API will eventually be able to address this kind of use case 
directly. I believe arbitrary data constraints can be expressed naturally as 
materialized views in a pipeline 
<https://nchammas.com/writing/query-language-constraint-language>, and a 
hypothetical Pipelines API could look something like this:

from pyspark.sql.functions import col, count

@pipelines.assertion
def sufficient_on_call_coverage():
    # Passes when at least one doctor is currently on call.
    return (
        spark.table("doctors")
        .where(col("on_call"))
        .select(count("*") >= 1)
    )

With visibility into the dependencies of a given assertion or constraint, a 
Pipeline could perhaps figure out how to enforce it without wasting resources 
rereading or recomputing data across nodes in the graph.
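
For example (again purely hypothetical, reusing the sketch above), two 
assertions could read from the same upstream materialized view; because both 
dependencies are visible in the graph, the pipeline could scan doctors once and 
share that intermediate node between the checks:

from pyspark.sql.functions import col, count

@pipelines.materialized_view  # hypothetical decorator, by analogy with the assertion one
def on_call_doctors():
    return spark.table("doctors").where(col("on_call"))

@pipelines.assertion
def sufficient_on_call_coverage():
    # Assuming pipeline datasets can be referenced by name.
    return spark.table("on_call_doctors").select(count("*") >= 1)

@pipelines.assertion
def on_call_doctors_have_pagers():
    # count("pager_id") skips nulls, so the counts match only if every
    # on-call doctor has a pager assigned.
    return spark.table("on_call_doctors").select(count("pager_id") == count("*"))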

Again, this is a bit forward-looking, but is this kind of idea “on topic” for 
the Pipelines API?

Nick



> On Apr 5, 2025, at 5:30 PM, Sandy Ryza <sa...@apache.org> wrote:
> 
> Hi all – starting a discussion thread for a SPIP that I've been working on 
> with Chao Sun, Kent Yao, Yuming Wang, and Jie Yang: [JIRA 
> <https://issues.apache.org/jira/browse/SPARK-51727>] [Doc 
> <https://docs.google.com/document/d/1PsSTngFuRVEOvUGzp_25CQL1yfzFHFr02XdMfQ7jOM4/edit?tab=t.0>].
> 
> The SPIP proposes extending Spark's lazy, declarative execution model beyond 
> single queries, to pipelines that keep multiple datasets up to date. It 
> introduces the ability to compose multiple transformations into a single 
> declarative dataflow graph.
> 
> Declarative pipelines aim to simplify the development and management of data 
> pipelines, by removing the need for manual orchestration of dependencies and 
> making it possible to catch many errors before any execution steps are 
> launched.
> 
> Declarative pipelines can include both batch and streaming computations, 
> leveraging Structured Streaming for stream processing and new materialized 
> view syntax for batch processing. Tight integration with Spark SQL's analyzer 
> enables deeper analysis and earlier error detection than is achievable with 
> more generic frameworks.
> 
> Let us know what you think!
> 
