Hi all,

I recently had an experience where a streaming pipeline became "clogged"
because invalid data reached the final step of the pipeline and caused a
non-transient error when writing to my sink. Since the job is a streaming
job, the element (bundle) was retried indefinitely.

What options are there for getting out of this state when it occurs? I
attempted to add validation and update the streaming job to drop the bad
entity; although the update succeeded, I believe the bad entity had already
been checkpointed (?) further downstream in the pipeline. What then?
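
For context, this is roughly the validation step I was trying to add in
front of the sink. It's only a sketch: MyEntity, the entities PCollection,
and isValidAgainstSchema() are placeholders for my own types and checks. The
idea is to split valid and invalid elements with TupleTags so bad records go
to a dead-letter output instead of reaching the write:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

final TupleTag<MyEntity> validTag = new TupleTag<MyEntity>() {};
final TupleTag<MyEntity> invalidTag = new TupleTag<MyEntity>() {};

PCollectionTuple validated = entities.apply("Validate",
    ParDo.of(new DoFn<MyEntity, MyEntity>() {
      @ProcessElement
      public void processElement(@Element MyEntity entity, MultiOutputReceiver out) {
        if (isValidAgainstSchema(entity)) {    // placeholder schema check
          out.get(validTag).output(entity);    // continues on to the sink
        } else {
          out.get(invalidTag).output(entity);  // dead-letter output, never hits the sink
        }
      }
    }).withOutputTags(validTag, TupleTagList.of(invalidTag)));

PCollection<MyEntity> valid = validated.get(validTag);
PCollection<MyEntity> deadLetters = validated.get(invalidTag);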

And for something like a database schema that evolves over time, what is
the typical approach?

- Should pipelines mirror the DB schema and validate all data types within
the pipeline?
- Should all sinks implement a way to take non-transient failures out of
the retry loop and output them via a PCollectionTuple (as is done with
BigQuery failed inserts)? A sketch of what I mean is below.
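
To illustrate that second bullet, here is the BigQuery failed-inserts
pattern I had in mind, again only a sketch (the table spec is a placeholder
and I'm assuming the table already exists). With a retry policy that only
retries transient errors, permanently failing rows come back through the
WriteResult instead of blocking the pipeline:

import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryInsertError;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.values.PCollection;
import com.google.api.services.bigquery.model.TableRow;

// rows is a PCollection<TableRow> produced earlier in the pipeline.
WriteResult result = rows.apply("WriteToBQ",
    BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")  // placeholder table spec
        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withExtendedErrorInfo()
        // Only retry transient errors; permanent failures are emitted instead.
        .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors()));

// Rows that failed with a non-transient error, plus the error details,
// which could be written to a dead-letter table or logged.
PCollection<BigQueryInsertError> failedInserts = result.getFailedInsertsWithErr();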
