Hi all,

I recently had an experience where a streaming pipeline became "clogged"
because invalid data reached the final step of the pipeline and caused a
non-transient error when writing to my sink. Since the job is a streaming
job, the element (bundle) was retried indefinitely.

What options are there for getting out of this state when it occurs? I
attempted to add validation and update the streaming job to drop the bad
entity; although the update succeeded, I believe the bad entity had already
been checkpointed (?) further downstream in the pipeline. What then?
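
For context, this is roughly the validation step I was trying to add in
front of the sink. It's only a sketch: MyEntity, the entities PCollection,
and isValidAgainstSchema() are placeholders for my own types and checks. The
idea is to split valid and invalid elements with TupleTags so bad records go
to a dead-letter output instead of reaching the write:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

final TupleTag<MyEntity> validTag = new TupleTag<MyEntity>() {};
final TupleTag<MyEntity> invalidTag = new TupleTag<MyEntity>() {};

PCollectionTuple validated = entities.apply("Validate",
    ParDo.of(new DoFn<MyEntity, MyEntity>() {
      @ProcessElement
      public void processElement(@Element MyEntity entity, MultiOutputReceiver out) {
        if (isValidAgainstSchema(entity)) {    // placeholder schema check
          out.get(validTag).output(entity);    // continues on to the sink
        } else {
          out.get(invalidTag).output(entity);  // dead-letter output, never hits the sink
        }
      }
    }).withOutputTags(validTag, TupleTagList.of(invalidTag)));

PCollection<MyEntity> valid = validated.get(validTag);
PCollection<MyEntity> deadLetters = validated.get(invalidTag);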

And for something like a database schema that evolves over time, what is
the typical approach?

- Should pipelines mirror the DB schema and validate all data types within
the pipeline?
- Should all sinks implement a way to take non-transient failures out of
the retry loop and output them via a PCollectionTuple (as is done with
BigQuery failed inserts)? A sketch of what I mean is below.
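
To illustrate that second bullet, here is the BigQuery failed-inserts
pattern I had in mind, again only a sketch (the table spec is a placeholder
and I'm assuming the table already exists). With a retry policy that only
retries transient errors, permanently failing rows come back through the
WriteResult instead of blocking the pipeline:

import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryInsertError;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.values.PCollection;
import com.google.api.services.bigquery.model.TableRow;

// rows is a PCollection<TableRow> produced earlier in the pipeline.
WriteResult result = rows.apply("WriteToBQ",
    BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")  // placeholder table spec
        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withExtendedErrorInfo()
        // Only retry transient errors; permanent failures are emitted instead.
        .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors()));

// Rows that failed with a non-transient error, plus the error details,
// which could be written to a dead-letter table or logged.
PCollection<BigQueryInsertError> failedInserts = result.getFailedInsertsWithErr();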
