damccorm opened a new issue, #20969:
URL: https://github.com/apache/beam/issues/20969
We use Dataflow with Apache Beam to read events from Kinesis streams.
Recently we noticed that when one of the streams became unavailable in the
middle of event processing (because it was removed or because of a problem with
the credentials), the data watermark for that stream kept advancing.
Consider the following scenario:
1. Permissions allow reading from stream A.
2. Data is read from stream A.
3. Permissions are changed and no longer allow reading from stream A.
4. The watermark for stream A keeps progressing (but stream data is not read
   because of the permissions issue).
5. Permissions are fixed so that stream A can be read again.
6. Data is read from stream A, but only from the updated watermark onward.
As a result, data written to the stream between steps 3 and 5 is lost, and the
client has no indication that this happened.
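
The scenario above can be reproduced with a toy model. This is only a sketch of the reported behaviour, not Beam's actual code: `NaiveStreamReader`, its `poll` method, and the integer "ticks" used as timestamps are all hypothetical, and the model assumes that a record whose event time is behind the watermark is dropped.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy reader (hypothetical) whose watermark follows the wall clock when a poll returns nothing. */
class NaiveStreamReader {
    private long watermark = 0;                          // event-time watermark, in ticks
    private final List<Long> accepted = new ArrayList<>();

    /** One poll at wall-clock time {@code now}; {@code records} is null when the stream cannot be read. */
    void poll(long now, List<Long> records) {
        if (records == null || records.isEmpty()) {
            // An unreadable stream is treated exactly like an idle one.
            watermark = Math.max(watermark, now);
            return;
        }
        for (long eventTime : records) {
            if (eventTime >= watermark) {
                accepted.add(eventTime);                 // on time
            }                                            // otherwise: behind the watermark, dropped
        }
        watermark = Math.max(watermark, now);
    }

    long watermark() { return watermark; }

    List<Long> accepted() { return accepted; }
}

class WatermarkLossDemo {
    public static void main(String[] args) {
        NaiveStreamReader reader = new NaiveStreamReader();
        reader.poll(10, List.of(10L));                   // steps 1-2: stream is readable
        reader.poll(20, null);                           // step 3: permissions revoked
        reader.poll(30, null);                           // step 4: watermark still advances
        reader.poll(40, List.of(15L, 25L, 40L));         // steps 5-6: backlog plus a new record
        System.out.println("watermark = " + reader.watermark());
        System.out.println("accepted  = " + reader.accepted());
    }
}
```

In this model the records with event times 15 and 25, written while the stream was unreadable, never reach `accepted`, and nothing in the reader's state shows that they were skipped.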
This is also misleading in the Dataflow console, where the advancing watermark
suggests that events are still being read from the stream. It also makes the
watermark hard to rely on as a source metric for alerting.
A brief investigation suggests that the _KinesisReader.getWatermark()_ logic
may not consider the state of the stream, i.e. whether it is available or not,
and treats a removed stream as a stream without traffic. The watermark
calculation should be adjusted to take that information into account.
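
One possible shape for such an adjustment is sketched below. It is an illustration only and does not use or reproduce any Beam API: the class `AvailabilityAwareWatermark`, its callbacks, and the idle threshold are hypothetical names. The idea is to advance the watermark on idleness only after a read that succeeded but returned nothing, and to hold it when the read itself failed.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical policy: advance on idleness only while the stream is known to be readable. */
class AvailabilityAwareWatermark {
    private final Duration idleThreshold;
    private Instant watermark;

    AvailabilityAwareWatermark(Instant start, Duration idleThreshold) {
        this.watermark = start;
        this.idleThreshold = idleThreshold;
    }

    /** A successful read returned a record: move forward to its event time. */
    void onRecord(Instant eventTime) {
        if (eventTime.isAfter(watermark)) {
            watermark = eventTime;
        }
    }

    /** A successful read returned nothing: the stream is idle but healthy. */
    void onEmptyRead(Instant now) {
        Instant idleMark = now.minus(idleThreshold);
        if (idleMark.isAfter(watermark)) {
            watermark = idleMark;
        }
    }

    /** The read failed (stream removed, access denied): hold the watermark where it is. */
    void onReadFailure() {
        // Intentionally empty: an unavailable stream is not an idle stream.
    }

    Instant current() {
        return watermark;
    }
}

class AvailabilityAwareWatermarkDemo {
    public static void main(String[] args) {
        Instant t0 = Instant.parse("2000-01-01T10:00:00Z");   // arbitrary start time
        AvailabilityAwareWatermark wm =
            new AvailabilityAwareWatermark(t0, Duration.ofMinutes(2));
        wm.onRecord(t0.plusSeconds(60));
        wm.onReadFailure();                                   // watermark does not move
        System.out.println("after failure:    " + wm.current());
        wm.onEmptyRead(t0.plusSeconds(600));                  // healthy but idle: may advance
        System.out.println("after empty read: " + wm.current());
    }
}
```

Holding the watermark is only one option; surfacing the read failure as an error or a metric would also address the alerting concern described above.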
Imported from Jira
[BEAM-12406](https://issues.apache.org/jira/browse/BEAM-12406). Original Jira
may contain additional context.
Reported by: mateuszratajocado.