Hi dev,

I would like to hear opinions on deprecating Trigger.Once and replacing
it with Trigger.AvailableNow [1] in Structured Streaming.

Rationale:

Trigger.Once is expected to read all data that became available since the
last trigger and process it. This holds when the last run terminated
gracefully, but streaming queries are not always terminated gracefully.
The last run may have written the offset (WAL) for a new batch before
terminating. In that case, a new run of Trigger.Once only processes the
data planned for that latest unfinished batch, and doesn't process any
newer data.
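To make the scenario above concrete, here is a toy model in plain Python.
It is not Spark code and does not reflect Spark's internals: ToyCheckpoint,
run_once, run_available_now and the integer offsets are all hypothetical
names invented for this illustration. It only mimics "the offset for a batch
is logged before the batch is processed".

```python
# Toy model only, NOT Spark code. All names here are hypothetical.
# Records are identified by integer offsets 0, 1, 2, ...

class ToyCheckpoint:
    def __init__(self):
        self.planned_end = 0    # end offset of the latest batch in the offset log (WAL)
        self.committed_end = 0  # end offset of the latest batch that finished


def run_once(cp, latest, crash_after_plan=False):
    """One run in the style of Trigger.Once: executes at most one batch."""
    if cp.planned_end > cp.committed_end:
        # An unfinished batch exists: re-run exactly that offset range.
        start, end = cp.committed_end, cp.planned_end
    else:
        start, end = cp.committed_end, latest
        cp.planned_end = end  # the offset is logged before processing
        if crash_after_plan:
            return []  # terminated before the batch was processed
    cp.committed_end = end
    return list(range(start, end))


def run_available_now(cp, latest):
    """One run in the style of Trigger.AvailableNow: drains up to `latest`."""
    processed = []
    if cp.planned_end > cp.committed_end:
        # Finish the unfinished batch first.
        processed += range(cp.committed_end, cp.planned_end)
        cp.committed_end = cp.planned_end
    if latest > cp.committed_end:
        # Then process everything else that is available right now.
        processed += range(cp.committed_end, latest)
        cp.planned_end = cp.committed_end = latest
    return processed


# Scenario: the first run logs offsets [0, 3) and dies before processing.
cp = ToyCheckpoint()
run_once(cp, latest=3, crash_after_plan=True)
# Two more records arrive (latest=5); the next Trigger.Once-style run
# only replays the unfinished batch and leaves offsets 3 and 4 behind.
replayed = run_once(cp, latest=5)  # [0, 1, 2]
```

In this toy, run_available_now on the same checkpoint state would return
[0, 1, 2, 3, 4], which is what users typically expect from a one-shot run.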

The behavior is not deterministic from the users' point of view: end
users cannot know whether the last run wrote the offset or not, unless
they inspect the query's checkpoint themselves.

While Trigger.AvailableNow was introduced to solve the scalability issue
of Trigger.Once, it also tries to process all data available at the point
in time it is triggered, which consistently matches the behavior users
expect from Trigger.Once.

Proposed Plan:

- Deprecate Trigger.Once in Apache Spark 3.3
- Add guidance on migrating to Trigger.AvailableNow to the migration guide
- Replace all usages of Trigger.Once with Trigger.AvailableNow, except in
the test cases for Trigger.Once itself
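
For the migration guidance, a minimal PySpark sketch of the change could
look like the following. It assumes a PySpark version whose
DataStreamWriter.trigger accepts the availableNow keyword (Spark 3.3.0 or
later); the function name, the parquet sink and the two path parameters are
hypothetical and only there for illustration.

```python
# Sketch only: the function name, sink format and paths are assumptions.
def start_available_now_query(df, checkpoint_dir, output_dir):
    """Start a one-shot streaming query on an existing streaming DataFrame."""
    return (
        df.writeStream
        .format("parquet")
        .option("checkpointLocation", checkpoint_dir)
        .option("path", output_dir)
        # Before the migration this line would be: .trigger(once=True)
        .trigger(availableNow=True)
        .start()
    )
```

In Scala, the equivalent change is replacing Trigger.Once() with
Trigger.AvailableNow() in the call to trigger(...).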

Please review the proposal and share your thoughts on it.

Thanks!
Jungtaek Lim

1. https://issues.apache.org/jira/browse/SPARK-36533
