aljoscha opened a new pull request #14312:
URL: https://github.com/apache/flink/pull/14312
Right now, we don't support `DataStream.connect(BroadcastStream)` in `BATCH`
execution mode. I believe we can add support for this with not too much work.
The key insight is that we can process the broadcast side before the
non-broadcast side. Initially, we were shying away from this because of
concerns about `ctx.applyToKeyedState()` which allows the broadcast side of the
user function to access/iterate over state from the keyed side. We thought that
we couldn't support this. However, since we know that we process the broadcast
side first we know that the keyed side will always be empty when doing so. We
can thus just make this "keyed iteration" call a no-op, instead of throwing an
exception as we do now.
This is work-in-progress. I have yet to add tests.
## Brief change log
* turn broadcast operation into a logical `Transformation`
* add `BATCH` operators for broadcast
* add translators
## Verifying this change
To be done.
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with
`@Public(Evolving)`: no
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its
components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: no
- The S3 file system connector: no
## Documentation
- Does this pull request introduce a new feature? maybe
We need to remove the caveat from the documentation. 🎊
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]