je-ik commented on issue #31981:
URL: https://github.com/apache/beam/issues/31981#issuecomment-2596057099
If GBK result does not fit into memory, but we need to reiterate it, there
currently is no good option in FlinkRunner. It would be possible to cache only
the smaller side of CoGBK, but the runner does not know which one it is. There
is related [discussion on ML][1], which suggests adding annotations to be able
to specify such hints to the runner. I will try to generalize the annotations,
so that it could be used for this case as well. The solution could then be
applying the CoGBK like
```java
CoGroupByKey.create().addAnnotation(CoGroupByKey.CACHE_LEFT)
```
The runner could then use this annotation to avoid caching the complete
result of underlying GBK.
Alternative approach is to avoid CoGBK for large PCollections and use state
API and implement it manually.
[1]: https://lists.apache.org/thread/bnx69vjn8r8mpkc63wq4qnv72o43wc6p
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]