je-ik commented on issue #31981:
URL: https://github.com/apache/beam/issues/31981#issuecomment-2596057099

   If GBK result does not fit into memory, but we need to reiterate it, there 
currently is no good option in FlinkRunner. It would be possible to cache only 
the smaller side of CoGBK, but the runner does not know which one it is. There 
is related [discussion on ML][1], which suggests adding annotations to be able 
to specify such hints to the runner. I will try to generalize the annotations, 
so that it could be used for this case as well. The solution could then be 
applying the CoGBK like
   
   ```java
    CoGroupByKey.create().addAnnotation(CoGroupByKey.CACHE_LEFT)
   ```
   
   The runner could then use this annotation to avoid caching the complete 
result of underlying GBK.
   
   Alternative approach is to avoid CoGBK for large PCollections and use state 
API and implement it manually.
   
   [1]: https://lists.apache.org/thread/bnx69vjn8r8mpkc63wq4qnv72o43wc6p


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to