echauchot commented on pull request #15381:
URL: https://github.com/apache/beam/pull/15381#issuecomment-938449865


   > @echauchot It's highly possible that I have a fundamental misunderstanding 
about how to handle window data. I'm going to try to outline my goals and 
challenges, and take the proposed implementation out of focus.
   > 
   > Goals:
   > 
   > * Change BulkIO/Write to output PCollectionTuple rather than PDone, in 
order to support reporting the status of indexing each input document
   > * Leave windows/timestampes/etc of input data entirely unaltered
   > 
   > Challenges:
   > 
   > * BulkIOBaseFn relies on buffering inputs, either using bundles or 
Stateful specs
   > * BulkIOBaseFn#finishBundle() must be called to ensure that any buffered 
inputs are sent to ES, and as of this PR, output to the PCollectionTuple
   > * DoFn.FinishBundle mehtods can accept a 
[FinishBundleContext](https://beam.apache.org/releases/javadoc/2.32.0/org/apache/beam/sdk/transforms/DoFn.FinishBundleContext.html)
 in order to output elements
   > * FinishBundleContext output methods all require explicit specification of 
a BoundedWindow instance
   > 
   > I got a bit stuck on that last point. My impression was that in order to 
ensure buffered docs' results were output, and in order to leave those 
elements' windows unaltered, I needed to keep track of the windows to which 
those elements belong so that they could be explicitly passed to 
FinishBundleContext#output.
   > 
   > I'd definitely be keen to learn more about how to handle windows and 
challenge my assumptions here. Thanks for your time @echauchot in reviewing and 
teaching.
   
   Just to write here what we discussed privately yesterday (Apache way: what 
did not happen publicly did not happen at all):
   - My bad, I did not know about the `DoFn#finishBundle()`  signature change. 
It now forces specifying a window in the output. I first thought that dealing 
with windows was not needed and brought unnecessary complexity but it seems it 
is mandatory :smile: 
   but I hope it is ony temporary until `OutputReceiver` is fleshed out.
   - I took a look at the existing IOs that ouput data to PCollections: 
     - `FhirIO`: outputs to last seen window. Seem incorrect.
     - `HadoopFormatIO` and `BigqueryIO` store to map keyed by window to then 
output per window similarly to what you do
     => So I guess the general window maintaining looks good. I need to look in 
more details at the code to give LGTM
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to