Re: [PR] Windowing Support for the DaskRunner [beam]

via GitHub Fri, 22 Dec 2023 16:04:09 -0800


cisaacstern commented on PR #27618:
URL: https://github.com/apache/beam/pull/27618#issuecomment-1868141165


   > and all the tests here are now passing for me locally 🎉
   
   Today I realized that the side input tests that existed in this PR were not 
actually testing side inputs (they were testing arg/kwarg passing, which IIUC, 
is not the same thing). So 
https://github.com/apache/beam/pull/27618/commits/3c0fbc68d2734945b5a560557251dd2c290fc325
 makes the side input tests actually test side inputs.
   
   And these tests do **_not_** yet pass. 😄 
   
   (To be clear the upstream `dask` work mentioned in the previous comment is 
still _totally essential_, insofar as it unlocks the ability to do _any real 
testing at all_. So the discovery of these test failures is actually a  ✨ 
*positive development* ✨ in a way, because it reflects progress here: we can 
finally actually test, and therefore find out what works and what doesn't!)
   
   A fair amount of debugging around led to the skeleton `SideInputMap` draft 
in 
https://github.com/apache/beam/pull/27618/commits/8dc89718568ab917ca658ff8db4affcc1de828c1,
 which is not yet functional but hopefully at least a basis for further 
consideration.
   
   On a separate note, during debugging I found myself changing the runner as 
follows, to allow for synchronous execution:
   
   ```python
   dask_options = options.view_as(DaskOptions).get_all_options(
           drop_default=True)
   # comment-out the client so we don't start a 
   # thread/process pool during synchronous testing
   # client = ddist.Client(**dask_options)
   
   pipeline.replace_all(dask_overrides())
   
   dask_visitor = self.to_dask_bag_visitor()
   pipeline.visit(dask_visitor)
   opt_graph = dask.optimize(*list(dask_visitor.bags.values()))
   # execute synchronously for easier debugging
   opt_graph[-1].compute(scheduler="sync")
   # futures = client.compute(opt_graph)
   # return DaskRunnerResult(client, futures)
   ```
   
   Possibly there's a way to make that configuration settable via 
PipelineOptions, and perhaps also as the default setup for 
`TestPipeline(runner=DaskRunner)` (though we wouldn't want to miss out on all 
of the multiprocessing bugs that we catch by using a thread/process pool... 
curious to look into how this is handled in DirectRunner testing).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Windowing Support for the DaskRunner [beam]

Reply via email to