damccorm opened a new pull request, #30961:
URL: https://github.com/apache/beam/pull/30961
In some scenarios, it is helpful to pre-batch data before it reaches
RunInference; this change lets users skip RunInference's internal batching in
those cases. For example, if you're doing image classification in a streaming
pipeline, your flow is often:
`Read from source (small data per element) -> download images (large data
per element) -> inference`
Ideally, you'd batch across bundles, since in streaming pipelines bundles may
be too small for in-bundle batching to help. But cross-bundle batching is much
more expensive after downloading the images, since it requires a shuffle and
shuffling larger elements costs more.
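As a minimal sketch of the flow above (the Pub/Sub topic, `download_and_preprocess_image`, and `model_handler` are hypothetical stand-ins, not part of this change):
```
import apache_beam as beam
from apache_beam.ml.inference.base import RunInference

with beam.Pipeline() as p:
    _ = (
        p
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/image-urls")
        | "Download" >> beam.Map(download_and_preprocess_image)  # small -> large elements
        | "Inference" >> RunInference(model_handler)  # batches internally, within bundles
    )
```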
Today, our recommendation is to either:
1) `Read from source (small data per element) -> BatchElements -> download
images (large data per element) -> RunInference (with max_batch_size=1)` - this
requires overriding your `run_inference` function to immediately remove the
batching dimension, like:
```
def run_inference(self, batch: Sequence[Sequence[ExampleT]], model, inference_args=None):
    # Each incoming element is itself a pre-built batch; unwrap it.
    real_batch = batch[0]
    ...
```
or:
2) `Read from source (small data per element) -> RunInference (with cross
bundle batching)` - this requires overriding your `run_inference` function to
download images, like:
```
def run_inference(self, batch: Sequence[ExampleT], model, inference_args=None):
    # Probably actually something async here.
    real_batch = [download_and_preprocess_image(example) for example in batch]
    ...
```
Both options are awkward and force the user to modify their model
handler or use a custom one.
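For illustration, the kind of custom handler option 1 forces on the user might look roughly like the following sketch against the `ModelHandler` API (`base_handler` is a hypothetical existing handler):
```
from typing import Any, Dict

from apache_beam.ml.inference.base import ModelHandler


class PrebatchedWrapper(ModelHandler):
    """Wraps a handler whose inputs were already batched upstream."""

    def __init__(self, base_handler: ModelHandler):
        self._base = base_handler

    def load_model(self):
        return self._base.load_model()

    def batch_elements_kwargs(self) -> Dict[str, Any]:
        # Make RunInference's internal batching a pass-through: each
        # incoming element (already a batch) becomes its own "batch" of 1.
        return {"max_batch_size": 1}

    def run_inference(self, batch, model, inference_args=None):
        # Unwrap the length-1 batch to recover the pre-built batch.
        return self._base.run_inference(batch[0], model, inference_args)
```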
With this change, you can now simplify the flow to:
`Read from source (small data per element) -> Cross-bundle batching ->
download images (large data per element) -> RunInference without batching`
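A sketch of that simplified flow, using Beam's existing stateful `GroupIntoBatches` for the cross-bundle step (the keying scheme and names are illustrative, and the exact knob for disabling RunInference's internal batching is what this PR adds, so it isn't spelled out here):
```
import apache_beam as beam
from apache_beam.ml.inference.base import RunInference

with beam.Pipeline() as p:
    _ = (
        p
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/image-urls")
        | "Key" >> beam.WithKeys(lambda url: hash(url) % 10)  # spread state across 10 keys
        | "Batch" >> beam.GroupIntoBatches(32)  # stateful, cross-bundle batching
        | "Drop keys" >> beam.Values()
        | "Download" >> beam.Map(lambda urls: [download_and_preprocess_image(u) for u in urls])
        | "Inference" >> RunInference(model_handler)  # with this change: no internal re-batching
    )
```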
------------------------
Thank you for your contribution! Follow this checklist to help us
incorporate your contribution quickly and easily:
- [ ] Mention the appropriate issue in your description (for example:
`addresses #123`), if applicable. This will automatically add a link to the
pull request in the issue. If you would like the issue to automatically close
on merging the pull request, comment `fixes #<ISSUE NUMBER>` instead.
- [ ] Update `CHANGES.md` with noteworthy changes.
- [ ] If this contribution is large, please file an Apache [Individual
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
See the [Contributor Guide](https://beam.apache.org/contribute) for more
tips on [how to make review process
smoother](https://github.com/apache/beam/blob/master/CONTRIBUTING.md#make-the-reviewers-job-easier).
To check the build health, please visit
[https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md](https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md)
GitHub Actions Tests Status (on master branch)
------------------------------------------------------------------------------------------------
[Build python source distribution and wheels](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule)
[Python Tests](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule)
[Java Tests](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule)
[Go tests](https://github.com/apache/beam/actions?query=workflow%3A%22Go+tests%22+branch%3Amaster+event%3Aschedule)
See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more
information about GitHub Actions CI or the [workflows
README](https://github.com/apache/beam/blob/master/.github/workflows/README.md)
to see a list of phrases to trigger workflows.