damccorm opened a new pull request, #30961:
URL: https://github.com/apache/beam/pull/30961
In some scenarios, it is helpful to pre-batch data before it reaches
RunInference; this change lets users skip RunInference's internal batching in
those cases. For example, if you're doing image classification in a streaming
pipeline, your flow is often:
`Read from source (small data per element) -> download images (large data
per element) -> inference`
Ideally, you'd batch across bundles, since in streaming pipelines bundles may
be too small for in-bundle batching to help. But cross-bundle batching is much
more expensive after downloading the images, since it requires a shuffle and
shuffling larger elements costs more.
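As a minimal sketch of the flow above (the Pub/Sub topic, `download_and_preprocess_image`, and `model_handler` are hypothetical stand-ins, not part of this change):
```
import apache_beam as beam
from apache_beam.ml.inference.base import RunInference

with beam.Pipeline() as p:
    _ = (
        p
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/image-urls")
        | "Download" >> beam.Map(download_and_preprocess_image)  # small -> large elements
        | "Inference" >> RunInference(model_handler)  # batches internally, within bundles
    )
```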
Today, our recommendation is to either:
1) `Read from source (small data per element) -> BatchElements -> download
images (large data per element) -> RunInference (with max_batch_size=1)` - this
requires overriding your `run_inference` function to immediately remove the
batching dimension, like:
```
def run_inference(self, batch: Sequence[Sequence[ExampleT]], model, inference_args=None):
    # Each incoming element is itself a pre-built batch; unwrap it.
    real_batch = batch[0]
    ...
```
or:
2) `Read from source (small data per element) -> RunInference (with cross
bundle batching)` - this requires overriding your `run_inference` function to
download images, like:
```
def run_inference(self, batch: Sequence[ExampleT], model, inference_args=None):
    # Probably actually something async here.
    real_batch = [download_and_preprocess_image(example) for example in batch]
    ...
```
Both options are awkward and force the user to modify their model
handler or use a custom one.
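For illustration, the kind of custom handler option 1 forces on the user might look roughly like the following sketch against the `ModelHandler` API (`base_handler` is a hypothetical existing handler):
```
from typing import Any, Dict

from apache_beam.ml.inference.base import ModelHandler


class PrebatchedWrapper(ModelHandler):
    """Wraps a handler whose inputs were already batched upstream."""

    def __init__(self, base_handler: ModelHandler):
        self._base = base_handler

    def load_model(self):
        return self._base.load_model()

    def batch_elements_kwargs(self) -> Dict[str, Any]:
        # Make RunInference's internal batching a pass-through: each
        # incoming element (already a batch) becomes its own "batch" of 1.
        return {"max_batch_size": 1}

    def run_inference(self, batch, model, inference_args=None):
        # Unwrap the length-1 batch to recover the pre-built batch.
        return self._base.run_inference(batch[0], model, inference_args)
```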
With this change, you can now simplify the flow to:
`Read from source (small data per element) -> Cross-bundle batching ->
download images (large data per element) -> RunInference without batching`
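A sketch of that simplified flow, using Beam's existing stateful `GroupIntoBatches` for the cross-bundle step (the keying scheme and names are illustrative, and the exact knob for disabling RunInference's internal batching is what this PR adds, so it isn't spelled out here):
```
import apache_beam as beam
from apache_beam.ml.inference.base import RunInference

with beam.Pipeline() as p:
    _ = (
        p
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/image-urls")
        | "Key" >> beam.WithKeys(lambda url: hash(url) % 10)  # spread state across 10 keys
        | "Batch" >> beam.GroupIntoBatches(32)  # stateful, cross-bundle batching
        | "Drop keys" >> beam.Values()
        | "Download" >> beam.Map(lambda urls: [download_and_preprocess_image(u) for u in urls])
        | "Inference" >> RunInference(model_handler)  # with this change: no internal re-batching
    )
```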
------------------------
Thank you for your contribution! Follow this checklist to help us
incorporate your contribution quickly and easily:
- [ ] Mention the appropriate issue in your description (for example:
`addresses #123`), if applicable. This will automatically add a link to the
pull request in the issue. If you would like the issue to automatically close
on merging the pull request, comment `fixes #<ISSUE NUMBER>` instead.
- [ ] Update `CHANGES.md` with noteworthy changes.
- [ ] If this contribution is large, please file an Apache [Individual
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
See the [Contributor Guide](https://beam.apache.org/contribute) for more
tips on [how to make review process
smoother](https://github.com/apache/beam/blob/master/CONTRIBUTING.md#make-the-reviewers-job-easier).
To check the build health, please visit
[https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md](https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md)
GitHub Actions Tests Status (on master branch)
------------------------------------------------------------------------------------------------
[Build python source distribution and wheels](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule)
[Python Tests](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule)
[Java Tests](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule)
[Go tests](https://github.com/apache/beam/actions?query=workflow%3A%22Go+tests%22+branch%3Amaster+event%3Aschedule)
See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more
information about GitHub Actions CI or the [workflows
README](https://github.com/apache/beam/blob/master/.github/workflows/README.md)
to see a list of phrases to trigger workflows.