echauchot commented on PR #24009: URL: https://github.com/apache/beam/pull/24009#issuecomment-1313751222
> This PR changes the PipelineTranslator to support caching datasets if they are used multiple times. This is an outstanding feature compared to the RDD runner and critical to run batch pipelines on large scale as it prevents the potentially costly re-evaluation of datasets (closes #24008). Additionally this PR is a preparation to fix / simplify handling of side inputs (see #24035). Why is it that this PR addresses 2 separate tickets when the guidelines state to address only one per PR (for isolation, diagnostic, and revert reasons) ? Can't the 2 issues be addressed separately ? They are too coupled ? > > In our benchmarks, considering the insignificant scale of both nexmark and tpc-ds (1GB) tests, the impact of this is expected to be neutral or even negative. Sure, maybe we should setup a different CI job with TPCDS runs in higher scale in a different PR. > > Pipeline translation is changed as follows: > > 1. Detect if `streaming` mode is required. > > * As is, just removed unnecessary clutter. > 2. Identify datasets that are repeatedly used as input and should be cached. > > * New translation step, similar to the one done in the RDD runner. > * Rather than just gathering translation state as `Map<PCollection, Dataset>`, this has turned into `Map<PCollection, TranslationResult>` to track the `Dataset` but also additional metadata such as dependencies (and broadcast variables once [[Bug]: Improve handling of side inputs in Spark dataset runnerĀ #24035](https://github.com/apache/beam/issues/24035) is addressed) > 3. And finally, translate each primitive or composite `PTransform` that has a `TransformTranslator` and is supported (`TransformTranslator.canTranslate`) into its Spark correspondence. If a composite is not supported, it will be expanded further into its parts and translated then. > > * Logically as is, but using the extended state. > * Also, this PR attempts to better distinguish between translation and evaluation. Instead of returning a translation context after the translation, this translation state is dropped as it is (mostly) not needed anymore. Only the necessary parts (leave datasets) you mean **leaves** datasets, meaning at the very end of the dataset tree, right ? are returned as `EvaluationContext`, that can be used to trigger the evaluation of the pipeline. > > Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: > > * [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). > * [ ] Mention the appropriate issue in your description (for example: `addresses #123`), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment `fixes #<ISSUE NUMBER>` instead. > * [ ] Update `CHANGES.md` with noteworthy changes. > * [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). > > See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/get-started-contributing/#make-the-reviewers-job-easier). > > To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md > > ## GitHub Actions Tests Status (on master branch) > [](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule) [](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule) [](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule) [](https://github.com/apache/beam/actions?query=workflow%3A%22Go+tests%22+branch%3Amaster+event%3Aschedule) > > See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more information about GitHub Actions CI. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
