echauchot commented on PR #24009:
URL: https://github.com/apache/beam/pull/24009#issuecomment-1313751222

   > This PR changes the PipelineTranslator to support caching datasets if they 
are used multiple times. This is an outstanding feature compared to the RDD 
runner and critical to run batch pipelines on large scale as it prevents the 
potentially costly re-evaluation of datasets (closes #24008). Additionally this 
PR is a preparation to fix / simplify handling of side inputs (see #24035).
   
   Why is it that this PR addresses 2 separate tickets when the guidelines 
state to address only one per PR (for isolation, diagnostic, and revert 
reasons) ? Can't the 2 issues be addressed separately ? They are too coupled ?
   
   > 
   > In our benchmarks, considering the insignificant scale of both nexmark and 
tpc-ds (1GB) tests, the impact of this is expected to be neutral or even 
negative.
   
   Sure, maybe we should setup a different CI job with TPCDS runs in higher 
scale in a different PR.
   
   > 
   > Pipeline translation is changed as follows:
   > 
   > 1. Detect if `streaming` mode is required.
   >    
   >    * As is, just removed unnecessary clutter.
   > 2. Identify datasets that are repeatedly used as input and should be 
cached.
   >    
   >    * New translation step, similar to the one done in the RDD runner.
   >    * Rather than just gathering translation state as `Map<PCollection, 
Dataset>`, this has turned into `Map<PCollection, TranslationResult>` to track 
the `Dataset` but also additional metadata such as dependencies (and broadcast 
variables once [[Bug]:  Improve handling of side inputs in Spark dataset runnerĀ 
#24035](https://github.com/apache/beam/issues/24035) is addressed)
   > 3. And finally, translate each primitive or composite `PTransform` that 
has a `TransformTranslator` and is supported 
(`TransformTranslator.canTranslate`) into its Spark correspondence. If a 
composite is not supported, it will be expanded further into its parts and 
translated then.
   >    
   >    * Logically as is, but using the extended state.
   >    * Also, this PR attempts to better distinguish between translation and 
evaluation. Instead of returning a translation context after the translation, 
this translation state is dropped as it is (mostly) not needed anymore. Only 
the necessary parts (leave datasets) 
   you mean **leaves** datasets, meaning at the very end of the dataset tree, 
right ?
   
   are returned as `EvaluationContext`, that can be used to trigger the 
evaluation of the pipeline.
   > 
   > Thank you for your contribution! Follow this checklist to help us 
incorporate your contribution quickly and easily:
   > 
   > * [ ]  [**Choose 
reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and 
mention them in a comment (`R: @username`).
   > * [ ]  Mention the appropriate issue in your description (for example: 
`addresses #123`), if applicable. This will automatically add a link to the 
pull request in the issue. If you would like the issue to automatically close 
on merging the pull request, comment `fixes #<ISSUE NUMBER>` instead.
   > * [ ]  Update `CHANGES.md` with noteworthy changes.
   > * [ ]  If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   > 
   > See the [Contributor Guide](https://beam.apache.org/contribute) for more 
tips on [how to make review process 
smoother](https://beam.apache.org/contribute/get-started-contributing/#make-the-reviewers-job-easier).
   > 
   > To check the build health, please visit 
https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
   > 
   > ## GitHub Actions Tests Status (on master branch)
   > [![Build python source distribution and 
wheels](https://github.com/apache/beam/workflows/Build%20python%20source%20distribution%20and%20wheels/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule)
 [![Python 
tests](https://github.com/apache/beam/workflows/Python%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule)
 [![Java 
tests](https://github.com/apache/beam/workflows/Java%20Tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule)
 [![Go 
tests](https://github.com/apache/beam/workflows/Go%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Go+tests%22+branch%3Amaster+event%3Aschedule)
   > 
   > See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more 
information about GitHub Actions CI.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to