[
https://issues.apache.org/jira/browse/BEAM-8816?focusedWorklogId=354909&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354909
]
ASF GitHub Bot logged work on BEAM-8816:
----------------------------------------
Author: ASF GitHub Bot
Created on: 06/Dec/19 03:23
Start Date: 06/Dec/19 03:23
Worklog Time Spent: 10m
Work Description: tweise commented on pull request #10313: [BEAM-8816]
Option to load balance bundle processing w/ multiple SDK workers
URL: https://github.com/apache/beam/pull/10313
A new pipeline option that allows for bundles of a given executable stage to
be processed with any available SDK worker (the default is to process all
bundles with the same SDK worker).
https://lists.apache.org/thread.html/59c02d8b8ea849c158deb39ad9d83af4d8fcb56570501c7fe8f79bb2%40%3Cdev.beam.apache.org%3E
------------------------
Thank you for your contribution! Follow this checklist to help us
incorporate your contribution quickly and easily:
- [ ] [**Choose
reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and
mention them in a comment (`R: @username`).
- [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA
issue, if applicable. This will automatically link the pull request to the
issue.
- [ ] If this contribution is large, please file an Apache [Individual
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
See the [Contributor Guide](https://beam.apache.org/contribute) for more
tips on [how to make review process
smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier).
Post-Commit Tests Status (on master branch)
------------------------------------------------------------------------------------------------
Lang | SDK | Apex | Dataflow | Flink | Gearpump | Samza | Spark
--- | --- | --- | --- | --- | --- | --- | ---
Go | [](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/)
| --- | --- | [](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/)
| --- | --- | [](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/)
Java | [](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/)
| [](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/)
| [](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/)
| [](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/)<br>[](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/)<br>[](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/)
| [](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/)
| [](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/)
| [](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/)<br>[](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/)<br>[](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/lastCompletedBuild/)
Python | [](https://builds.apache.org/job/beam_PostCommit_Python2/lastCompletedBuild/)<br>[](https://builds.apache.org/job/beam_PostCommit_Python35/lastCompletedBuild/)<br>[](https://builds.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/)<br>[](https://builds.apache.org/job/beam_PostCommit_Python37/lastCompletedBuild/)
| --- | [](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/)<br>[](https://builds.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/)
| [](https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Cron/lastCompletedBuild/)<br>[](https://builds.apache.org/job/beam_PostCommit_Python35_VR_Flink/lastCompletedBuild/)
| --- | --- | [](https://builds.apache.org/job/beam_PostCommit_Python_VR_Spark/lastCompletedBuild/)
XLang | --- | --- | --- | [](https://builds.apache.org/job/beam_PostCommit_XVR_Flink/lastCompletedBuild/)
| --- | --- | ---
Pre-Commit Tests Status (on master branch)
------------------------------------------------------------------------------------------------
--- |Java | Python | Go | Website
--- | --- | --- | --- | ---
Non-portable | [](https://builds.apache.org/job/beam_PreCommit_Java_Cron/lastCompletedBuild/)
| [](https://builds.apache.org/job/beam_PreCommit_Python_Cron/lastCompletedBuild/)<br>[](https://builds.apache.org/job/beam_PreCommit_PythonLint_Cron/lastCompletedBuild/)
| [](https://builds.apache.org/job/beam_PreCommit_Go_Cron/lastCompletedBuild/)
| [](https://builds.apache.org/job/beam_PreCommit_Website_Cron/lastCompletedBuild/)
Portable | --- | [](https://builds.apache.org/job/beam_PreCommit_Portable_Python_Cron/lastCompletedBuild/)
| --- | ---
See
[.test-infra/jenkins/README](https://github.com/apache/beam/blob/master/.test-infra/jenkins/README.md)
for trigger phrase, status and link of all Jenkins jobs.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 354909)
Remaining Estimate: 0h
Time Spent: 10m
> Load balance bundle processing w/ multiple SDK workers
> ------------------------------------------------------
>
> Key: BEAM-8816
> URL: https://issues.apache.org/jira/browse/BEAM-8816
> Project: Beam
> Issue Type: Improvement
> Components: runner-core, runner-flink
> Affects Versions: 2.17.0
> Reporter: Thomas Weise
> Assignee: Thomas Weise
> Priority: Major
> Labels: portability
> Time Spent: 10m
> Remaining Estimate: 0h
>
> We found skewed utilization of SDK workers causing excessive latency with
> Streaming/Python/Flink. (Remember that with Python, we need to execute
> multiple worker processes on a machine instead of relying on threads in a
> single worker, which requires the runner to make a decision to which worker
> to give a bundle for processing.)
> The Flink runner has knobs to influence the number of records per bundle and
> the maximum duration for a bundle. But since the runner does not understand
> the cost of individual records, it is possible for the duration of bundles to
> fluctuate significantly due to skew in processing time of individual records.
> And unless the bundle size is 1, multiple expensive records could be
> allocated to a single bundle before the cutoff time is reached. We notice
> this with a pipeline that executes models, but there are other use cases
> where the cost of individual records can vary significantly.
> Additionally, the Flink runner establishes the association between the
> subtask managing an executable stage and the SDK worker during
> initialization, lasting for the duration of the job. In other words, bundles
> for the same executable stage will always be sent to the same SDK worker.
> When the execution time skew is tied to specific keys (stateful processing),
> it further aggravates the issue.
> [https://lists.apache.org/thread.html/59c02d8b8ea849c158deb39ad9d83af4d8fcb56570501c7fe8f79bb2@%3Cdev.beam.apache.org%3E]
> Long term this problem can be addressed with SDF. Till then, an (optional)
> runner controlled balancing mechanism has shown to improve the performance in
> internal testing.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)