rmetzger commented on a change in pull request #15355:
URL: https://github.com/apache/flink/pull/15355#discussion_r600803890
##########
File path: docs/content.zh/docs/deployment/elastic_scaling.md
##########
@@ -100,13 +110,40 @@ Since Reactive Mode is a new, experimental feature, not
all features supported b
- **Deployment is only supported as a standalone application deployment**.
Active resource providers (such as native Kubernetes, YARN or Mesos) are
explicitly not supported. Standalone session clusters are not supported either.
The application deployment is limited to single job applications.
The only supported deployment options are [Standalone in Application
Mode]({{< ref "docs/deployment/resource-providers/standalone/overview"
>}}#application-mode) ([described](#getting-started) on this page), [Docker in
Application Mode]({{< ref
"docs/deployment/resource-providers/standalone/docker"
>}}#application-mode-on-docker) and [Standalone Kubernetes Application
Cluster]({{< ref "docs/deployment/resource-providers/standalone/kubernetes"
>}}#deploy-application-cluster).
-- **Streaming jobs only**: The first version of Reactive Mode runs with streaming jobs only. When submitting a batch job, the default scheduler will be used.
-- **No support for [local recovery]({{< ref
"docs/ops/state/large_state_tuning">}}#task-local-recovery)**: Local recovery
is a feature that schedules tasks to machines so that the state on that machine
gets re-used if possible. The lack of this feature means that Reactive Mode
will always need to download the entire state from the checkpoint storage.
-- **No support for local failover**: Local failover means that the scheduler
is able to restart parts ("regions" in Flink's internals) of a failed job,
instead of the entire job. This limitation impacts only recovery time of
embarrassingly parallel jobs: Flink's default scheduler can restart failed
parts, while Reactive Mode will restart the entire job.
-- **Limited integration with Flink's Web UI**: Reactive Mode allows a job's parallelism to change over its lifetime. The web UI only shows the current parallelism of the job.
-- **Limited Job metrics**: With the exception of `numRestarts`, all [availability]({{< ref "docs/ops/metrics" >}}#availability) and [checkpointing]({{< ref "docs/ops/metrics" >}}#checkpointing) metrics with the `Job` scope do not work correctly.
+The [limitations of Adaptive Scheduler](#limitations-1) also apply to Reactive
Mode.
+
+
+## Adaptive Scheduler
+
+{{< hint danger >}}
+Using Adaptive Scheduler directly (not through Reactive Mode) is only advised
for advanced users.
+{{< /hint >}}
+
+Adaptive Scheduler is a scheduler that can adjust the parallelism of a job based on the available slots. On startup, it requests the number of slots needed based on the parallelisms configured by the user in the streaming job. If the number of slots offered is lower than requested, Adaptive Scheduler will reduce the parallelism so that it can start executing the job (or fail if insufficient slots are available). In Reactive Mode (see above) the requested parallelism is conceptually set to infinity, letting the job always use as many resources as possible. You can also use Adaptive Scheduler without Reactive Mode, but there are some practical limitations:
+- If you are using Adaptive Scheduler on a session cluster, there are no
guarantees regarding the distribution of slots between multiple running jobs in
the same session.
+- An active resource manager (native Kubernetes, YARN, Mesos) will request
TaskManagers until the parallelism requested by the job is fulfilled,
potentially allocating a lot of resources.
+One benefit of Adaptive Scheduler over the default scheduler is that it can handle TaskManager losses gracefully, since it will simply scale down in these cases.
+
+### Usage
+
+The following configuration parameters need to be set:
+
+- `jobmanager.scheduler: adaptive`: Change from the default scheduler to Adaptive Scheduler.
+- `cluster.declarative-resource-management.enabled`: Declarative resource management must be enabled (it is enabled by default).
+
+Depending on your usage scenario, we also recommend adjusting the parallelism of the job you are submitting to Adaptive Scheduler. The configured parallelism determines the number of slots Adaptive Scheduler will request.
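For illustration, a minimal sketch of how these options could look in `flink-conf.yaml`, assuming `parallelism.default` is used to set the job's parallelism (the value is only an example):

```yaml
# Switch from the default scheduler to Adaptive Scheduler
jobmanager.scheduler: adaptive

# Declarative resource management (enabled by default)
cluster.declarative-resource-management.enabled: true

# Example value only: the configured parallelism determines how many
# slots Adaptive Scheduler will request for the job
parallelism.default: 4
```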
Review comment:
from the introduction of adaptive scheduler, it is probably clear what
the parallelism does. I'll remove that sentence.