This is an automated email from the ASF dual-hosted git repository.

sewen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/flink.git
commit 754cd71dd92dfcfff3e1ef23083790422188ce9e
Author: Seth Wiesman <[email protected]>
AuthorDate: Wed May 1 21:30:27 2019 -0500

    [FLINK-12388][docs] Update the production readiness checklist

    This closes #8330
---
 docs/ops/production_ready.md    | 87 +++++++++++++++-------------------------
 docs/ops/production_ready.zh.md | 88 +++++++++++++++--------------------------
 2 files changed, 63 insertions(+), 112 deletions(-)

diff --git a/docs/ops/production_ready.md b/docs/ops/production_ready.md
index 08fef2b..ef97173 100644
--- a/docs/ops/production_ready.md
+++ b/docs/ops/production_ready.md
@@ -22,79 +22,54 @@ specific language governing permissions and limitations
 under the License.
 -->
 
+The production readiness checklist provides an overview of configuration options that should be carefully considered before bringing an Apache Flink job into production.
+While the Flink community has attempted to provide sensible defaults for each configuration, it is important to review this list and ensure the options chosen are sufficient for your needs.
+
 * ToC
 {:toc}
 
-## Production Readiness Checklist
-
-Purpose of this production readiness checklist is to provide a condensed overview of configuration options that are
-important and need **careful considerations** if you plan to bring your Flink job into **production**. For most of these options
-Flink provides out-of-the-box defaults to make usage and adoption of Flink easier. For many users and scenarios, those
-defaults are good starting points for development and completely sufficient for "one-shot" jobs.
-
-However, once you are planning to bring a Flink application to production the requirements typically increase. For example,
-you want your job to be (re-)scalable and to have a good upgrade story for your job and new Flink versions.
-
-In the following, we present a collection of configuration options that you should check before your job goes into production.
-
-### Set maximum parallelism for operators explicitly
-
-Maximum parallelism is a configuration parameter that is newly introduced in Flink 1.2 and has important implications
-for the (re-)scalability of your Flink job. This parameter, which can be set on a per-job and/or per-operator granularity,
-determines the maximum parallelism to which you can scale operators. It is important to understand that (as of now) there
-is **no way to change** this parameter after your job has been started, except for restarting your job completely
-from scratch (i.e. with a new state, and not from a previous checkpoint/savepoint). Even if Flink would provide some way
-to change maximum parallelism for existing savepoints in the future, you can already assume that for large states this is
-likely a long running operation that you want to avoid. At this point, you might wonder why not just to use a very high
-value as default for this parameter. The reason behind this is that high maximum parallelism can have some impact on your
-application's performance and even state sizes, because Flink has to maintain certain metadata for its ability to rescale which
-can increase with the maximum parallelism. In general, you should choose a max parallelism that is high enough to fit your
-future needs in scalability, but keeping it as low as possible can give slightly better performance. In particular,
-a maximum parallelism higher that 128 will typically result in slightly bigger state snapshots from the keyed backends.
+### Set An Explicit Max Parallelism
 
-Notice that maximum parallelism must fulfill the following conditions:
+The max parallelism, set on a per-job and per-operator granularity, determines the maximum parallelism to which a stateful operator can scale.
+There is currently **no way to change** the maximum parallelism of an operator after a job has started without discarding that operator's state.
+The reason maximum parallelism exists, versus allowing stateful operators to be infinitely scalable, is that it has some impact on your application's performance and state size.
+Flink has to maintain specific metadata for its ability to rescale state, which grows linearly with the max parallelism.
+In general, you should choose a max parallelism that is high enough to fit your future needs in scalability, while keeping it low enough to maintain reasonable performance.
 
-`0 < parallelism <= max parallelism <= 2^15`
+{% panel **Note:** Maximum parallelism must fulfill the following conditions: `0 < parallelism <= max parallelism <= 2^15` %}
 
-You can set the maximum parallelism by `setMaxParallelism(int maxparallelism)`. By default, Flink will choose the maximum
-parallelism as a function of the parallelism when the job is first started:
+You can explicitly set the maximum parallelism by using `setMaxParallelism(int maxParallelism)`.
+If no max parallelism is set, Flink will decide using a function of the operator's parallelism when the job is first started:
 
 - `128` : for all parallelism <= 128.
 - `MIN(nextPowerOfTwo(parallelism + (parallelism / 2)), 2^15)` : for all parallelism > 128.
 
-### Set UUIDs for operators
+### Set UUIDs For All Operators
 
-As mentioned in the documentation for [savepoints]({{ site.baseurl }}/ops/state/savepoints.html), users should set uids for
-operators. Those operator uids are important for Flink's mapping of operator states to operators which, in turn, is
-essential for savepoints. By default operator uids are generated by traversing the JobGraph and hashing certain operator
-properties. While this is comfortable from a user perspective, it is also very fragile, as changes to the JobGraph (e.g.
-exchanging an operator) will result in new UUIDs. To establish a stable mapping, we need stable operator uids provided
-by the user through `setUid(String uid)`.
+As mentioned in the documentation for [savepoints]({{ site.baseurl }}/ops/state/savepoints.html), users should set uids for each operator in their `DataStream`.
+Uids are necessary for Flink's mapping of operator states to operators which, in turn, is essential for savepoints.
+By default, operator uids are generated by traversing the JobGraph and hashing specific operator properties.
+While this is comfortable from a user perspective, it is also very fragile, as changes to the JobGraph (e.g., exchanging an operator) result in new UUIDs.
+To establish a stable mapping, we need stable operator uids provided by the user through `setUid(String uid)`.
 
-### Choice of state backend
+### Choose The Right State Backend
 
-Currently, Flink has the limitation that it can only restore the state from a savepoint for the same state backend that
-took the savepoint. For example, this means that we can not take a savepoint with a memory state backend, then change
-the job to use a RocksDB state backend and restore. While we are planning to make backends interoperable in the near
-future, they are not yet. This means you should carefully consider which backend you use for your job before going to
-production.
+Currently, Flink's savepoint binary format is state backend specific.
+A savepoint taken with one state backend cannot be restored using another, and you should carefully consider which backend you use before going to production.
 
-In general, we recommend using RocksDB because this is currently the only state backend that supports large states (i.e.
-state that exceeds the available main memory) and asynchronous snapshots. From our experience, asynchronous snapshots are
-very important for large states because they do not block the operators and Flink can write the snapshots without stopping
-stream processing. However, RocksDB can have worse performance than, for example, the memory-based state backends. If
-you are sure that your state will never exceed main memory and blocking the stream processing to write it is not an issue,
-you **could consider** to not use the RocksDB backends. However, at this point, we **strongly recommend** using RocksDB
-for production.
+In general, we recommend avoiding `MemoryStateBackend` in production because it stores its snapshots inside the JobManager as opposed to persistent disk.
+When deciding between `FsStateBackend` and `RocksDB`, it is a choice between performance and scalability.
+`FsStateBackend` is very fast as each state access and update operates on objects on the Java heap; however, state size is limited by available memory within the cluster.
+On the other hand, `RocksDB` can scale based on available disk space and is the only state backend to support incremental snapshots.
+However, each state access and update requires (de-)serialization and potentially reading from disk, which leads to average performance that is an order of magnitude slower than the memory state backends.
+Carefully read through the [state backend documentation]({{ site.baseurl }}/ops/state/state_backends.html) to fully understand the pros and cons of each option.
 
-### Config JobManager High Availability(HA)
+### Configure JobManager High Availability
 
-The JobManager coordinates every Flink deployment. It is responsible for both *scheduling* and *resource management*.
+The JobManager serves as a central coordinator for each Flink deployment, being responsible for both scheduling and resource management of the cluster.
+It is a single point of failure within the cluster, and if it crashes, no new jobs can be submitted, and running applications will fail.
 
-By default, there is a single JobManager instance per Flink cluster. This creates a *single point of failure* (SPOF):
-if the JobManager crashes, no new programs can be submitted and running programs fail.
+Configuring [High Availability]({{ site.baseurl }}/ops/jobmanager_high_availability.html), in conjunction with Apache ZooKeeper, allows for a swift recovery and is highly recommended for production setups.
 
-With JobManager High Availability, you can recover from JobManager failures and thereby eliminate the *SPOF*.
-We **strongly recommend** you configure [high availability]({{ site.baseurl }}/ops/jobmanager_high_availability.html) for production.
 
 {% top %}

diff --git a/docs/ops/production_ready.zh.md b/docs/ops/production_ready.zh.md
index d5f7ee1..e5cff46 100644
--- a/docs/ops/production_ready.zh.md
+++ b/docs/ops/production_ready.zh.md
@@ -22,79 +22,55 @@ specific language governing permissions and limitations
 under the License.
 -->
 
+The production readiness checklist provides an overview of configuration options that should be carefully considered before bringing an Apache Flink job into production.
+While the Flink community has attempted to provide sensible defaults for each configuration, it is important to review this list and ensure the options chosen are sufficient for your needs.
+
 * ToC
 {:toc}
 
-## Production Readiness Checklist
-
-Purpose of this production readiness checklist is to provide a condensed overview of configuration options that are
-important and need **careful considerations** if you plan to bring your Flink job into **production**. For most of these options
-Flink provides out-of-the-box defaults to make usage and adoption of Flink easier. For many users and scenarios, those
-defaults are good starting points for development and completely sufficient for "one-shot" jobs.
-
-However, once you are planning to bring a Flink application to production the requirements typically increase. For example,
-you want your job to be (re-)scalable and to have a good upgrade story for your job and new Flink versions.
-
-In the following, we present a collection of configuration options that you should check before your job goes into production.
+### Set An Explicit Max Parallelism
 
-### Set maximum parallelism for operators explicitly
+The max parallelism, set on a per-job and per-operator granularity, determines the maximum parallelism to which a stateful operator can scale.
+There is currently **no way to change** the maximum parallelism of an operator after a job has started without discarding that operator's state.
+The reason maximum parallelism exists, versus allowing stateful operators to be infinitely scalable, is that it has some impact on your application's performance and state size.
+Flink has to maintain specific metadata for its ability to rescale state, which grows linearly with the max parallelism.
+In general, you should choose a max parallelism that is high enough to fit your future needs in scalability, while keeping it low enough to maintain reasonable performance.
 
-Maximum parallelism is a configuration parameter that is newly introduced in Flink 1.2 and has important implications
-for the (re-)scalability of your Flink job. This parameter, which can be set on a per-job and/or per-operator granularity,
-determines the maximum parallelism to which you can scale operators. It is important to understand that (as of now) there
-is **no way to change** this parameter after your job has been started, except for restarting your job completely
-from scratch (i.e. with a new state, and not from a previous checkpoint/savepoint). Even if Flink would provide some way
-to change maximum parallelism for existing savepoints in the future, you can already assume that for large states this is
-likely a long running operation that you want to avoid. At this point, you might wonder why not just to use a very high
-value as default for this parameter. The reason behind this is that high maximum parallelism can have some impact on your
-application's performance and even state sizes, because Flink has to maintain certain metadata for its ability to rescale which
-can increase with the maximum parallelism. In general, you should choose a max parallelism that is high enough to fit your
-future needs in scalability, but keeping it as low as possible can give slightly better performance. In particular,
-a maximum parallelism higher that 128 will typically result in slightly bigger state snapshots from the keyed backends.
+{% panel **Note:** Maximum parallelism must fulfill the following conditions: `0 < parallelism <= max parallelism <= 2^15` %}
 
-Notice that maximum parallelism must fulfill the following conditions:
-
-`0 < parallelism <= max parallelism <= 2^15`
-
-You can set the maximum parallelism by `setMaxParallelism(int maxparallelism)`. By default, Flink will choose the maximum
-parallelism as a function of the parallelism when the job is first started:
+You can explicitly set the maximum parallelism by using `setMaxParallelism(int maxParallelism)`.
+If no max parallelism is set, Flink will decide using a function of the operator's parallelism when the job is first started:
 
 - `128` : for all parallelism <= 128.
 - `MIN(nextPowerOfTwo(parallelism + (parallelism / 2)), 2^15)` : for all parallelism > 128.
 
-### Set UUIDs for operators
+### Set UUIDs For All Operators
 
-As mentioned in the documentation for [savepoints]({{ site.baseurl }}/ops/state/savepoints.html), users should set uids for
-operators. Those operator uids are important for Flink's mapping of operator states to operators which, in turn, is
-essential for savepoints. By default operator uids are generated by traversing the JobGraph and hashing certain operator
-properties. While this is comfortable from a user perspective, it is also very fragile, as changes to the JobGraph (e.g.
-exchanging an operator) will result in new UUIDs. To establish a stable mapping, we need stable operator uids provided
-by the user through `setUid(String uid)`.
+As mentioned in the documentation for [savepoints]({{ site.baseurl }}/ops/state/savepoints.html), users should set uids for each operator in their `DataStream`.
+Uids are necessary for Flink's mapping of operator states to operators which, in turn, is essential for savepoints.
+By default, operator uids are generated by traversing the JobGraph and hashing specific operator properties.
+While this is comfortable from a user perspective, it is also very fragile, as changes to the JobGraph (e.g., exchanging an operator) result in new UUIDs.
+To establish a stable mapping, we need stable operator uids provided by the user through `setUid(String uid)`.
 
-### Choice of state backend
+### Choose The Right State Backend
 
-Currently, Flink has the limitation that it can only restore the state from a savepoint for the same state backend that
-took the savepoint. For example, this means that we can not take a savepoint with a memory state backend, then change
-the job to use a RocksDB state backend and restore. While we are planning to make backends interoperable in the near
-future, they are not yet. This means you should carefully consider which backend you use for your job before going to
-production.
+Currently, Flink's savepoint binary format is state backend specific.
+A savepoint taken with one state backend cannot be restored using another, and you should carefully consider which backend you use before going to production.
 
-In general, we recommend using RocksDB because this is currently the only state backend that supports large states (i.e.
-state that exceeds the available main memory) and asynchronous snapshots. From our experience, asynchronous snapshots are
-very important for large states because they do not block the operators and Flink can write the snapshots without stopping
-stream processing. However, RocksDB can have worse performance than, for example, the memory-based state backends. If
-you are sure that your state will never exceed main memory and blocking the stream processing to write it is not an issue,
-you **could consider** to not use the RocksDB backends. However, at this point, we **strongly recommend** using RocksDB
-for production.
+In general, we recommend avoiding `MemoryStateBackend` in production because it stores its snapshots inside the JobManager as opposed to persistent disk.
+When deciding between `FsStateBackend` and `RocksDB`, it is a choice between performance and scalability.
+`FsStateBackend` is very fast as each state access and update operates on objects on the Java heap; however, state size is limited by available memory within the cluster.
+On the other hand, `RocksDB` can scale based on available disk space and is the only state backend to support incremental snapshots.
+However, each state access and update requires (de-)serialization and potentially reading from disk, which leads to average performance that is an order of magnitude slower than the memory state backends.
+Carefully read through the [state backend documentation]({{ site.baseurl }}/ops/state/state_backends.html) to fully understand the pros and cons of each option.
 
-### Config JobManager High Availability(HA)
+### Configure JobManager High Availability
 
-The JobManager coordinates every Flink deployment. It is responsible for both *scheduling* and *resource management*.
+The JobManager serves as a central coordinator for each Flink deployment, being responsible for both scheduling and resource management of the cluster.
+It is a single point of failure within the cluster, and if it crashes, no new jobs can be submitted, and running applications will fail.
 
-By default, there is a single JobManager instance per Flink cluster. This creates a *single point of failure* (SPOF):
-if the JobManager crashes, no new programs can be submitted and running programs fail.
+Configuring [High Availability]({{ site.baseurl }}/ops/jobmanager_high_availability.html), in conjunction with Apache ZooKeeper, allows for a swift recovery and is highly recommended for production setups.
 
-With JobManager High Availability, you can recover from JobManager failures and thereby eliminate the *SPOF*.
-We **strongly recommend** you configure [high availability]({{ site.baseurl }}/ops/jobmanager_high_availability.html) for production.
 
 {% top %}
+
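As a companion to the updated checklist above, here is a minimal sketch of how the max parallelism settings it describes are applied through the DataStream API. The pipeline, the values `512` and `1024`, and the job name are illustrative only and are not part of the commit:

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MaxParallelismExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Job-wide upper bound for later rescaling; must satisfy 0 < parallelism <= max parallelism <= 2^15.
        env.setMaxParallelism(512);

        env.fromElements("flink", "production", "checklist")
            .map(new MapFunction<String, String>() {
                @Override
                public String map(String value) {
                    return value.toUpperCase();
                }
            })
            .setMaxParallelism(1024) // per-operator override of the job-wide setting
            .print();

        env.execute("max-parallelism-example"); // illustrative job name
    }
}
```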
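Similarly, a small sketch of assigning stable operator ids as recommended in the "Set UUIDs For All Operators" section. It uses the DataStream-level `uid(String)` method; the id strings and sample data are made up for the example:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class OperatorUidExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(Tuple2.of("a", 1), Tuple2.of("b", 2), Tuple2.of("a", 3))
            .keyBy(0)            // key by the String field
            .sum(1)              // stateful per-key aggregation
            .uid("sum-per-key")  // stable id so this operator's state can be matched when restoring a savepoint
            .print();

        env.execute("uid-example"); // illustrative job name
    }
}
```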
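Finally, a sketch of configuring a state backend explicitly, per the "Choose The Right State Backend" section. It assumes the `flink-statebackend-rocksdb` dependency is on the classpath, and the HDFS checkpoint path is a hypothetical placeholder:

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Heap-based state with snapshots written to a durable file system: fast, but bounded by memory.
        env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints")); // hypothetical path

        // Alternatively, RocksDB keeps state on local disk and supports incremental checkpoints
        // (enabled by the second constructor argument); this call overrides the line above.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));

        // ... define and execute the job as usual ...
    }
}
```

If no backend is configured, `MemoryStateBackend` is the default, which is why making an explicit choice is part of the checklist.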