Github user alpinegizmo commented on a diff in the pull request:
https://github.com/apache/flink/pull/3259#discussion_r99376685
--- Diff: docs/ops/production_ready.md ---
@@ -0,0 +1,88 @@
+---
+title: "Production Readiness Checklist"
+nav-parent_id: setup
+nav-pos: 20
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* ToC
+{:toc}
+
+## Production Readiness Checklist
+
+The purpose of this production readiness checklist is to provide a condensed
overview of configuration options that are
+important and need **careful consideration** if you plan to bring your
Flink job into **production**. For most of these options,
+Flink provides out-of-the-box defaults to make usage and adoption of Flink
easier. For many users and scenarios, those
+defaults are good starting points for development and completely
sufficient for "one-shot" jobs.
+
+However, once you plan to bring a Flink application to production,
the requirements typically increase. For example,
+you want your job to be (re-)scalable and to have a good upgrade story for
your job and new Flink versions.
+
+In the following, we present a collection of configuration options that
you should check before your job goes into production.
+
+### Set maximum parallelism for operators explicitly
+
+Maximum parallelism is a configuration parameter that is newly introduced
in Flink 1.2 and has important implications
+for the (re-)scalability of your Flink job. This parameter, which can be
set on a per-job and/or per-operator granularity,
+determines the maximum parallelism to which you can scale operators. It is
important to understand that (as of now) there
+is **no way to increase** this parameter after your job has been
started, except for restarting your job completely
+from scratch (i.e. with a new state, and not from a previous
checkpoint/savepoint). Even if Flink provides some way
+to change the maximum parallelism for existing savepoints in the future, you
can already assume that for large states this is
+likely to be a long-running operation that you want to avoid. At this point, you
might wonder why you should not just use a very high
+value as the default for this parameter. The reason is that a high
maximum parallelism can have some impact on your
+application's performance and even state sizes, because Flink has to
maintain certain metadata for its ability to rescale, and this metadata
+can grow with the maximum parallelism. In general, you should choose a
max parallelism that is high enough to fit your
+future needs in scalability, but keeping it as low as possible can give
slightly better performance. In particular,
+a maximum parallelism higher than 128 will typically result in slightly
bigger state snapshots from the keyed backends.
+
+Note that the maximum parallelism must fulfill the following conditions:
+
+`0 < parallelism <= max parallelism <= 2^15`
+
+You can set the maximum parallelism with `setMaxParallelism(int
maxparallelism)`. By default, Flink chooses the maximum
+parallelism as a function of the parallelism when the job is first started:
+
+- `128` : for all parallelism <= 128.
+- `MIN(nextPowerOfTwo(parallelism + (parallelism / 2)), 2^15)` : for all
parallelism > 128.
+
+### Set UUIDs for operators
+
+As mentioned in the documentation for [savepoints]({{ site.baseurl
}}/setup/savepoints.html), users should set uids for
+operators. Those operator uids are important for Flink's mapping of
operator states to operators which, in turn, is
+essential for savepoints. By default, operator uids are generated by
traversing the JobGraph and hashing certain operator
+properties. While this is convenient from a user perspective, it is also
very fragile with respect to changes to the JobGraph (e.g.
+if you want to exchange an operator). To establish a stable mapping, we
need stable operator uids provided by the user
+through `setUid(String uid)`.
+
+### Choice of state backend
+
+Currently, Flink has the limitation that it can only restore the state
from a savepoint for the same state backend that
+took the savepoint. For example, this means that we cannot take a
savepoint with a memory state backend, then change
+the job to use RocksDB state backend and restore. While we are planning to
make backends interoperable in the near
--- End diff --
to use a RocksDB
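
As a side note on the default maximum parallelism rule quoted in the diff above, a small standalone sketch may help readers check concrete values. It only encodes the two bullet points and the `0 < parallelism <= max parallelism <= 2^15` condition as they are written in the diff; it is not Flink's own implementation. The class and method names are hypothetical, and it assumes `nextPowerOfTwo` rounds up so that an exact power of two maps to itself.

```java
// Illustrative sketch of the rule as stated in the quoted docs, not Flink code.
// All names here (MaxParallelismSketch, defaultMaxParallelism, ...) are hypothetical.
final class MaxParallelismSketch {
    static final int LOWER_BOUND = 128;      // default for parallelism <= 128, per the text
    static final int UPPER_BOUND = 1 << 15;  // 2^15 = 32768, the stated hard limit

    // Rounds x up to a power of two; assumes exact powers of two map to themselves.
    static int nextPowerOfTwo(int x) {
        return x <= 1 ? 1 : Integer.highestOneBit(x - 1) << 1;
    }

    // Default max parallelism as a function of the initial parallelism,
    // following the two bullet points in the quoted docs.
    static int defaultMaxParallelism(int parallelism) {
        if (parallelism <= 0 || parallelism > UPPER_BOUND) {
            throw new IllegalArgumentException("parallelism must be in (0, 2^15]");
        }
        if (parallelism <= LOWER_BOUND) {
            return LOWER_BOUND;
        }
        return Math.min(nextPowerOfTwo(parallelism + parallelism / 2), UPPER_BOUND);
    }

    // Checks the stated condition: 0 < parallelism <= maxParallelism <= 2^15.
    static boolean isValid(int parallelism, int maxParallelism) {
        return 0 < parallelism
                && parallelism <= maxParallelism
                && maxParallelism <= UPPER_BOUND;
    }

    public static void main(String[] args) {
        for (int p : new int[] {1, 128, 129, 200, 1000, 30000}) {
            System.out.println(p + " -> " + defaultMaxParallelism(p));
        }
    }
}
```

Under these assumptions, a job first started with parallelism 200 would get a default of 512, so it could later be scaled up to at most 512 without starting from fresh state.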