Github user StephanEwen commented on a diff in the pull request:
https://github.com/apache/flink/pull/2465#discussion_r79474627
--- Diff: docs/setup/config.md ---
@@ -169,58 +169,111 @@ Default value is the `akka.ask.timeout`.
These parameters configure the default HDFS used by Flink. Setups that do
not specify an HDFS configuration have to specify the full path to HDFS files
(`hdfs://address:port/path/to/files`). Files will also be written with default
HDFS parameters (block size, replication factor).
- `fs.hdfs.hadoopconf`: The absolute path to the Hadoop configuration
directory. The system will look for the "core-site.xml" and "hdfs-site.xml"
files in that directory (DEFAULT: null).
+
- `fs.hdfs.hdfsdefault`: The absolute path of Hadoop's own configuration
file "hdfs-default.xml" (DEFAULT: null).
+
- `fs.hdfs.hdfssite`: The absolute path of Hadoop's own configuration file
"hdfs-site.xml" (DEFAULT: null).
### JobManager & TaskManager
The following parameters configure Flink's JobManager and TaskManagers.
- `jobmanager.rpc.address`: The IP address of the JobManager, which is the
master/coordinator of the distributed system (DEFAULT: **localhost**).
+
- `jobmanager.rpc.port`: The port number of the JobManager (DEFAULT:
**6123**).
+
- `taskmanager.hostname`: The hostname of the network interface that the
TaskManager binds to. By default, the TaskManager searches for network
interfaces that can connect to the JobManager and other TaskManagers. This
option can be used to define a hostname if that strategy fails for some reason.
Because different TaskManagers need different values for this option, it
usually is specified in an additional non-shared TaskManager-specific config
file.
+
- `taskmanager.rpc.port`: The task manager's IPC port (DEFAULT: **0**,
which lets the OS choose a free port).
+
- `taskmanager.data.port`: The task manager's port used for data exchange
operations (DEFAULT: **0**, which lets the OS choose a free port).
+
- `jobmanager.heap.mb`: JVM heap size (in megabytes) for the JobManager
(DEFAULT: **256**).
+
- `taskmanager.heap.mb`: JVM heap size (in megabytes) for the
TaskManagers, which are the parallel workers of the system. In contrast to
Hadoop, Flink runs operators (e.g., join, aggregate) and user-defined functions
(e.g., Map, Reduce, CoGroup) inside the TaskManager (including
sorting/hashing/caching), so this value should be as large as possible
(DEFAULT: **512**). On YARN setups, this value is automatically configured to
the size of the TaskManager's YARN container, minus a certain tolerance value.
+
- `taskmanager.numberOfTaskSlots`: The number of parallel operator or user
function instances that a single TaskManager can run (DEFAULT: **1**). If this
value is larger than 1, a single TaskManager takes multiple instances of a
function or operator. That way, the TaskManager can utilize multiple CPU cores,
but at the same time, the available memory is divided between the different
operator or function instances. This value is typically proportional to the
number of physical CPU cores that the TaskManager's machine has (e.g., equal to
the number of cores, or half the number of cores).
+
- `taskmanager.tmp.dirs`: The directory for temporary files, or a list of
directories separated by the system's directory delimiter (for example ':'
(colon) on Linux/Unix). If multiple directories are specified, then the
temporary files will be distributed across the directories in a round-robin
fashion. The I/O manager component will spawn one reading and one writing
thread per directory. A directory may be listed multiple times to have the I/O
manager use multiple threads for it (for example if it is physically stored on
a very fast disk or RAID) (DEFAULT: **The system's tmp dir**).
+
- `taskmanager.network.numberOfBuffers`: The number of buffers available
to the network stack. This number determines how many streaming data exchange
channels a TaskManager can have at the same time and how well buffered the
channels are. If a job is rejected or you get a warning that the system does
not have enough buffers available, increase this value (DEFAULT: **2048**).
+
- `taskmanager.memory.size`: The amount of memory (in megabytes) that the
task manager reserves on the JVM's heap space for sorting, hash tables, and
caching of intermediate results. If unspecified (-1), the memory manager will
take a fixed ratio of the heap memory available to the JVM, as specified by
`taskmanager.memory.fraction`. (DEFAULT: **-1**)
+
- `taskmanager.memory.fraction`: The relative amount of memory that the
task manager reserves for sorting, hash tables, and caching of intermediate
results. For example, a value of 0.8 means that TaskManagers reserve 80% of the
JVM's heap space for internal data buffers, leaving 20% of the JVM's heap space
free for objects created by user-defined functions. (DEFAULT: **0.7**) This
parameter is only evaluated if `taskmanager.memory.size` is not set.
+
- `taskmanager.debug.memory.startLogThread`: Causes the TaskManagers to
periodically log memory and garbage collection statistics. The statistics
include current heap, off-heap, and other memory pool utilization, as well as
the time spent on garbage collection, by heap memory pool.
+
- `taskmanager.debug.memory.logIntervalMs`: The interval (in milliseconds)
in which the TaskManagers log the memory and garbage collection statistics.
Only has an effect if `taskmanager.debug.memory.startLogThread` is set to true.
+
- `blob.fetch.retries`: The number of retries for the TaskManager to
download BLOBs (such as JAR files) from the JobManager (DEFAULT: **50**).
+
- `blob.fetch.num-concurrent`: The number of concurrent BLOB fetches (such as
JAR file downloads) that the JobManager serves (DEFAULT: **50**).
+
- `blob.fetch.backlog`: The maximum number of queued BLOB fetches (such as
JAR file downloads) that the JobManager allows (DEFAULT: **1000**).
-- `task.cancellation-interval`: Time interval between two successive task
cancellation attempts in milliseconds (DEFAULT: **30000**).
+- `task.cancellation-interval`: Time interval between two successive task
cancellation attempts in milliseconds (DEFAULT: **30000**).
### Distributed Coordination (via Akka)
- `akka.ask.timeout`: Timeout used for all futures and blocking Akka
calls. If Flink fails due to timeouts then you should try to increase this
value. Timeouts can be caused by slow machines or a congested network. The
timeout value requires a time-unit specifier (ms/s/min/h/d) (DEFAULT: **10 s**).
+
- `akka.lookup.timeout`: Timeout used for the lookup of the JobManager.
The timeout value has to contain a time-unit specifier (ms/s/min/h/d) (DEFAULT:
**10 s**).
+
- `akka.framesize`: Maximum size of messages which are sent between the
JobManager and the TaskManagers. If Flink fails because messages exceed this
limit, then you should increase it. The message size requires a size-unit
specifier (DEFAULT: **10485760b**).
+
- `akka.watch.heartbeat.interval`: Heartbeat interval for Akka's
DeathWatch mechanism to detect dead TaskManagers. If TaskManagers are wrongly
marked dead because of lost or delayed heartbeat messages, then you should
increase this value. A thorough description of Akka's DeathWatch can be found
[here](http://doc.akka.io/docs/akka/snapshot/scala/remoting.html#failure-detector)
(DEFAULT: **akka.ask.timeout/10**).
+
- `akka.watch.heartbeat.pause`: Acceptable heartbeat pause for Akka's
DeathWatch mechanism. A low value does not allow an irregular heartbeat. A
thorough description of Akka's DeathWatch can be found
[here](http://doc.akka.io/docs/akka/snapshot/scala/remoting.html#failure-detector)
(DEFAULT: **akka.ask.timeout**).
+
- `akka.watch.threshold`: Threshold for the DeathWatch failure detector. A
low value is prone to false positives whereas a high value increases the time
to detect a dead TaskManager. A thorough description of Akka's DeathWatch can
be found
[here](http://doc.akka.io/docs/akka/snapshot/scala/remoting.html#failure-detector)
(DEFAULT: **12**).
+
- `akka.transport.heartbeat.interval`: Heartbeat interval for Akka's
transport failure detector. Since Flink uses TCP, the detector is not
necessary. Therefore, the detector is disabled by setting the interval to a
very high value. In case you should need the transport failure detector, set
the interval to some reasonable value. The interval value requires a time-unit
specifier (ms/s/min/h/d) (DEFAULT: **1000 s**).
+
- `akka.transport.heartbeat.pause`: Acceptable heartbeat pause for Akka's
transport failure detector. Since Flink uses TCP, the detector is not
necessary. Therefore, the detector is disabled by setting the pause to a very
high value. In case you should need the transport failure detector, set the
pause to some reasonable value. The pause value requires a time-unit specifier
(ms/s/min/h/d) (DEFAULT: **6000 s**).
+
- `akka.transport.threshold`: Threshold for the transport failure
detector. Since Flink uses TCP, the detector is not necessary and, thus, the
threshold is set to a high value (DEFAULT: **300**).
+
- `akka.tcp.timeout`: Timeout for all outbound connections. If you should
experience problems with connecting to a TaskManager due to a slow network, you
should increase this value (DEFAULT: **akka.ask.timeout**).
+
- `akka.throughput`: Number of messages that are processed in a batch
before returning the thread to the pool. Low values denote a fair scheduling
whereas high values can increase the performance at the cost of unfairness
(DEFAULT: **15**).
+
- `akka.log.lifecycle.events`: Turns on Akka's remote logging of
events. Set this value to 'true' in case of debugging (DEFAULT: **false**).
+
- `akka.startup-timeout`: Timeout after which the startup of a remote
component is considered to have failed (DEFAULT: **akka.ask.timeout**).
+### Network communication (via Netty)
+
+- `taskmanager.net.num-arenas`: The number of Netty arenas (DEFAULT:
**taskmanager.numberOfTaskSlots**).
--- End diff --
I think there is no good reason not to have it aligned with the number of
threads. On the other hand, since the server and client thread counts can
differ, it may make sense to configure this.
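
To make the default under discussion concrete, here is a minimal sketch of how it could be resolved, assuming the configuration is available as a plain dictionary of keys to values. The helper name `resolve_num_arenas` is hypothetical and is not Flink's actual code; only the two config keys and the slot default of 1 come from the diff above.

```python
def resolve_num_arenas(config):
    """Return the Netty arena count for a TaskManager (illustrative only).

    Falls back to `taskmanager.numberOfTaskSlots` (documented default: 1)
    when `taskmanager.net.num-arenas` is not set explicitly.
    """
    slots = int(config.get("taskmanager.numberOfTaskSlots", 1))
    arenas = config.get("taskmanager.net.num-arenas")
    # An explicit setting wins; otherwise the arena count follows the slots.
    return slots if arenas is None else int(arenas)


if __name__ == "__main__":
    print(resolve_num_arenas({"taskmanager.numberOfTaskSlots": "4"}))
    print(resolve_num_arenas({"taskmanager.numberOfTaskSlots": "4",
                              "taskmanager.net.num-arenas": "2"}))
```

Keeping the key overridable covers the case where the thread counts on the server and client side differ from the slot count.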