Greg Hogan created FLINK-2865:
---------------------------------
Summary: OutOfMemory error (Direct buffer memory)
Key: FLINK-2865
URL: https://issues.apache.org/jira/browse/FLINK-2865
Project: Flink
Issue Type: Bug
Components: Distributed Runtime
Affects Versions: 0.10
Reporter: Greg Hogan
I see the following TaskManager error at startup when using off-heap memory and
a relatively high number of network buffers. Either setting
{{taskmanager.memory.off-heap: false}} or halving the number of network buffers
(6 GB instead of 12 GB) results in a successful start.
{noformat}
18:17:25,912 WARN org.apache.hadoop.util.NativeCodeLoader
- Unable to load native-hadoop library for your platform... using builtin-java
classes where applicable
18:17:26,024 INFO org.apache.flink.runtime.taskmanager.TaskManager
-
--------------------------------------------------------------------------------
18:17:26,024 INFO org.apache.flink.runtime.taskmanager.TaskManager
- Starting TaskManager (Version: 0.10-SNAPSHOT, Rev:d047ddb, Date:18.10.2015
@ 08:54:59 UTC)
18:17:26,025 INFO org.apache.flink.runtime.taskmanager.TaskManager
- Current user: ec2-user
18:17:26,025 INFO org.apache.flink.runtime.taskmanager.TaskManager
- JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.60-b23
18:17:26,025 INFO org.apache.flink.runtime.taskmanager.TaskManager
- Maximum heap size: 5104 MiBytes
18:17:26,025 INFO org.apache.flink.runtime.taskmanager.TaskManager
- JAVA_HOME: /usr/java/latest
18:17:26,026 INFO org.apache.flink.runtime.taskmanager.TaskManager
- Hadoop version: 2.3.0
18:17:26,026 INFO org.apache.flink.runtime.taskmanager.TaskManager
- JVM Options:
18:17:26,026 INFO org.apache.flink.runtime.taskmanager.TaskManager
- -Xms5325M
18:17:26,026 INFO org.apache.flink.runtime.taskmanager.TaskManager
- -Xmx5325M
18:17:26,026 INFO org.apache.flink.runtime.taskmanager.TaskManager
- -XX:MaxDirectMemorySize=53248M
18:17:26,026 INFO org.apache.flink.runtime.taskmanager.TaskManager
-
-Dlog.file=/home/ec2-user/flink/log/flink-ec2-user-taskmanager-0-ip-10-0-98-3.log
18:17:26,027 INFO org.apache.flink.runtime.taskmanager.TaskManager
- -Dlog4j.configuration=file:/home/ec2-user/flink/conf/log4j.properties
18:17:26,027 INFO org.apache.flink.runtime.taskmanager.TaskManager
- -Dlogback.configurationFile=file:/home/ec2-user/flink/conf/logback.xml
18:17:26,027 INFO org.apache.flink.runtime.taskmanager.TaskManager
- Program Arguments:
18:17:26,027 INFO org.apache.flink.runtime.taskmanager.TaskManager
- --configDir
18:17:26,027 INFO org.apache.flink.runtime.taskmanager.TaskManager
- /home/ec2-user/flink/conf
18:17:26,027 INFO org.apache.flink.runtime.taskmanager.TaskManager
- --streamingMode
18:17:26,027 INFO org.apache.flink.runtime.taskmanager.TaskManager
- batch
18:17:26,027 INFO org.apache.flink.runtime.taskmanager.TaskManager
-
--------------------------------------------------------------------------------
18:17:26,033 INFO org.apache.flink.runtime.taskmanager.TaskManager
- Maximum number of open file descriptors is 1048576
18:17:26,051 INFO org.apache.flink.runtime.taskmanager.TaskManager
- Loading configuration from /home/ec2-user/flink/conf
18:17:26,079 INFO org.apache.flink.runtime.taskmanager.TaskManager
- Security is not enabled. Starting non-authenticated TaskManager.
18:17:26,094 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils
- Trying to select the network interface and address to use by connecting to
the leading JobManager.
18:17:26,094 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils
- TaskManager will try to connect for 10000 milliseconds before falling back
to heuristics
18:17:26,097 INFO org.apache.flink.runtime.net.ConnectionUtils
- Retrieved new target address /127.0.0.1:6123.
18:17:26,461 INFO org.apache.flink.runtime.taskmanager.TaskManager
- TaskManager will use hostname/address 'ip-10-0-98-3' (10.0.98.3) for
communication.
18:17:26,462 INFO org.apache.flink.runtime.taskmanager.TaskManager
- Starting TaskManager in streaming mode BATCH_ONLY
18:17:26,462 INFO org.apache.flink.runtime.taskmanager.TaskManager
- Starting TaskManager actor system at 10.0.98.3:0
18:17:26,735 INFO akka.event.slf4j.Slf4jLogger
- Slf4jLogger started
18:17:26,767 INFO Remoting
- Starting remoting
18:17:26,877 INFO Remoting
- Remoting started; listening on addresses :[akka.tcp://flink@10.0.98.3:47484]
18:17:26,881 INFO org.apache.flink.runtime.taskmanager.TaskManager
- Starting TaskManager actor
18:17:26,925 INFO org.apache.flink.runtime.io.network.netty.NettyConfig
- NettyConfig [server address: ip-10-0-98-3/10.0.98.3, server port: 45728,
memory segment size (bytes): 32768, transport type: NIO, number of server
threads: 0 (use Netty's default), number of client threads: 0 (use Netty's
default), server connect backlog: 0 (use Netty's default), client connect
timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
18:17:26,927 INFO org.apache.flink.runtime.taskmanager.TaskManager
- Messages between TaskManager and JobManager have a max timeout of 100000
milliseconds
18:17:26,931 INFO org.apache.flink.runtime.taskmanager.TaskManager
- Temporary file directory '/volumes/xvdb/tmp': total 319 GB, usable 319 GB
(100.00% usable)
18:17:26,931 INFO org.apache.flink.runtime.taskmanager.TaskManager
- Temporary file directory '/volumes/xvdc/tmp': total 319 GB, usable 319 GB
(100.00% usable)
18:17:32,194 INFO org.apache.flink.runtime.io.network.buffer.NetworkBufferPool
- Allocated 12288 MB for network buffer pool (number of memory segments:
393216, bytes per segment: 32768).
18:17:32,195 INFO org.apache.flink.runtime.taskmanager.TaskManager
- Using 0.9 of the maximum memory size for Flink managed off-heap memory
(45940 MB).
18:17:50,371 ERROR org.apache.flink.runtime.taskmanager.TaskManager
- Error while starting up taskManager
java.lang.Exception: OutOfMemory error (Direct buffer memory) while allocating
the TaskManager off-heap memory (48172092966 bytes). Try increasing the maximum
direct memory (-XX:MaxDirectMemorySize)
at
org.apache.flink.runtime.taskmanager.TaskManager$.startTaskManagerComponentsAndActor(TaskManager.scala:1633)
at
org.apache.flink.runtime.taskmanager.TaskManager$.runTaskManager(TaskManager.scala:1460)
at
org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1325)
at
org.apache.flink.runtime.taskmanager.TaskManager$.main(TaskManager.scala:1235)
at
org.apache.flink.runtime.taskmanager.TaskManager.main(TaskManager.scala)
Caused by: java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:658)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
at
org.apache.flink.runtime.memory.MemoryManager$HybridOffHeapMemoryPool.<init>(MemoryManager.java:661)
at
org.apache.flink.runtime.memory.MemoryManager.<init>(MemoryManager.java:166)
at
org.apache.flink.runtime.taskmanager.TaskManager$.startTaskManagerComponentsAndActor(TaskManager.scala:1618)
... 4 more
18:17:50,374 ERROR org.apache.flink.runtime.taskmanager.TaskManager
- Failed to run TaskManager.
java.lang.Exception: OutOfMemory error (Direct buffer memory) while allocating
the TaskManager off-heap memory (48172092966 bytes). Try increasing the maximum
direct memory (-XX:MaxDirectMemorySize)
at
org.apache.flink.runtime.taskmanager.TaskManager$.startTaskManagerComponentsAndActor(TaskManager.scala:1633)
at
org.apache.flink.runtime.taskmanager.TaskManager$.runTaskManager(TaskManager.scala:1460)
at
org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1325)
at
org.apache.flink.runtime.taskmanager.TaskManager$.main(TaskManager.scala:1235)
at
org.apache.flink.runtime.taskmanager.TaskManager.main(TaskManager.scala)
Caused by: java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:658)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
at
org.apache.flink.runtime.memory.MemoryManager$HybridOffHeapMemoryPool.<init>(MemoryManager.java:661)
at
org.apache.flink.runtime.memory.MemoryManager.<init>(MemoryManager.java:166)
at
org.apache.flink.runtime.taskmanager.TaskManager$.startTaskManagerComponentsAndActor(TaskManager.scala:1618)
... 4 more
{noformat}
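The figures in the log above can be cross-checked with a little arithmetic. The following is only a sketch, assuming that the network buffer pool and the managed off-heap memory are both allocated as direct buffers and so share the single {{-XX:MaxDirectMemorySize}} budget, and that the managed memory size stays the same when the buffer count is halved. The function name {{direct_demand_mb}} is made up for illustration; all numbers are copied from the log.

```python
# Figures copied from the TaskManager log above.
MIB = 1024 * 1024

max_direct_mb = 53248        # -XX:MaxDirectMemorySize=53248M
segment_bytes = 32768        # bytes per network memory segment
num_buffers = 393216         # taskmanager.network.numberOfBuffers
managed_bytes = 48172092966  # size of the off-heap allocation that failed


def direct_demand_mb(buffers: int) -> float:
    """Network buffer pool plus managed memory, in MiB.

    Assumption: both are direct-buffer allocations that count against
    the same -XX:MaxDirectMemorySize limit.
    """
    network_mb = buffers * segment_bytes / MIB
    managed_mb = managed_bytes / MIB
    return network_mb + managed_mb


# Compare the reported configuration with the halved buffer count.
for buffers in (num_buffers, num_buffers // 2):
    demand = direct_demand_mb(buffers)
    verdict = "exceeds" if demand > max_direct_mb else "fits within"
    print(f"{buffers} buffers: {demand:.0f} MiB {verdict} {max_direct_mb} MiB")
```

Under those assumptions the reported configuration asks for more direct memory than the limit allows, while the halved buffer count stays below it, which would be consistent with the observed behavior.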
{noformat}
################################################################################
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
################################################################################
jobmanager.web.history: 50
taskmanager.debug.memory.startLogThread: true
taskmanager.debug.memory.logIntervalMs: 1000
taskmanager.memory.fraction: 0.9
taskmanager.memory.off-heap: true
taskmanager.runtime.hashjoin-bloom-filters: true
taskmanager.runtime.max-fan: 1024
#==============================================================================
# Common
#==============================================================================
# The host on which the JobManager runs. Only used in non-high-availability mode.
# The JobManager process will use this hostname to bind the listening servers to.
# The TaskManagers will try to connect to the JobManager on that host.
jobmanager.rpc.address: localhost
# The port where the JobManager's main actor system listens for messages.
jobmanager.rpc.port: 6123
# The heap size for the JobManager JVM
jobmanager.heap.mb: 1024
# The heap size for the TaskManager JVM
taskmanager.heap.mb: 53248
# The number of task slots that each TaskManager offers. Each slot runs one parallel pipeline.
taskmanager.numberOfTaskSlots: 32
# The parallelism used for programs that did not specify any other parallelism.
parallelism.default: 32
#==============================================================================
# Web Frontend
#==============================================================================
# The port under which the web-based runtime monitor listens.
# A value of -1 deactivates the web server.
jobmanager.web.port: 8081
# The port under which the standalone web client
# (for job upload and submit) listens.
webclient.port: 8080
# Temporary: Uncomment this to be able to use the new web frontend
jobmanager.new-web-frontend: true
#==============================================================================
# Streaming state checkpointing
#==============================================================================
# The backend that will be used to store operator state checkpoints if
# checkpointing is enabled.
#
# Supported backends: jobmanager, filesystem
state.backend: jobmanager
# Directory for storing checkpoints in a flink supported filesystem
# Note: State backend must be accessible from the JobManager, use file://
# only for local setups.
#
# state.backend.fs.checkpointdir: hdfs://checkpoints
#==============================================================================
# Advanced
#==============================================================================
# The number of buffers for the network stack.
taskmanager.network.numberOfBuffers: 393216
# Directories for temporary files.
#
# Add a delimited list for multiple directories, using the system directory
# delimiter (colon ':' on unix) or a comma, e.g.:
# /data1/tmp:/data2/tmp:/data3/tmp
#
# Note: Each directory entry is read from and written to by a different I/O
# thread. You can include the same directory multiple times in order to create
# multiple I/O threads against that directory. This is for example relevant for
# high-throughput RAIDs.
#
# If not specified, the system-specific Java temporary directory (java.io.tmpdir
# property) is taken.
taskmanager.tmp.dirs: /volumes/xvdb/tmp:/volumes/xvdc/tmp
# Path to the Hadoop configuration directory.
#
# This configuration is used when writing into HDFS. Unless specified otherwise,
# HDFS file creation will use HDFS default settings with respect to block-size,
# replication factor, etc.
#
# You can also directly specify the paths to hdfs-default.xml and hdfs-site.xml
# via keys 'fs.hdfs.hdfsdefault' and 'fs.hdfs.hdfssite'.
#
# fs.hdfs.hadoopconf: /path/to/hadoop/conf/
#==============================================================================
# High Availability
#==============================================================================
# The list of ZooKeeper quorum peers that coordinate the high-availability
# setup. This must be a list of the form
# "host_1[:peerPort[:leaderPort]],host_2[:peerPort[:leaderPort]],..."
#
# recovery.mode: zookeeper
#
# ha.zookeeper.quorum: localhost
{noformat}