[
https://issues.apache.org/jira/browse/HADOOP-19061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17813527#comment-17813527
]
ASF GitHub Bot commented on HADOOP-19061:
-----------------------------------------
xinglin commented on PR #6519:
URL: https://github.com/apache/hadoop/pull/6519#issuecomment-1923208190
> > The new code would remove that bad connection object and a new good one
> > will be created next time.
>
> Do we log the bad connection which was cleaned up?
>
> What happens if there is an OOM when creating the new connection?
> Wondering if we can get into a live-lock.
>
> Please do excuse the many questions, I'm not familiar with this code so
> this is to help my understanding.

I made some code changes. I prefer to have a single try-catch block to avoid
code duplication.

Totally valid question, and I'd have to chat with the Gobblin team to
understand more about their use case and confirm whether this PR would indeed
mitigate their job hangs.

If we remain low on memory, we will keep getting OOMs and, yes, we will get
into a live-lock. The difference is that in the original implementation, a
single OOM leaves this JVM with a bad connection permanently, and all RPC
requests to that destination are blocked forever. For Gobblin, they would have
to manually kill that JVM (and create a new one). With the new code, we don't
have that problem and can tolerate a certain degree of OOMs. The expectation
is that we won't be hitting OOM all the time. If we are, we should look for a
memory leak or bump the memory/heap size.
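
To make the difference concrete, here is a minimal, self-contained sketch of
the pattern. It is not the actual patch: the class name
ConnectionCleanupSketch, the simplified pool, and the failSenderStart flag
(which simulates the OOM from Thread.start()) are all hypothetical and exist
only for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicBoolean;

public class ConnectionCleanupSketch {
  // Hypothetical stand-in for the client's connection pool, keyed by remote id.
  static final ConcurrentMap<String, Connection> connections =
      new ConcurrentHashMap<>();

  static class Connection extends Thread {
    final String remoteId;
    // Assumption for the demo: when true, starting the sender "fails with OOM".
    final boolean failSenderStart;
    final AtomicBoolean shouldClose = new AtomicBoolean(false);
    // Stand-in for the request-sending thread; the real one drains a queue.
    final Thread sender = new Thread(() -> { });

    Connection(String remoteId, boolean failSenderStart) {
      this.remoteId = remoteId;
      this.failSenderStart = failSenderStart;
    }

    void startSender() {
      if (failSenderStart) {
        throw new OutOfMemoryError("unable to create new native thread (simulated)");
      }
      sender.start();
    }

    @Override
    public void run() {
      try {
        // Starting the sender inside the try block means a failure here
        // still reaches the cleanup below.
        startSender();
        // The response-reading loop would go here.
      } catch (Throwable t) {
        System.err.println("Unexpected error on connection " + remoteId + ": " + t);
      } finally {
        close();
      }
    }

    void close() {
      shouldClose.set(true);
      // Remove only this object, so later callers get a fresh connection.
      connections.remove(remoteId, this);
    }
  }

  static Connection getConnection(String remoteId, boolean failSenderStart) {
    Connection created = new Connection(remoteId, failSenderStart);
    Connection existing = connections.putIfAbsent(remoteId, created);
    if (existing != null) {
      return existing;
    }
    created.start();
    return created;
  }

  public static void main(String[] args) throws Exception {
    Connection bad = getConnection("nn01", true);
    bad.join();
    System.out.println("pool still holds bad connection: "
        + connections.containsKey("nn01"));
    Connection next = getConnection("nn01", false);
    next.join();
    System.out.println("next call created a new object: " + (next != bad));
  }
}
```

In this sketch, the failed connection removes itself from the pool, so the
next getConnection() call builds a new object instead of handing back the dead
one. If the sender start were outside the try block, the simulated error would
end run() before close() and the dead object would stay in the pool.
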
> Capture exception in rpcRequestSender.start() in IPC.Connection.run()
> ---------------------------------------------------------------------
>
> Key: HADOOP-19061
> URL: https://issues.apache.org/jira/browse/HADOOP-19061
> Project: Hadoop Common
> Issue Type: Bug
> Components: ipc
> Affects Versions: 3.5.0
> Reporter: Xing Lin
> Assignee: Xing Lin
> Priority: Major
> Labels: pull-request-available
>
> rpcRequestThread.start() can fail due to OOM. This immediately crashes the
> Connection thread without removing the connection from the connections pool.
> All subsequent getConnection(remoteid) calls then return this bad connection
> object, and all RPC requests hang, because its threads are not running
> (neither the Connection thread nor the Connection.rpcRequestSender thread is
> running, due to the OOM).
> In this PR, we moved rpcRequestThread.start() into the try{}-catch{} block
> to capture an OOM from rpcRequestThread.start(), so that proper cleanup
> follows if we hit one.
> {code:java}
> // IPC.Connection.run()
> @Override
> public void run() {
>   // Don't start the ipc parameter sending thread until we start this
>   // thread, because the shutdown logic only gets triggered if this
>   // thread is started.
>   rpcRequestThread.start();
>   if (LOG.isDebugEnabled())
>     LOG.debug(getName() + ": starting, having connections "
>         + connections.size());
>   try {
>     while (waitForWork()) {//wait here for work - read or close connection
>       receiveRpcResponse();
>     }
>   } catch (Throwable t) {
>     // This truly is unexpected, since we catch IOException in receiveResponse
>     // -- this is only to be really sure that we don't leave a client hanging
>     // forever.
>     LOG.warn("Unexpected error reading responses on connection " + this, t);
>     markClosed(new IOException("Error reading responses", t));
>   }{code}
> Because there is no rpcRequestSender thread consuming the rpcRequestQueue,
> every RPC request enqueue operation for this connection blocks and hangs
> forever in this while loop in sendRpcRequest().
> {code:java}
> while (!shouldCloseConnection.get()) {
>   if (rpcRequestQueue.offer(Pair.of(call, buf), 1, TimeUnit.SECONDS)) {
>     break;
>   }
> }{code}
> OOM exception when starting the rpcRequestSender thread:
> {code:java}
> Exception in thread "IPC Client (1664093259) connection to nn01.grid.linkedin.com/IP-Address:portNum from kafkaetl"
> java.lang.OutOfMemoryError: unable to create new native thread
>   at java.lang.Thread.start0(Native Method)
>   at java.lang.Thread.start(Thread.java:717)
>   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1034)
> {code}
> Multiple threads are blocked in queue.offer(), and we did not find any
> "IPC Client" or "IPC Parameter Sending Thread" threads in the thread dump.
> {code:java}
> Thread 2156123: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
>  - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) @bci=20, line=215 (Compiled frame)
>  - java.util.concurrent.SynchronousQueue$TransferQueue.awaitFulfill(java.util.concurrent.SynchronousQueue$TransferQueue$QNode, java.lang.Object, boolean, long) @bci=156, line=764 (Compiled frame)
>  - java.util.concurrent.SynchronousQueue$TransferQueue.transfer(java.lang.Object, boolean, long) @bci=148, line=695 (Compiled frame)
>  - java.util.concurrent.SynchronousQueue.offer(java.lang.Object, long, java.util.concurrent.TimeUnit) @bci=24, line=895 (Compiled frame)
>  - org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(org.apache.hadoop.ipc.Client$Call) @bci=88, line=1134 (Compiled frame)
>  - org.apache.hadoop.ipc.Client.call(org.apache.hadoop.ipc.RPC$RpcKind, org.apache.hadoop.io.Writable, org.apache.hadoop.ipc.Client$ConnectionId, int, java.util.concurrent.atomic.AtomicBoolean, org.apache.hadoop.ipc.AlignmentContext) @bci=36, line=1402 (Interpreted frame)
>  - org.apache.hadoop.ipc.Client.call(org.apache.hadoop.ipc.RPC$RpcKind, org.apache.hadoop.io.Writable, org.apache.hadoop.ipc.Client$ConnectionId, java.util.concurrent.atomic.AtomicBoolean, org.apache.hadoop.ipc.AlignmentContext) @bci=9, line=1349 (Compiled frame)
>  - org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(java.lang.Object, java.lang.reflect.Method, java.lang.Object[]) @bci=248, line=230 (Compiled frame)
>  - org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(java.lang.Object, java.lang.reflect.Method, java.lang.Object[]) @bci=4, line=118 (Compiled frame)
> - com.sun.proxy.$Proxy11.getBlockLocations({code}
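
The blocked offer() frames in the thread dump follow from how
java.util.concurrent.SynchronousQueue works: a timed offer() only succeeds
if another thread takes the element within the timeout. The sketch below
illustrates that behavior in isolation. The class name
OfferWithoutConsumerSketch and the maxAttempts bound are hypothetical; the
bound is added only so the demo terminates, whereas the loop quoted in the
issue has no such bound.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class OfferWithoutConsumerSketch {
  // Retries a timed offer until it succeeds, the close flag is set, or the
  // demo-only attempt bound is reached. Returns the number of failed offers.
  static int tryEnqueue(SynchronousQueue<String> queue, AtomicBoolean shouldClose,
                        int maxAttempts, long timeoutMillis)
      throws InterruptedException {
    int failed = 0;
    while (!shouldClose.get() && failed < maxAttempts) {
      if (queue.offer("call", timeoutMillis, TimeUnit.MILLISECONDS)) {
        break;
      }
      failed++;
    }
    return failed;
  }

  public static void main(String[] args) throws Exception {
    SynchronousQueue<String> queue = new SynchronousQueue<>();
    AtomicBoolean shouldClose = new AtomicBoolean(false);

    // No consumer thread: every offer times out.
    int failed = tryEnqueue(queue, shouldClose, 3, 50);
    System.out.println("failed offers with no consumer: " + failed);

    // With a consumer thread taking from the queue, an offer eventually succeeds.
    Thread consumer = new Thread(() -> {
      try {
        queue.take();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    consumer.start();
    int failedWithConsumer = tryEnqueue(queue, shouldClose, 100, 50);
    consumer.join();
    System.out.println("offer succeeded with a consumer: " + (failedWithConsumer < 100));
  }
}
```

Without a consumer thread and without the close flag being set, the only exit
from the loop in this sketch is the artificial attempt bound, which matches
the hang described in the issue.
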