Seb Mo created HDFS-10504:
-----------------------------
Summary: DFSClient filesBeingWritten memory leak when client gets RemoteException - could only be replicated to 0 nodes instead of minReplication (=1)
Key: HDFS-10504
URL: https://issues.apache.org/jira/browse/HDFS-10504
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs-client
Affects Versions: 2.7.2
Environment: linux
Reporter: Seb Mo
I'm trying to migrate data from NFS to HDFS: about 2 million small files. The migration takes about 4 hours in my environment, but I randomly get an exception while it runs; I got 12 of them during the test (stack below).
When I get the exception, I sleep for one second and then check whether the file is there (the API says yes, but its reported size is zero bytes). So I delete the file and start writing it again, and at that point it succeeds.
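For reference, the retry workaround looks roughly like this. This is a minimal sketch, not my actual migration code: writeFile() and the single fixed retry are simplifications for illustration.

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    class MigrationSketch {
      // Write one file; try-with-resources guarantees close() is called.
      static void writeFile(FileSystem fs, Path dst, byte[] data) throws IOException {
        try (FSDataOutputStream out = fs.create(dst, true)) {
          out.write(data);
        }
      }

      // On failure: sleep, delete the zero-byte ghost file, then write again.
      static void writeWithRetry(FileSystem fs, Path dst, byte[] data)
          throws IOException, InterruptedException {
        try {
          writeFile(fs, dst, data);
        } catch (IOException e) { // the RemoteException surfaces as an IOException
          Thread.sleep(1000);
          // exists() says yes, but the reported size is zero bytes
          if (fs.exists(dst) && fs.getFileStatus(dst).getLen() == 0) {
            fs.delete(dst, false);
          }
          writeFile(fs, dst, data); // the second attempt succeeds
        }
      }
    }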
Here is the stack:
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File xxx/xxx/xxx could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1592)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3158)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3082)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:822)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:500)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2206)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2202)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2200)
    at org.apache.hadoop.ipc.Client.call(Client.java:1475)
    at org.apache.hadoop.ipc.Client.call(Client.java:1412)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
    at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:418)
    at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy11.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1459)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1255)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:449)
When writing, I use try-with-resources, which should call close() on the FSDataOutputStream. That triggers dfsClient.endFileLease(fileId), which should remove the reference from DFSClient's filesBeingWritten map:
    synchronized (filesBeingWritten) {
      filesBeingWritten.remove(inodeId);
      if (filesBeingWritten.isEmpty()) {
        lastLeaseRenewal = 0;
      }
    }
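As far as I can tell from the 2.7.2 source, the entry is not removed because endFileLease(fileId) sits on the success path of close(). A paraphrased sketch of DFSOutputStream#close (my reading of the code, not a verbatim copy):

    // DFSOutputStream#close, paraphrased from my reading of 2.7.2:
    public synchronized void close() throws IOException {
      try {
        flushBuffer();                  // flush from all upper layers
        flushInternal();                // waits on the DataStreamer and rethrows
                                        // its "could only be replicated" exception
        closeThreads(false);
        completeFile(lastBlock);
        dfsClient.endFileLease(fileId); // never reached when flushInternal throws,
                                        // so the filesBeingWritten entry leaks
      } catch (ClosedChannelException e) {
      } finally {
        closeThreads(true);
      }
    }

So every file whose pipeline fails leaves its DFSOutputStream pinned in filesBeingWritten until the client shuts down.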
But when the process finishes, I get:
2016-06-07 22:26:54,734 - ERROR [Thread-3] (DFSClient.closeAllFilesBeingWritten:940) - Failed to close inode 1675022
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /xxx/xxx/xxx could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1592)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3158)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3082)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:822)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:500)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2206)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2202)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2200)
Now, when there is no space on the datanode, I get this error a lot, and my migration Java client eventually dies with an OutOfMemoryError: the cause is DFSClient.filesBeingWritten, which grows to almost 1 GB of heap.
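If that reading of close() is right, one possible fix (a sketch only, untested) would be to release the lease unconditionally, e.g. by moving endFileLease(fileId) into the finally block:

    // Hypothetical restructuring of DFSOutputStream#close (untested sketch):
    public synchronized void close() throws IOException {
      try {
        flushBuffer();
        flushInternal();
        closeThreads(false);
        completeFile(lastBlock);
      } catch (ClosedChannelException e) {
      } finally {
        // always drop the filesBeingWritten entry, even when the pipeline failed
        dfsClient.endFileLease(fileId);
        closeThreads(true);
      }
    }

With that, a client that keeps hitting the replication error would still fail the individual writes, but would not accumulate dead streams until it runs out of heap.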