[ https://issues.apache.org/jira/browse/HDFS-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321392#comment-15321392 ]

Seb Mo commented on HDFS-10504:
-------------------------------

Thank you, Arpit, for your feedback.

I did not complain about the temporary error itself, just about the fact that 
the client holds on to resources when this happens, which can cause the 
application using the client to die with an OutOfMemoryError. I see this as a 
bug in the current Hadoop client code.

> DFSClient filesBeingWritten memory leak when client gets RemoteException - 
> could only be replicated to 0 nodes instead of minReplication (=1)
> ---------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10504
>                 URL: https://issues.apache.org/jira/browse/HDFS-10504
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client
>    Affects Versions: 2.7.2
>         Environment: linux
>            Reporter: Seb Mo
>
> I'm trying to migrate data from NFS to HDFS: about 2 million small files, 
> which takes about 4 hours in my environment. During the migration I 
> randomly get the exception below; I got 12 of them during the test. When I 
> get the exception, I sleep for one second and then check whether the file 
> is there (the API says yes, but its reported size is zero bytes). So I 
> delete the file and start writing it again, and at that point it succeeds. 
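> Roughly, the retry looks like this (copyFile() is a stand-in for my actual 
> write; the rest is the plain Hadoop FileSystem API):
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.ipc.RemoteException;
>
> void writeWithRetry(FileSystem fs, Path src, Path dst) throws Exception {
>     try {
>         copyFile(fs, src, dst);                 // first attempt
>     } catch (RemoteException e) {
>         Thread.sleep(1000L);                    // back off for one second
>         FileStatus st = fs.getFileStatus(dst);  // the API says it exists...
>         if (st.getLen() == 0) {                 // ...but it is zero bytes
>             fs.delete(dst, false);              // remove the empty file
>             copyFile(fs, src, dst);             // retry; this one succeeds
>         }
>     }
> }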
> Here is the stack:
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File xxx/xxx/xxx could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
>       at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1592)
>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3158)
>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3082)
>       at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:822)
>       at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:500)
>       at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>       at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2206)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2202)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2200)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1475)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1412)
>       at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>       at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
>       at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:418)
>       at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:497)
>       at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>       at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>       at com.sun.proxy.$Proxy11.addBlock(Unknown Source)
>       at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1459)
>       at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1255)
>       at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:449)
> When I write, I'm using try-with-resources, which should call close() on 
> the FSDataOutputStream. That triggers dfsClient.endFileLease(fileId), which 
> should remove the reference in DFSClient:
> synchronized (filesBeingWritten) {
>   filesBeingWritten.remove(inodeId);
>   if (filesBeingWritten.isEmpty()) {
>     lastLeaseRenewal = 0;
>   }
> }
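> The writing side is nothing special, just the usual pattern (fs, dst and 
> data come from my surrounding code):
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> void writeFile(FileSystem fs, Path dst, byte[] data) throws Exception {
>     // close() at the end of the try block is what should reach
>     // DFSOutputStream.close() -> dfsClient.endFileLease(fileId)
>     try (FSDataOutputStream out = fs.create(dst)) {
>         out.write(data);
>     }
> }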
> But when the process finishes, I get:
> 2016-06-07 22:26:54,734 - ERROR [Thread-3] (DFSClient.closeAllFilesBeingWritten:940) - Failed to close inode 1675022
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /xxx/xxx/xxx could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
>       at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1592)
>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3158)
>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3082)
>       at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:822)
>       at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:500)
>       at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>       at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2206)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2202)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2200)
> Now, when there is no space left on the datanode, I get this error 
> constantly, and my migration Java client eventually dies with an 
> OutOfMemoryError: DFSClient.filesBeingWritten alone holds almost 1 GB.
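> To make the leak concrete: as I read the 2.7.2 client, fs.create() 
> registers the stream in DFSClient.filesBeingWritten (keyed by inode id), 
> and when close() throws, endFileLease() is never reached, so the entry is 
> never removed. A minimal sketch of the failure pattern against a full 
> single-datanode cluster (fs and payload as above; the path is a 
> placeholder):
> import java.io.IOException;
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> void leakLoop(FileSystem fs, byte[] payload) {
>     for (int i = 0; i < 2_000_000; i++) {
>         try (FSDataOutputStream out =
>                  fs.create(new Path("/migration/f" + i))) {
>             out.write(payload);  // small file, a single block
>         } catch (IOException e) {
>             // the RemoteException above surfaces here; the application
>             // recovers, but the DFSOutputStream stays pinned in
>             // filesBeingWritten, so each iteration leaks one entry
>         }
>     }
> }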


