[ 
https://issues.apache.org/jira/browse/HDFS-13121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16522419#comment-16522419
 ] 

Zsolt Venczel commented on HDFS-13121:
--------------------------------------

Thank you very much for taking a look [~jojochuang]!

The current solution throws an IOException at BlockReaderFactory.java#614, which 
is handled at BlockReaderFactory.java#631; when this problem occurs, the 
following is logged:
{code:java}
2018-06-25 16:58:09,777 [main] WARN  impl.BlockReaderFactory (BlockReaderFactory.java:requestFileDescriptors(631)) - BlockReaderFactory(fileName=null, block=BP-778337774-127.0.1.1-1529938688855:blk_1073741825_1001): error creating ShortCircuitReplica.
java.io.IOException: the datanode DatanodeInfoWithStorage[127.0.0.1:41377,DS-83cd5e5c-95bb-4b16-a438-33dfc05608d8,DISK] failed to pass a file descriptor (might have reached open file limit).
        at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.requestFileDescriptors(BlockReaderFactory.java:614)
        at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:553)
        at org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitCache.testRequestFileDescriptorsWhenULimit(TestShortCircuitCache.java:904)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
        at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
        at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
        at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
        at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
        at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
        at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
        at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
        at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
        at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
        at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
        at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
        at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
        at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
        at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
        at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
        at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
        at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:379)
        at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:340)
        at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:125)
        at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:413)
{code}
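
For reference, here is a minimal sketch of the kind of null check that produces 
the IOException above (the names {{fis}} and {{datanode}} follow the 
requestFileDescriptors code quoted below; the exact wording of the committed 
patch may differ):
{code:java}
// Hedged sketch, not the verbatim patch: after recvFileInputStreams() the
// fis entries can still be null when the native code could not pass the
// descriptors (for example, the process hit its open-file limit).
// Throwing here lets the caller log the warning shown above instead of
// failing later with an NPE.
if (fis[0] == null || fis[1] == null) {
  throw new IOException("the datanode " + datanode
      + " failed to pass a file descriptor (might have reached open file limit).");
}
{code}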
The warning message also states only that the box "might have reached open file 
limit", because I'm currently not sure how to determine exactly why the native 
code failed to acquire new file descriptors.

There is also an explanation, provided by HDFS-5810, of why null is returned by 
the function in this scenario:
{code}
        // This indicates an error reading from disk, or a format error.  Since
        // it's not a socket communication problem, we return null rather than
        // throwing an exception.
{code}
As I understand it, this approach is aligned with the defined workflow for 
short-circuit replica handling.
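
To illustrate, a minimal sketch of that contract (the helper 
openReplicaFromDisk and the surrounding structure are hypothetical, not the 
actual Hadoop implementation):
{code:java}
// Hedged sketch of the HDFS-5810 contract quoted above: local disk or
// format errors surface as a null ShortCircuitReplicaInfo, while socket
// communication problems are allowed to propagate as exceptions.
ShortCircuitReplicaInfo createShortCircuitReplicaInfo() {
  try {
    ShortCircuitReplica replica = openReplicaFromDisk(); // hypothetical helper
    return new ShortCircuitReplicaInfo(replica);
  } catch (IOException e) {
    // Not a socket communication problem, so return null rather than throw.
    LOG.warn(this + ": error creating ShortCircuitReplica.", e);
    return null;
  }
}
{code}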

Do you have anything in mind for improving the handling of such a scenario?

> NPE when request file descriptors when SC read
> ----------------------------------------------
>
>                 Key: HDFS-13121
>                 URL: https://issues.apache.org/jira/browse/HDFS-13121
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client
>    Affects Versions: 3.0.0
>            Reporter: Gang Xie
>            Assignee: Zsolt Venczel
>            Priority: Minor
>         Attachments: HDFS-13121.01.patch, HDFS-13121.02.patch, 
> HDFS-13121.03.patch, HDFS-13121.04.patch, test-only.patch
>
>
> Recently, we hit an issue where the DFSClient throws an NPE. The case is that 
> the app process exceeds its limit on the maximum number of open files. In that 
> case, libhadoop never throws an exception but returns null in response to the 
> request for fds. But requestFileDescriptors uses the returned fds directly 
> without any check, and then hits the NPE. 
>  
> We need to add a null-pointer sanity check here.
>  
> {code:java}
> private ShortCircuitReplicaInfo requestFileDescriptors(DomainPeer peer,
>     Slot slot) throws IOException {
>   ShortCircuitCache cache = clientContext.getShortCircuitCache();
>   final DataOutputStream out =
>       new DataOutputStream(new BufferedOutputStream(peer.getOutputStream()));
>   SlotId slotId = slot == null ? null : slot.getSlotId();
>   new Sender(out).requestShortCircuitFds(block, token, slotId, 1,
>       failureInjector.getSupportsReceiptVerification());
>   DataInputStream in = new DataInputStream(peer.getInputStream());
>   BlockOpResponseProto resp = BlockOpResponseProto.parseFrom(
>       PBHelperClient.vintPrefixed(in));
>   DomainSocket sock = peer.getDomainSocket();
>   failureInjector.injectRequestFileDescriptorsFailure();
>   switch (resp.getStatus()) {
>   case SUCCESS:
>     byte buf[] = new byte[1];
>     FileInputStream[] fis = new FileInputStream[2];
>     sock.recvFileInputStreams(fis, buf, 0, buf.length);  // <-- highlighted in the original report
>     ShortCircuitReplica replica = null;
>     try {
>       ExtendedBlockId key =
>           new ExtendedBlockId(block.getBlockId(), block.getBlockPoolId());
>       if (buf[0] == USE_RECEIPT_VERIFICATION.getNumber()) {
>         LOG.trace("Sending receipt verification byte for slot {}", slot);
>         sock.getOutputStream().write(0);
>       }
>       replica = new ShortCircuitReplica(key, fis[0], fis[1], cache,  // <-- highlighted in the original report
>           Time.monotonicNow(), slot);
> {code}


