[
https://issues.apache.org/jira/browse/HDFS-13121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16522419#comment-16522419
]
Zsolt Venczel commented on HDFS-13121:
--------------------------------------
Thank you very much for taking a look, [~jojochuang]!
The current solution throws an IOException at BlockReaderFactory.java#614, which
is handled at BlockReaderFactory.java#631; when this problem occurs, the
following is logged:
{code:java}
2018-06-25 16:58:09,777 [main] WARN impl.BlockReaderFactory (BlockReaderFactory.java:requestFileDescriptors(631)) - BlockReaderFactory(fileName=null, block=BP-778337774-127.0.1.1-1529938688855:blk_1073741825_1001): error creating ShortCircuitReplica.
java.io.IOException: the datanode DatanodeInfoWithStorage[127.0.0.1:41377,DS-83cd5e5c-95bb-4b16-a438-33dfc05608d8,DISK] failed to pass a file descriptor (might have reached open file limit).
    at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.requestFileDescriptors(BlockReaderFactory.java:614)
    at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:553)
    at org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitCache.testRequestFileDescriptorsWhenULimit(TestShortCircuitCache.java:904)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
    at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
    at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
    at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
    at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
    at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:379)
    at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:340)
    at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:125)
    at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:413)
{code}
The warning message only says that the box "might have reached open file
limit" because currently I'm not sure how to tell exactly why the native code
failed to acquire new file descriptors.
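For reference, the guard producing that exception could look like the following
minimal sketch (the {{datanode}} reference and the exact placement are my
assumptions here, not necessarily the verbatim patch):
{code:java}
// Sketch: check right after sock.recvFileInputStreams(fis, buf, 0, buf.length).
// If the native code could not acquire new file descriptors it leaves the
// array entries null, so fail fast with a descriptive IOException instead
// of hitting an NPE in the ShortCircuitReplica constructor later.
if (fis[0] == null || fis[1] == null) {
  throw new IOException("the datanode " + datanode
      + " failed to pass a file descriptor (might have reached open file limit).");
}
{code}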
HDFS-5810 also provides an explanation of why the function returns null in
this scenario:
{code}
// This indicates an error reading from disk, or a format error. Since
// it's not a socket communication problem, we return null rather than
// throwing an exception.
{code}
In my understanding this approach is aligned with the defined workflow for
short-circuit replica handling, as the sketch below illustrates.
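The handling at BlockReaderFactory.java#631 presumably follows this pattern,
pieced together from the warning in the log above and the HDFS-5810 comment
(so not the verbatim code):
{code:java}
try {
  replica = new ShortCircuitReplica(key, fis[0], fis[1], cache,
      Time.monotonicNow(), slot);
  return new ShortCircuitReplicaInfo(replica);
} catch (IOException e) {
  // Per HDFS-5810: a disk read or format error, not a socket communication
  // problem, so log a warning and return null to the caller instead of
  // propagating the exception.
  LOG.warn(this + ": error creating ShortCircuitReplica.", e);
  return null;
}
{code}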
Do you have anything in mind for improving the handling of such a scenario?
> NPE when request file descriptors when SC read
> ----------------------------------------------
>
> Key: HDFS-13121
> URL: https://issues.apache.org/jira/browse/HDFS-13121
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs-client
> Affects Versions: 3.0.0
> Reporter: Gang Xie
> Assignee: Zsolt Venczel
> Priority: Minor
> Attachments: HDFS-13121.01.patch, HDFS-13121.02.patch,
> HDFS-13121.03.patch, HDFS-13121.04.patch, test-only.patch
>
>
> Recently, we hit an issue where the DFSClient throws an NPE. The case is
> that the app process exceeds the limit on the maximum number of open files.
> In that case, libhadoop never throws an exception but returns null for the
> requested fds, and requestFileDescriptors uses the returned fds directly
> without any check, which leads to the NPE.
>
> We need to add a null sanity check here.
>
> {code:java}
> private ShortCircuitReplicaInfo requestFileDescriptors(DomainPeer peer,
>     Slot slot) throws IOException {
>   ShortCircuitCache cache = clientContext.getShortCircuitCache();
>   final DataOutputStream out =
>       new DataOutputStream(new BufferedOutputStream(peer.getOutputStream()));
>   SlotId slotId = slot == null ? null : slot.getSlotId();
>   new Sender(out).requestShortCircuitFds(block, token, slotId, 1,
>       failureInjector.getSupportsReceiptVerification());
>   DataInputStream in = new DataInputStream(peer.getInputStream());
>   BlockOpResponseProto resp = BlockOpResponseProto.parseFrom(
>       PBHelperClient.vintPrefixed(in));
>   DomainSocket sock = peer.getDomainSocket();
>   failureInjector.injectRequestFileDescriptorsFailure();
>   switch (resp.getStatus()) {
>   case SUCCESS:
>     byte buf[] = new byte[1];
>     FileInputStream[] fis = new FileInputStream[2];
>     sock.recvFileInputStreams(fis, buf, 0, buf.length); // <-- may leave fis entries null
>     ShortCircuitReplica replica = null;
>     try {
>       ExtendedBlockId key =
>           new ExtendedBlockId(block.getBlockId(), block.getBlockPoolId());
>       if (buf[0] == USE_RECEIPT_VERIFICATION.getNumber()) {
>         LOG.trace("Sending receipt verification byte for slot {}", slot);
>         sock.getOutputStream().write(0);
>       }
>       replica = new ShortCircuitReplica(key, fis[0], fis[1], cache, // <-- NPE when fis[0]/fis[1] are null
>           Time.monotonicNow(), slot);
> {code}