[
https://issues.apache.org/jira/browse/HDDS-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HDDS-2347:
---------------------------------
Labels: pull-request-available (was: )
> XCeiverClientGrpc's parallel use leads to NPE
> ---------------------------------------------
>
> Key: HDDS-2347
> URL: https://issues.apache.org/jira/browse/HDDS-2347
> Project: Hadoop Distributed Data Store
> Issue Type: Improvement
> Components: Ozone Client
> Reporter: Istvan Fajth
> Assignee: Istvan Fajth
> Priority: Major
> Labels: pull-request-available
> Attachments: changes.diff, logs.txt
>
>
> This issue came up when testing Hive with ORC tables on the Ozone storage
> backend. So far I could not reproduce the issue locally within a JUnit
> test.
> I am attaching a diff file that shows the logging I added in
> XCeiverClientGrpc and in KeyInputStream to get the results that led me
> to the following understanding of the scenario:
> - Hive starts a couple of threads to work on the table data during query
> execution
> - There is one RPCClient that is being used by these threads
> - The threads open different streams to read from the same key in Ozone
> - The InputStreams internally are using the same XCeiverClientGrpc
> - XCeiverClientGrpc throws the following NPE intermittently:
> {code}
> Caused by: java.lang.NullPointerException
> at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandAsync(XceiverClientGrpc.java:398)
> at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithRetry(XceiverClientGrpc.java:295)
> at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithTraceIDAndRetry(XceiverClientGrpc.java:259)
> at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommand(XceiverClientGrpc.java:242)
> at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.getBlock(ContainerProtocolCalls.java:118)
> at org.apache.hadoop.hdds.scm.storage.BlockInputStream.getChunkInfos(BlockInputStream.java:169)
> at org.apache.hadoop.hdds.scm.storage.BlockInputStream.initialize(BlockInputStream.java:118)
> at org.apache.hadoop.hdds.scm.storage.BlockInputStream.read(BlockInputStream.java:224)
> at org.apache.hadoop.ozone.client.io.KeyInputStream.read(KeyInputStream.java:173)
> at org.apache.hadoop.fs.ozone.OzoneFSInputStream.read(OzoneFSInputStream.java:52)
> at org.apache.hadoop.fs.FSInputStream.read(FSInputStream.java:75)
> at org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:121)
> at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:112)
> at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:555)
> at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:370)
> at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:61)
> at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:105)
> at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.populateAndCacheStripeDetails(OrcInputFormat.java:1708)
> at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.callInternal(OrcInputFormat.java:1596)
> at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.access$2900(OrcInputFormat.java:1383)
> at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator$1.run(OrcInputFormat.java:1568)
> at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator$1.run(OrcInputFormat.java:1565)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
> at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:1565)
> at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:1383)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> {code}
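The scenario above (several threads sharing one client that only becomes ready on first use) can be sketched in isolation. This is a minimal illustration, not the actual XceiverClientGrpc code: the class `LazyClientSketch`, its `channels` map, and the string "channels" are hypothetical stand-ins, and it assumes the NPE comes from one thread using connection state that another thread is still setting up. It shows the "easy" variant, where the lazy setup is guarded by synchronization.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a client that connects lazily on first use. */
class LazyClientSketch {
  // Hypothetical per-datanode connection table (not the real client's fields).
  private final Map<String, String> channels = new HashMap<>();
  private final AtomicInteger connectCount = new AtomicInteger();

  /** Sends a request; the connection is set up on the first call. */
  synchronized String sendCommand(String datanode, String request) {
    // Check and setup happen under the same lock as the use, so no thread
    // can observe a half-initialized connection.
    String channel = channels.get(datanode);
    if (channel == null) {
      connectCount.incrementAndGet();
      channel = "channel-to-" + datanode;
      channels.put(datanode, channel);
    }
    return channel + ":" + request;
  }

  int connectCount() {
    return connectCount.get();
  }

  /** Runs parallel first requests; returns {failures, connects}. */
  static int[] runParallelReads(int threads) throws InterruptedException {
    LazyClientSketch client = new LazyClientSketch();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    CountDownLatch start = new CountDownLatch(1);
    AtomicInteger failures = new AtomicInteger();
    for (int i = 0; i < threads; i++) {
      final int id = i;
      pool.execute(() -> {
        try {
          start.await(); // release all threads at once to provoke contention
          client.sendCommand("dn1", "getBlock-" + id);
        } catch (Exception e) {
          failures.incrementAndGet();
        }
      });
    }
    start.countDown();
    pool.shutdown();
    pool.awaitTermination(10, TimeUnit.SECONDS);
    return new int[] {failures.get(), client.connectCount()};
  }
}
```

Calling `LazyClientSketch.runParallelReads(8)` exercises eight concurrent first requests against one shared client. The cost of this approach is that every request takes the lock, which is part of why it feels like a workaround rather than a fix.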
> I have two proposals to fix this issue. One is the easy answer: add
> synchronization to the XCeiverClientGrpc code. The other is a bit more
> complicated; let me explain below.
> Naively I would assume that when I get a client SPI instance from
> XCeiverClientManager, that instance is ready to use. In fact it is not: the
> client only becomes ready when the user of the SPI instance sends the first
> request. Adding synchronization to this code is the easy solution, but my
> pragmatic half screams for a better one, which ensures that the Manager
> actually manages the clients that it gives to its users, and that the
> clients themselves do not become ready by accident.
> I am working on a proposal that moves things around a bit, and I am looking
> for other possible solutions that do not feel as hacky as the easy
> solution does.
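One possible shape of the second idea, where the manager only hands out clients that are already ready, is sketched below. This is an assumption-laden illustration and not the proposal being worked on: `ReadyClientManagerSketch`, its nested `Client`, and the string pipeline IDs are all hypothetical names. The point is that the connection is established in the constructor, and the manager creates each client at most once per key.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical manager that only hands out fully initialized clients. */
class ReadyClientManagerSketch {

  /** Hypothetical client: connected in its constructor, immutable afterwards. */
  static final class Client {
    private final String channel; // final field, set before the client is shared

    private Client(String pipelineId) {
      // Stand-in for real connection setup; it completes before any caller
      // can obtain a reference to this client.
      this.channel = "channel-to-" + pipelineId;
    }

    String sendCommand(String request) {
      return channel + ":" + request;
    }
  }

  private final Map<String, Client> clients = new ConcurrentHashMap<>();

  /** Returns the ready client for a pipeline, creating it at most once. */
  Client acquireClient(String pipelineId) {
    // ConcurrentHashMap.computeIfAbsent applies the factory atomically per
    // key, so concurrent callers all receive the same connected instance.
    return clients.computeIfAbsent(pipelineId, Client::new);
  }
}
```

With this layout the request path needs no lock, because a client is never visible to callers before its setup has finished.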
> I am attaching the following:
> - a diff that shows the extended logging added in XCeiverClientGrpc and
> KeyInputStream.
> - a job log snippet from a Hive query that shows the relevant output from the
> extended logging added by the diff, captured on a cluster.
> - later, a proposal for the fix, which I need to work on a bit more.