[ https://issues.apache.org/jira/browse/HDDS-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lokesh Jain resolved HDDS-2347.
-------------------------------
    Fix Version/s: 0.5.0
       Resolution: Fixed

> XCeiverClientGrpc's parallel use leads to NPE
> ---------------------------------------------
>
>                 Key: HDDS-2347
>                 URL: https://issues.apache.org/jira/browse/HDDS-2347
>             Project: Hadoop Distributed Data Store
>          Issue Type: Improvement
>          Components: Ozone Client
>            Reporter: Istvan Fajth
>            Assignee: Istvan Fajth
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.5.0
>
>         Attachments: changes.diff, logs.txt
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> This issue came up when testing Hive with ORC tables on an Ozone storage
> backend; so far I could not reproduce it locally within a JUnit test.
> I am attaching a diff file that shows the logging I have added in
> XCeiverClientGrpc and in KeyInputStream to get the results that led me
> to the following understanding of the scenario:
> - Hive starts a couple of threads to work on the table data during query
> execution
> - There is one RPCClient that is being used by these threads
> - The threads are opening different streams to read from the same key in Ozone
> - The InputStreams internally are using the same XCeiverClientGrpc
> - XCeiverClientGrpc throws the following NPE intermittently:
> {code}
> Caused by: java.lang.NullPointerException
>     at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandAsync(XceiverClientGrpc.java:398)
>     at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithRetry(XceiverClientGrpc.java:295)
>     at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithTraceIDAndRetry(XceiverClientGrpc.java:259)
>     at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommand(XceiverClientGrpc.java:242)
>     at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.getBlock(ContainerProtocolCalls.java:118)
>     at org.apache.hadoop.hdds.scm.storage.BlockInputStream.getChunkInfos(BlockInputStream.java:169)
>     at org.apache.hadoop.hdds.scm.storage.BlockInputStream.initialize(BlockInputStream.java:118)
>     at org.apache.hadoop.hdds.scm.storage.BlockInputStream.read(BlockInputStream.java:224)
>     at org.apache.hadoop.ozone.client.io.KeyInputStream.read(KeyInputStream.java:173)
>     at org.apache.hadoop.fs.ozone.OzoneFSInputStream.read(OzoneFSInputStream.java:52)
>     at org.apache.hadoop.fs.FSInputStream.read(FSInputStream.java:75)
>     at org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:121)
>     at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:112)
>     at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:555)
>     at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:370)
>     at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:61)
>     at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:105)
>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.populateAndCacheStripeDetails(OrcInputFormat.java:1708)
>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.callInternal(OrcInputFormat.java:1596)
>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.access$2900(OrcInputFormat.java:1383)
>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator$1.run(OrcInputFormat.java:1568)
>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator$1.run(OrcInputFormat.java:1565)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:1565)
>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:1383)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> {code}
> I have two proposals to fix this issue: one is the easy answer, putting synchronization into
the XCeiverClientGrpc code; the other one is a bit more
> complicated, let me explain below.
> Naively I would assume that when I get a client SPI instance from
> XCeiverClientManager, that instance is ready to use. In fact it is not: the
> client only becomes essentially ready at the point when the user of the SPI
> instance sends the first request. Putting synchronization into this code is
> the easy solution, but my pragmatic half screams for a better one, one that
> ensures that the Manager essentially manages the clients it is giving to its
> users, and that the clients themselves are not getting ready by accident.
> I am working on a proposal that moves things around a bit, and looking for
> possible other solutions that do not feel as hacky as the easy one does.
> I am attaching the following:
> - a diff that shows the added extended logging in XCeiverClientGrpc and
> KeyInputStream
> - a job log snippet from a Hive query on a cluster that shows the relevant
> output produced by that extended logging
> - later, a proposal for the fix, which I need to work on a bit more

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
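[Editor's note] The two proposals in the issue description can be illustrated with a minimal, self-contained sketch. All names below (`Hdds2347Sketch`, `SynchronizedLazyClient`, `ManagedClient`, `ClientManagerSketch`, `Channel`, `acquireClient`) are simplified stand-ins, not the real Ozone APIs, and the channel map is an assumption about the shape of the lazy initialization. The first variant keeps lazy initialization but makes the check-then-connect step atomic (the "easy answer"); the second has the manager hand out clients that are already connected, so a caller can never observe a half-initialized client.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class Hdds2347Sketch {

    /** Stand-in for a per-datanode gRPC channel. */
    static class Channel {
        String send(String req) { return "ok:" + req; }
    }

    /**
     * Proposal 1 (the "easy answer"): keep lazy initialization, but make the
     * check-then-connect step atomic. Without that atomicity, two threads
     * sharing one client can interleave between the null check and the
     * channel creation, and one of them can observe a missing channel --
     * the NullPointerException from the stack trace above.
     */
    static class SynchronizedLazyClient {
        private final Map<String, Channel> channels = new ConcurrentHashMap<>();

        String sendCommand(String datanode, String req) {
            // computeIfAbsent is atomic per key: the channel is fully
            // created before any concurrent caller can read it.
            Channel ch = channels.computeIfAbsent(datanode, d -> new Channel());
            return ch.send(req);
        }
    }

    /**
     * Proposal 2: the manager gives out clients that are connected up front,
     * instead of clients that "get ready by accident" on their first request.
     */
    static class ManagedClient {
        private final Channel channel;
        ManagedClient(Channel ch) { this.channel = ch; } // ready at construction
        String sendCommand(String req) { return channel.send(req); }
    }

    static class ClientManagerSketch {
        private final Map<String, ManagedClient> cache = new ConcurrentHashMap<>();

        ManagedClient acquireClient(String pipelineId) {
            // The connect step runs inside the atomic mapping function, so
            // every caller receives a fully initialized client.
            return cache.computeIfAbsent(pipelineId,
                id -> new ManagedClient(new Channel()));
        }
    }

    public static void main(String[] args) throws Exception {
        // Stress the shared client the way Hive's split-generator threads do.
        SynchronizedLazyClient shared = new SynchronizedLazyClient();
        ExecutorService pool = Executors.newFixedThreadPool(8);
        AtomicInteger failures = new AtomicInteger();
        for (int i = 0; i < 200; i++) {
            final int n = i;
            pool.submit(() -> {
                try {
                    shared.sendCommand("dn-" + (n % 3), "getBlock-" + n);
                } catch (RuntimeException e) {
                    failures.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("failures=" + failures.get());  // prints failures=0

        ManagedClient c = new ClientManagerSketch().acquireClient("pipeline-1");
        System.out.println(c.sendCommand("getBlock"));      // prints ok:getBlock
    }
}
```

The design point of the second variant is that readiness is established before the reference escapes the manager, which removes the shared mutable "not yet connected" state entirely rather than guarding it with a lock.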