[jira] [Commented] (HDDS-2347) XCeiverClientGrpc's parallel use leads to NPE

2019-10-24 Thread Istvan Fajth (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958858#comment-16958858
 ] 

Istvan Fajth commented on HDDS-2347:


I have added a PR that fixes this issue the way Mukul suggested, by just 
adding the necessary synchronization to the XCeiverClientGrpc class and its 
creation. More on the changes can be found in the PR.
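
For illustration only, here is a minimal sketch of what adding synchronization to a lazily initialized client can look like. It is not the code in the PR: LazyClient, Channel and channelFor are hypothetical names, and the real XCeiverClientGrpc is considerably more involved. The assumption is that the per-datanode connection state lives in a plain map that is filled on first use, which is the kind of state that can be observed half-built by a second thread.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a client that connects lazily on first use. */
final class LazyClient {

    /** Hypothetical per-datanode connection handle. */
    static final class Channel {
        private final String node;

        Channel(String node) {
            this.node = node;
        }

        String send(String request) {
            return node + ":" + request;
        }
    }

    // Plain HashMap: not safe for concurrent use unless every access is guarded.
    private final Map<String, Channel> channels = new HashMap<>();

    // synchronized, so only one thread at a time can create or look up a
    // channel, and no thread can see a partially populated map.
    private synchronized Channel channelFor(String node) {
        return channels.computeIfAbsent(node, Channel::new);
    }

    String sendCommand(String node, String request) {
        return channelFor(node).send(request);
    }

    synchronized int channelCount() {
        return channels.size();
    }
}
```

With the lock in place, several threads sharing one LazyClient end up with a single Channel per node regardless of which thread sends the first request.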

I think we need to rework this code to ensure thread safety at object 
creation time. I still need to check a few things, but I think I will defer 
that proposal to a new JIRA linked to this one, as this one needs to be fixed 
shortly and we can evaluate possibly better approaches later on.
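
As a rough idea of what thread safety at object creation time could mean, the sketch below has a manager hand out only clients that are already fully connected. It is an assumption about one possible shape, not the deferred proposal itself: ReadyClient, ClientManager, connect and acquire are hypothetical names, and the string handles stand in for real connections.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical client whose connection state is complete before it is shared. */
final class ReadyClient {
    // Final and unmodifiable: fully built in connect() and never mutated later,
    // so callers need no locking when they send commands.
    private final Map<String, String> channels;

    private ReadyClient(Map<String, String> channels) {
        this.channels = channels;
    }

    /** Builds every per-node handle up front; the handles are placeholders. */
    static ReadyClient connect(List<String> nodes) {
        Map<String, String> built = new HashMap<>();
        for (String node : nodes) {
            built.put(node, "channel-to-" + node);
        }
        return new ReadyClient(Collections.unmodifiableMap(built));
    }

    String sendCommand(String node, String request) {
        String channel = channels.get(node);
        if (channel == null) {
            throw new IllegalArgumentException("unknown node: " + node);
        }
        return channel + ":" + request;
    }
}

/** Hypothetical manager that caches one ready client per pipeline id. */
final class ClientManager {
    private final ConcurrentHashMap<String, ReadyClient> clients =
        new ConcurrentHashMap<>();

    // computeIfAbsent is atomic per key, so concurrent callers asking for the
    // same pipeline receive the same, already connected client.
    ReadyClient acquire(String pipelineId, List<String> nodes) {
        return clients.computeIfAbsent(pipelineId, id -> ReadyClient.connect(nodes));
    }
}
```

The trade-off in this sketch is that connection cost is paid when the client is acquired instead of on the first request, in exchange for a send path that needs no synchronization.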

> XCeiverClientGrpc's parallel use leads to NPE
> -
>
> Key: HDDS-2347
> URL: https://issues.apache.org/jira/browse/HDDS-2347
> Project: Hadoop Distributed Data Store
> Issue Type: Improvement
> Components: Ozone Client
> Reporter: Istvan Fajth
> Assignee: Istvan Fajth
> Priority: Critical
> Labels: pull-request-available
> Attachments: changes.diff, logs.txt
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> This issue came up when testing Hive with ORC tables on the Ozone storage 
> backend. So far I could not reproduce the issue locally within a JUnit test.
> I am attaching a diff file that shows what logging I have added in 
> XCeiverClientGrpc and in KeyInputStream to get the results that led me to 
> the following understanding of the scenario:
> - Hive starts a couple of threads to work on the table data during query 
> execution
> - There is one RPCClient that is being used by these threads
> - The threads are opening different streams to read from the same key in Ozone
> - The InputStreams internally are using the same XCeiverClientGrpc
> - XCeiverClientGrpc throws the following NPE intermittently:
> {code}
> Caused by: java.lang.NullPointerException
>     at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandAsync(XceiverClientGrpc.java:398)
>     at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithRetry(XceiverClientGrpc.java:295)
>     at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithTraceIDAndRetry(XceiverClientGrpc.java:259)
>     at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommand(XceiverClientGrpc.java:242)
>     at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.getBlock(ContainerProtocolCalls.java:118)
>     at org.apache.hadoop.hdds.scm.storage.BlockInputStream.getChunkInfos(BlockInputStream.java:169)
>     at org.apache.hadoop.hdds.scm.storage.BlockInputStream.initialize(BlockInputStream.java:118)
>     at org.apache.hadoop.hdds.scm.storage.BlockInputStream.read(BlockInputStream.java:224)
>     at org.apache.hadoop.ozone.client.io.KeyInputStream.read(KeyInputStream.java:173)
>     at org.apache.hadoop.fs.ozone.OzoneFSInputStream.read(OzoneFSInputStream.java:52)
>     at org.apache.hadoop.fs.FSInputStream.read(FSInputStream.java:75)
>     at org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:121)
>     at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:112)
>     at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:555)
>     at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:370)
>     at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:61)
>     at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:105)
>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.populateAndCacheStripeDetails(OrcInputFormat.java:1708)
>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.callInternal(OrcInputFormat.java:1596)
>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.access$2900(OrcInputFormat.java:1383)
>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator$1.run(OrcInputFormat.java:1568)
>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator$1.run(OrcInputFormat.java:1565)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:1565)
>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:1383)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> {code}
> I have two proposals to fix this issue. One is the easy answer: add 
> synchronization to the XCeiverClientGrpc 

[jira] [Commented] (HDDS-2347) XCeiverClientGrpc's parallel use leads to NPE

2019-10-22 Thread Istvan Fajth (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16956945#comment-16956945
 ] 

Istvan Fajth commented on HDDS-2347:


CC: [~aengineer], [~elek], [~msingh]

> XCeiverClientGrpc's parallel use leads to NPE
> -
>
> Key: HDDS-2347
> URL: https://issues.apache.org/jira/browse/HDDS-2347
> Project: Hadoop Distributed Data Store
> Issue Type: Improvement
> Components: Ozone Client
> Reporter: Istvan Fajth
> Assignee: Istvan Fajth
> Priority: Major
>
> This issue came up when testing Hive with ORC tables on the Ozone storage 
> backend. So far I could not reproduce the issue locally within a JUnit test.
> I am attaching a diff file that shows what logging I have added in 
> XCeiverClientGrpc and in KeyInputStream to get the results that led me to 
> the following understanding of the scenario:
> - Hive starts a couple of threads to work on the table data during query 
> execution
> - There is one RPCClient that is being used by these threads
> - The threads are opening different streams to read from the same key in Ozone
> - The InputStreams internally are using the same XCeiverClientGrpc
> - XCeiverClientGrpc throws the following NPE intermittently:
> {code}
> Caused by: java.lang.NullPointerException
>     at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandAsync(XceiverClientGrpc.java:398)
>     at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithRetry(XceiverClientGrpc.java:295)
>     at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithTraceIDAndRetry(XceiverClientGrpc.java:259)
>     at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommand(XceiverClientGrpc.java:242)
>     at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.getBlock(ContainerProtocolCalls.java:118)
>     at org.apache.hadoop.hdds.scm.storage.BlockInputStream.getChunkInfos(BlockInputStream.java:169)
>     at org.apache.hadoop.hdds.scm.storage.BlockInputStream.initialize(BlockInputStream.java:118)
>     at org.apache.hadoop.hdds.scm.storage.BlockInputStream.read(BlockInputStream.java:224)
>     at org.apache.hadoop.ozone.client.io.KeyInputStream.read(KeyInputStream.java:173)
>     at org.apache.hadoop.fs.ozone.OzoneFSInputStream.read(OzoneFSInputStream.java:52)
>     at org.apache.hadoop.fs.FSInputStream.read(FSInputStream.java:75)
>     at org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:121)
>     at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:112)
>     at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:555)
>     at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:370)
>     at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:61)
>     at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:105)
>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.populateAndCacheStripeDetails(OrcInputFormat.java:1708)
>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.callInternal(OrcInputFormat.java:1596)
>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.access$2900(OrcInputFormat.java:1383)
>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator$1.run(OrcInputFormat.java:1568)
>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator$1.run(OrcInputFormat.java:1565)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:1565)
>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:1383)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> {code}
> I have two proposals to fix this issue. One is the easy answer: add 
> synchronization to the XCeiverClientGrpc code. The other one is a bit more 
> complicated, let me explain below.
> Naively I would assume that when I get a client SPI instance from 
> XCeiverClientManager, that instance is ready to use. In fact it is not: the 
> client only becomes ready when the user of the SPI instance sends the first 
> request. Adding synchronization to this code is the easy solution, but my 
> pragmatic half screams for a better one, in which the Manager actually 
> manages the clients that it gives to its users, and the clients do not get 
> ready by accident.
> I am