[ 
https://issues.apache.org/jira/browse/HIVE-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

agate updated HIVE-5172:
------------------------

    Affects Version/s: 0.10.0
                       0.11.0
    
> TUGIContainingTransport returning null transport, causing intermittent 
> SocketTimeoutException on hive client and NullPointerException in 
> TUGIBasedProcessor on the server
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-5172
>                 URL: https://issues.apache.org/jira/browse/HIVE-5172
>             Project: Hive
>          Issue Type: Bug
>          Components: Metastore
>    Affects Versions: 0.9.0, 0.10.0, 0.11.0
>            Reporter: agate
>         Attachments: HIVE-5172.1.patch.txt
>
>
> We are running into frequent problem using HCatalog 0.4.1 (Hive Metastore 
> Server 0.9) where we get connection reset or connection timeout errors on the 
> client and NullPointerException in TUGIBasedProcessor on the server. 
> {code}
> hive client logs:
> =================
> org.apache.thrift.transport.TTransportException: 
> java.net.SocketTimeoutException: Read timed out
> at 
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
> at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
> at 
> org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
> at 
> org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
> at 
> org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
> at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_set_ugi(ThriftHiveMetastore.java:2136)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.set_ugi(ThriftHiveMetastore.java:2122)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.openStore(HiveMetaStoreClient.java:286)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:197)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:157)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2092)
> at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2102)
> at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:888)
> at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:830)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:954)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7524)
> at 
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
> at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
> at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:341)
> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:642)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> Caused by: java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:129)
> at 
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
> ... 31 more
> {code}
> {code}
> hive metastore server logs:
> ===============================
> 2013-07-26 06:34:52,853 ERROR server.TThreadPoolServer 
> (TThreadPoolServer.java:run(182)) - Error occurred during processing of 
> message.
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.hive.metastore.TUGIBasedProcessor.setIpAddress(TUGIBasedProcessor.java:183)
>         at 
> org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:79)
>         at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:176)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> {code}
> Adding some extra debug log messages in TUGIBasedProcessor, noticed that the 
> TUGIContainingTransport is null which results in NullPointerException on the 
> server.
> Further drilling into TUGIContainingTransport noticed that getTransport() 
> returns a null which causes the above 
> error. 
> Further corelating with GC logs observed that that error always hits when the 
> CMS GC has just kicked in 
> (but does not happen after every GC)
> Put some debugging code in TUGIContainingTransport.getTransport() and I 
> tracked it down to 
> {code}
> @Override
> public TUGIContainingTransport getTransport(TTransport trans) {
>       // UGI information is not available at connection setup time, it will 
> be set later
>       // via set_ugi() rpc.
>       transMap.putIfAbsent(trans, new TUGIContainingTransport(trans));
>       //return transMap.get(trans); //<-change
>       TUGIContainingTransport retTrans = transMap.get(trans);
>       if ( retTrans == null ) {
>              LOGGER.error ("cannot find transport that was in map !!")
>            return retTrans;
>        }  else {
>              LOGGER.debug ("found transport in map")
>              return retTrans;    
>        }
> }
> {code}
> When we run this in our test environment, see that we run into the problem 
> just after GC runs,
> and "cannot find transport that was in the map!!" message gets logged.
> Could the GC be collecting entries from transMap, just before the we get it 
> Tried a minor change which seems to work
> {code}
> public TUGIContainingTransport getTransport(TTransport trans) {
>    TUGIContainingTransport tugiTrans = transMap.get(trans);
>     if ( tugiTrans == null ) {
>       // UGI information is not available at connection setup time, it will 
> be set later
>       // via set_ugi() rpc.
>        tugiTrans = new TUGIContainingTransport(trans); //get a new 
> TUGIContainingTransport
>         transMap.putIfAbsent(trans, tugiTrans);
>     }
>    return tugiTrans;
> }
> {code}
> My questions for hive and  thrift experts
> 1.) Do we need to use a ConcurrentMap
> {code}
> ConcurrentMap<TTransport, TUGIContainingTransport> transMap = new 
> MapMaker().weakKeys().weakValues().makeMap();
> {code}
> It does use == to compare keys (which might be the problem), also in this 
> case we cant rely on the trans to be always there in the transMap, even after 
> a put, so in that case change above probably makes sense
> 2.) Is it better idea to use WeakHashMap with WeakReference value instead ? 
> (was looking at org.apache.thrift.transport.TSaslServerTransport, especially 
> change made by THRIFT-1468 as an example)
> e.g.
> {code}
> private static Map<TTransport, WeakReference<TUGIContainingTransport>> 
> transMap = Collections.synchronizedMap(new WeakHashMap<TTransport, 
> WeakReference<TUGIContainingTransport>>());
> {code}
> getTransport() would be something like
> {code}
> public TUGIContainingTransport getTransport(TTransport trans) {
>       WeakReference<TUGIContainingTransport> ret = transMap.get(trans);
>       if (ret == null || ret.get() == null) {
>         ret = new WeakReference<TUGIContainingTransport>(new 
> TUGIContainingTransport(trans));  
>         transMap.put(trans, ret); // No need for putIfAbsent().
>          // Concurrent calls to getTransport() will pass in different 
> TTransports.
>       }
>       return ret.get();
> }
> {code}
> 3.) Is it absolutely essential that successive calls to get getTransport() 
> need to retrieve the same TUGIContainingTransport object or its ok to create 
> a new one, if its not available in the map ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to