Thanks Ashutosh. Filed https://issues.apache.org/jira/browse/HIVE-5172
On Thu, Aug 29, 2013 at 11:53 AM, Ashutosh Chauhan <hashut...@apache.org> wrote:

Thanks Agatea for digging in. Seems like you have hit a bug. Would you mind opening a jira and adding your findings to it?

Thanks,
Ashutosh


On Thu, Aug 29, 2013 at 11:22 AM, agateaaa <agate...@gmail.com> wrote:

Sorry, hit send too soon ...

Hi All:

I put some debugging code in TUGIContainingTransport.getTransport() and tracked the problem down to:

    @Override
    public TUGIContainingTransport getTransport(TTransport trans) {
      // UGI information is not available at connection setup time; it will
      // be set later via the set_ugi() rpc.
      transMap.putIfAbsent(trans, new TUGIContainingTransport(trans));

      // return transMap.get(trans); // <- original line, replaced for debugging
      TUGIContainingTransport retTrans = transMap.get(trans);
      if (retTrans == null) {
        LOGGER.error("cannot find transport that was in map !!");
      } else {
        LOGGER.debug("found transport in map");
      }
      return retTrans;
    }

When we run this in our test environment, we see that we run into the problem just after GC runs, and the "cannot find transport that was in map !!" message gets logged.

Could the GC be collecting entries from transMap just before we get them?

I tried a minor change which seems to work:

    public TUGIContainingTransport getTransport(TTransport trans) {
      TUGIContainingTransport retTrans = transMap.get(trans);
      if (retTrans == null) {
        // UGI information is not available at connection setup time; it will
        // be set later via the set_ugi() rpc. Holding a strong reference to
        // the new transport until we return keeps the weak-valued entry alive.
        retTrans = new TUGIContainingTransport(trans);
        transMap.putIfAbsent(trans, retTrans);
      }
      return retTrans;
    }

My questions for the hive and thrift experts:

1.) Do we need to use a ConcurrentMap?

    ConcurrentMap<TTransport, TUGIContainingTransport> transMap =
        new MapMaker().weakKeys().weakValues().makeMap();

With weakKeys() it uses == to compare keys (which might be the problem). Also, in this case we can't rely on trans always being present in transMap even after a put, so the change above probably makes sense.

2.) Is it a better idea to use a WeakHashMap with WeakReference instead? (I was looking at org.apache.thrift.transport.TSaslServerTransport, esp. the change made by THRIFT-1468.)

e.g.

    private static Map<TTransport, WeakReference<TUGIContainingTransport>> transMap3 =
        Collections.synchronizedMap(
            new WeakHashMap<TTransport, WeakReference<TUGIContainingTransport>>());

getTransport() would be something like:

    public TUGIContainingTransport getTransport(TTransport trans) {
      WeakReference<TUGIContainingTransport> ret = transMap3.get(trans);
      if (ret == null || ret.get() == null) {
        ret = new WeakReference<TUGIContainingTransport>(new TUGIContainingTransport(trans));
        transMap3.put(trans, ret); // No need for putIfAbsent():
        // concurrent calls to getTransport() will pass in different TTransports.
      }
      return ret.get();
    }

I did try the change in 1.) above in our test environment and it does seem to resolve the problem, though I am not sure if I am introducing any other problem.

Can someone help?

Thanks,
Agatea
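For what it's worth, the race suspected in 1.) is easy to demonstrate in isolation. Below is a minimal, self-contained sketch (illustrative only, not Hive code) assuming Guava's MapMaker: with weakValues(), a value that has no strong reference left once putIfAbsent() returns can be collected before the follow-up get(), which then returns null.

<code>
import java.util.concurrent.ConcurrentMap;
import com.google.common.collect.MapMaker;

public class WeakValueRace {
  // Same construction as the metastore's transMap: weak keys and weak values.
  private static final ConcurrentMap<Object, Object> map =
      new MapMaker().weakKeys().weakValues().makeMap();

  public static void main(String[] args) {
    Object key = new Object();
    // No strong reference to the value survives this call.
    map.putIfAbsent(key, new Object());
    System.gc(); // only a hint, but usually enough to clear a weak value
    // Can print "null": the entry was collected between the put and the get,
    // which matches the symptom observed just after GC runs.
    System.out.println(map.get(key));
  }
}
</code>

Holding a strong reference across the put, as in the "minor change" above, closes that window.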
On Thu, Aug 29, 2013 at 10:57 AM, agateaaa <agate...@gmail.com> wrote:

Hi All:

Put some debugging code in TUGIContainingTransport.getTransport() and I tracked it down to ...

[remainder cut off; the complete version of this message is above]


On Wed, Jul 31, 2013 at 9:48 AM, agateaaa <agate...@gmail.com> wrote:

Thanks Nitin

There aren't too many connections in CLOSE_WAIT state, only one or two when we run into this. Most likely it's because of a dropped connection.

I could not find any read or write timeouts we can set for the thrift server which would tell thrift to hold on to the client connection. See https://issues.apache.org/jira/browse/HIVE-2006, but that doesn't seem to have been implemented yet. We do have a client connection timeout set, but cannot find an equivalent setting for the server.

We have a suspicion that this happens when we run two client processes which modify two distinct partitions of the same hive table. We put in a workaround so that the two hive client processes never run together, and so far things look ok, but we will keep monitoring.

Could it be because the hive metastore server is not thread safe? Would running two alter table statements on two distinct partitions of the same table, using two client connections, cause problems like these, where the hive metastore server closes or drops the wrong client connection and leaves the other hanging?

Agateaaa
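For reference, the client-side timeout mentioned above is presumably the hive.metastore.client.socket.timeout property. A minimal sketch of setting it programmatically (the 60-second value is illustrative, and it is an assumption that this is the knob the poster's setup uses):

<code>
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;

public class ClientTimeoutExample {
  public static void main(String[] args) throws Exception {
    HiveConf conf = new HiveConf();
    // Read timeout on the client's thrift socket, in seconds
    // (illustrative value).
    conf.set("hive.metastore.client.socket.timeout", "60");
    HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
    client.close();
  }
}
</code>

As noted, there appears to be no matching server-side read timeout, which is exactly what HIVE-2006 asks for.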
On Tue, Jul 30, 2013 at 12:49 AM, Nitin Pawar <nitinpawar...@gmail.com> wrote:

The mentioned flow is called when you are using the unsecured mode of the thrift metastore client-server connection, so one way to avoid this is to use a secure setup.

<code>
public boolean process(final TProtocol in, final TProtocol out) throws TException {
  setIpAddress(in);
  ...
}

@Override
protected void setIpAddress(final TProtocol in) {
  TUGIContainingTransport ugiTrans =
      (TUGIContainingTransport) in.getTransport();
  Socket socket = ugiTrans.getSocket();
  if (socket != null) {
    setIpAddress(socket);
  }
}
</code>

From the above code snippet, it looks like a null pointer exception is not handled if getSocket() returns null.

Can you check what the ulimit setting is on the server? If it is set to the default, can you set it to unlimited and restart the hcat server? (This is just a wild guess.)

Also, the getSocket() documentation says: "If the underlying TTransport is an instance of TSocket, it returns the Socket object which it contains. Otherwise it returns null."

So one of the thrift gurus needs to tell us what is happening here; I have no knowledge of this depth. Maybe Ashutosh or Thejas will be able to help on this.

From the netstat CLOSE_WAIT, it looks like the hive metastore server has not closed the connection (I do not know why yet); maybe the hive dev guys can help. Are there too many connections in CLOSE_WAIT state?
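If the weak-value race sketched earlier is the culprit, in.getTransport() can hand back null (the protocol having been built over a transport the factory failed to find), and ugiTrans.getSocket() would then throw exactly the NullPointerException shown in the server log below. A defensive guard might look like the following (hypothetical sketch only; the LOGGER field is assumed, and the actual fix tracked in HIVE-5172 went into the transport factory instead):

<code>
@Override
protected void setIpAddress(final TProtocol in) {
  TUGIContainingTransport ugiTrans =
      (TUGIContainingTransport) in.getTransport();
  if (ugiTrans == null) {
    // Hypothetical guard: without it, a collected transport entry
    // surfaces here as a NullPointerException.
    LOGGER.warn("no TUGIContainingTransport for this connection;"
        + " skipping IP address resolution");
    return;
  }
  Socket socket = ugiTrans.getSocket();
  if (socket != null) {
    setIpAddress(socket);
  }
}
</code>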
On Tue, Jul 30, 2013 at 5:52 AM, agateaaa <agate...@gmail.com> wrote:

Looking at the hive metastore server logs, we see errors like these:

2013-07-26 06:34:52,853 ERROR server.TThreadPoolServer (TThreadPoolServer.java:run(182)) - Error occurred during processing of message.
java.lang.NullPointerException
    at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.setIpAddress(TUGIBasedProcessor.java:183)
    at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:79)
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:176)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

at approximately the same time as we see the timeout or connection reset errors.

I don't know whether this is the cause or a side effect of the connection timeout/connection reset errors. Does anybody have any pointers or suggestions?

Thanks


On Mon, Jul 29, 2013 at 11:29 AM, agateaaa <agate...@gmail.com> wrote:

Thanks Nitin!

We have a similar setup (identical hcatalog and hive server versions) in another production environment and don't see any errors there (it's been running ok for a few months).

Unfortunately we won't be able to move to hcat 0.5 and hive 0.11 or hive 0.10 soon.

I did see that the last time we ran into this problem, doing a netstat -ntp | grep ":10000" showed that the server was holding on to one socket connection in CLOSE_WAIT state for a long time (the hive metastore server is running on port 10000). Don't know if that's relevant here or not.

Can you suggest any hive configuration settings we can tweak, or networking tools/tips we can use to narrow this down?

Thanks
Agateaaa


On Mon, Jul 29, 2013 at 11:02 AM, Nitin Pawar <nitinpawar...@gmail.com> wrote:

Is there any chance you can update a test environment to hcat 0.5 and hive 0.11 (or 0.10) and see if you can reproduce the issue?

We used to see this error when there was load on the hcat server, or when there was a network issue connecting to the server (the second one was a rare occurrence).

--
Nitin Pawar


On Mon, Jul 29, 2013 at 11:13 PM, agateaaa <agate...@gmail.com> wrote:

Hi All:

We are running into a frequent problem using HCatalog 0.4.1 (Hive Metastore Server 0.9) where we get connection reset or connection timeout errors.

The hive metastore server has been allocated enough (12G) memory.

This is a critical problem for us and we would appreciate it if anyone has any pointers.

We did add retry logic in our client, which seems to help (a sketch of such a wrapper follows at the end of this message), but I am just wondering how we can narrow down to the root cause of this problem. Could this be a hiccup in networking which causes the hive server to get into an unresponsive state?

Thanks

Agateaaa


Example connection reset error:
=======================

org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
    at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_set_ugi(ThriftHiveMetastore.java:2136)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.set_ugi(ThriftHiveMetastore.java:2122)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.openStore(HiveMetaStoreClient.java:286)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:197)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:157)
    at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2092)
    at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2102)
    at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:888)
    at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeAlterTableAddParts(DDLSemanticAnalyzer.java:1817)
    at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:297)
    at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:341)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:642)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:168)
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
    ... 30 more


Example connection timeout error:
==========================

org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
    at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_set_ugi(ThriftHiveMetastore.java:2136)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.set_ugi(ThriftHiveMetastore.java:2122)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.openStore(HiveMetaStoreClient.java:286)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:197)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:157)
    at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2092)
    at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2102)
    at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:888)
    at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:830)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:954)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7524)
    at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:341)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:642)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:129)
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
    ... 31 more
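A minimal sketch of the kind of client-side retry wrapper described above (illustrative only; the class name and retry policy are hypothetical, not the poster's actual code). Hive 0.10+ also ships org.apache.hadoop.hive.metastore.RetryingMetaStoreClient, which covers the same need.

<code>
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;

public class RetryingGetTable {
  private static final int MAX_ATTEMPTS = 3;    // hypothetical policy
  private static final long BACKOFF_MS = 1000L; // hypothetical backoff base

  public static Table getTableWithRetry(HiveConf conf, String db, String tbl)
      throws Exception {
    Exception last = null;
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      HiveMetaStoreClient client = null;
      try {
        // Open a fresh connection per attempt so a half-dead socket
        // (connection reset / read timed out) is never reused.
        client = new HiveMetaStoreClient(conf);
        return client.getTable(db, tbl);
      } catch (Exception e) {
        last = e;
        Thread.sleep(BACKOFF_MS * attempt); // linear backoff between attempts
      } finally {
        if (client != null) {
          client.close();
        }
      }
    }
    throw last;
  }
}
</code>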