Could we see the master log from around the same time the region server got a session expired?
Thx, J-D On Thu, Nov 12, 2009 at 10:54 AM, Zhenyu Zhong <[email protected]> wrote: > Sorry, there was a typo in my email. But it is 10 minutes that I set in the > hbase-site.xml. > > Please see the link. > http://pastebin.com/m564c51bc > > Very much appreciated. > zhenyu > > On Thu, Nov 12, 2009 at 1:42 PM, Jean-Daniel Cryans > <[email protected]>wrote: > >> Usually if it was 10 minutes we would see sleepers complaining they >> slept for way too long. In the logs I can see a 1 minute pause but not >> 10 minutes because of the ZK debug. Are you sure you don't have a >> typo? Like what you just wrote is not valid, you wrote >> zookeeper.session.timout instead of zookeeper.session.timeout. Are you >> sure it's distributed over the cluster? Could you pastebin >> hbase-site.xml? >> >> Thx, >> >> J-D >> >> On Thu, Nov 12, 2009 at 10:28 AM, Zhenyu Zhong <[email protected]> >> wrote: >> > J-D, >> > >> > Yes, that is weird. But it is what I have in the log. Probably the log is >> > flushed by different threads, and one thread is blocked for a while? >> > >> > I set the zookeeper.session.timout=600000, which is 10 minutes. However, >> we >> > still see this problem. >> > >> > Does GC pause for such a long time period? >> > I will try different GC options to see the differences. >> > >> > thanks >> > zhenyu >> > >> > >> > On Thu, Nov 12, 2009 at 1:10 PM, Jean-Daniel Cryans <[email protected] >> >wrote: >> > >> >> I see something weird >> >> >> >> 2009-11-12 15:04:36,370 WARN >> >> org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1 >> >> org.apache.hadoop.hbase.Leases$LeaseStillHeldException >> >> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native >> >> Method) >> >> at >> >> >> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) >> >> at >> >> >> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) >> >> ... >> >> 2009-11-12 15:03:52,682 DEBUG org.apache.zookeeper.ClientCnxn: Got >> >> ping response for sessionid:0x224e55436ad0004 after 0ms >> >> 2009-11-12 15:04:52,480 WARN org.apache.zookeeper.ClientCnxn: >> >> Exception closing session 0x24e55436a0007d to >> >> sun.nio.ch.selectionkeyi...@136f6ed1 >> >> >> >> Goes back in time? I does look better tho. Did you say you set the >> >> zookeeper session timeout higher in a previous thread? I'd be >> >> interested to see if a higher (2-3 minutes) value would solve the >> >> problem until java 7. >> >> >> >> The only thing that happens after that LeaseStillHeldException is that >> >> for some reason (probably GC) the region server paused. This caused >> >> the session to timeout and should be why the RS isn't connected to the >> >> ZK quorum member. >> >> >> >> BTW we worked with the Zookeeper dev team to have more meaningful >> >> exceptions and it's already committed in their trunk. >> >> >> >> J-D >> >> >> >> On Thu, Nov 12, 2009 at 8:40 AM, Zhenyu Zhong <[email protected]> >> >> wrote: >> >> > Follow up on the regionserver disconnection: >> >> > >> >> > I made a few changes to the cluster setup. >> >> > 1. split the regionserver and zookeeper quorum servers, such that >> HRegion >> >> > doesn't run on the same box as HQuorumPeer. I assume that there should >> be >> >> no >> >> > disk io contention between the regionserver and zookeeper. >> >> > Now we have 14 regionservers, and 5 zookeeper quorum servers. >> >> > >> >> > Though I experienced 2 regionserver disconnection this morning, it >> looks >> >> > better from the regionserver log. (please see the following log) >> >> > >> >> > >> >> > http://pastebin.com/m496dbfae >> >> > >> >> > >> >> > I checked diskIO for one of the regionserver(192.168.100.116) during >> the >> >> > time it disconnected. >> >> > >> >> > Time: 03:04:58 AM >> >> > Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn >> >> > sda 105.31 5458.83 19503.64 9043873239 >> 32312473676 >> >> > sda1 2.90 6.64 99.25 10993934 164433464 >> >> > sda2 1.72 23.76 30.25 39365817 50124008 >> >> > sda3 0.30 0.38 3.58 624930 5923000 >> >> > sda4 100.39 5428.06 19370.56 8992888270 >> 32091993204 >> >> > >> >> > >> >> > I also checked the zookeeper quorum server that the regionserver tried >> to >> >> > connect according to the log. However, I don't see 192.168.100.116 in >> the >> >> > client list of the zookeeper quorum server that the regionserver tried >> to >> >> > connect. >> >> > Would that be a problem? >> >> > >> >> > Thu Nov 12 15:04:52 UTC 2009 >> >> > Zookeeper version: 3.2.1-808558, built on 08/27/2009 18:48 GMT >> >> > Clients: >> >> > /192.168.100.127:43045[1](queued=0,recved=26,sent=0) >> >> > /192.168.100.131:39091[1](queued=0,recved=964,sent=0) >> >> > /192.168.100.124:35961[1](queued=0,recved=3266,sent=0) >> >> > /192.168.100.123:47935[1](queued=0,recved=233,sent=0) >> >> > /192.168.100.125:46931[1](queued=0,recved=2,sent=0) >> >> > /192.168.100.118:54924[1](queued=0,recved=295,sent=0) >> >> > /192.168.100.118:41390[1](queued=0,recved=2290,sent=0) >> >> > /192.168.100.136:42243[1](queued=0,recved=0,sent=0) >> >> > >> >> > Latency min/avg/max: 0/17/6333 >> >> > Received: 47111 >> >> > Sent: 0 >> >> > Outstanding: 0 >> >> > Zxid: 0x77000083f4 >> >> > Mode: leader >> >> > Node count: 23 >> >> > >> >> > Best, >> >> > zhenyu >> >> > >> >> > >> >> > On Wed, Nov 11, 2009 at 3:58 PM, Zhenyu Zhong < >> [email protected] >> >> >wrote: >> >> > >> >> >> Stack >> >> >> >> >> >> I am very appreciated for your comments. >> >> >> I will use the zookeeper monitoring script on my cluster and let it >> run >> >> >> overnight to see the result. >> >> >> I will follow up the thread when I get anything. >> >> >> >> >> >> thanks >> >> >> zhenyu >> >> >> >> >> >> >> >> >> On Wed, Nov 11, 2009 at 3:52 PM, stack <[email protected]> wrote: >> >> >> >> >> >>> I see these in your log too: >> >> >>> >> >> >>> >> >> >>> 1. 2009-11-11 04:27:19,018 DEBUG org.apache.zookeeper.ClientCnxn: >> Got >> >> >>> ping response for sessionid:0x424dfd908c50009 after 4544ms >> >> >>> 2. 2009-11-11 04:27:19,018 DEBUG org.apache.zookeeper.ClientCnxn: >> Got >> >> >>> ping response for sessionid:0x24dfd90c810002 after 4548ms >> >> >>> 3. 2009-11-11 04:27:43,960 DEBUG org.apache.zookeeper.ClientCnxn: >> Got >> >> >>> ping response for sessionid:0x424dfd908c50009 after 9030ms >> >> >>> 4. 2009-11-11 04:27:43,960 DEBUG org.apache.zookeeper.ClientCnxn: >> Got >> >> >>> ping response for sessionid:0x24dfd90c810002 after 9030ms >> >> >>> 5. 2009-11-11 04:27:48,830 DEBUG org.apache.zookeeper.ClientCnxn: >> Got >> >> >>> ping response for sessionid:0x424dfd908c50009 after 0ms >> >> >>> 6. 2009-11-11 04:27:48,830 DEBUG org.apache.zookeeper.ClientCnxn: >> Got >> >> >>> ping response for sessionid:0x24dfd90c810002 after 0ms >> >> >>> >> >> >>> Which would indicate lag talking to zk too. >> >> >>> >> >> >>> Appending HLog also seems to get steadily worse. >> >> >>> >> >> >>> I also see these: >> >> >>> >> >> >>> >> >> >>> 1. 2009-11-11 04:27:01,233 WARN org.apache.zookeeper.ClientCnxn: >> >> >>> Exception closing session 0x24dfd90c810002 to >> >> >>> sun.nio.ch.selectionkeyi...@5e16a7e3 >> >> >>> 2. java.io.IOException: Read error rc = -1 >> >> >>> java.nio.DirectByteBuffer[pos=0 lim=4 cap=4] >> >> >>> 3. at >> >> >>> >> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701) >> >> >>> 4. at >> >> >>> >> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945) >> >> >>> >> >> >>> >> >> >>> Were all 5 quorum members participating? Check by doing "echo >> stat|nc >> >> >>> QUORUM_HOST port" for each member of quorum.... the below might >> help: >> >> >>> >> >> >>> members="1,2,3,4,5" >> >> >>> while true; do date; for i in $members; do echo $i; echo stat | nc >> $i >> >> >>> 2181; >> >> >>> done; sleep 10; done >> >> >>> >> >> >>> We need to improve our zk_dump. It just reports quorum members... >> not >> >> if >> >> >>> they are actually participating or not. The above will tell you if >> >> dead >> >> >>> or >> >> >>> alive. >> >> >>> >> >> >>> Please report back after checking above and following the >> Jean-Daniel >> >> >>> prescription. >> >> >>> >> >> >>> Thanks, >> >> >>> St.Ack >> >> >>> >> >> >>> >> >> >>> >> >> >>> On Wed, Nov 11, 2009 at 12:29 PM, Zhenyu Zhong < >> >> [email protected] >> >> >>> >wrote: >> >> >>> >> >> >>> > Thanks a lot for the help. >> >> >>> > >> >> >>> > I understand that the change of disk configurations should be done >> >> >>> > progressively such that HDFS can balance and recover from a disk >> >> going >> >> >>> > away. >> >> >>> > >> >> >>> > The data we are trying to import into HBase is about 200million >> rows, >> >> >>> each >> >> >>> > row is about hundreds of bytes. Basically, we read in a text file >> >> from >> >> >>> > HDFS, >> >> >>> > then for each row in the text file, we fetch data from HBase >> table, >> >> we >> >> >>> > compare the column value and update the latest value back into >> HBase. >> >> I >> >> >>> was >> >> >>> > suggested to use CheckAndPut. Anyway this is the common usecase >> on >> >> my >> >> >>> > side. >> >> >>> > >> >> >>> > >> >> >>> > Many thanks >> >> >>> > zhenyu >> >> >>> > >> >> >>> > >> >> >>> > >> >> >>> > On Wed, Nov 11, 2009 at 3:03 PM, Jean-Daniel Cryans < >> >> >>> [email protected] >> >> >>> > >wrote: >> >> >>> > >> >> >>> > > 6 disks, 19 machines and you see those kind of issues? I'm quite >> >> >>> > > surprised and I want to learn more about what you trying to to. >> >> >>> > > >> >> >>> > > But first, if hbase is managing ZK for you then you can change >> the >> >> >>> > > value of hbase.zookeeper.property.dataDir so that it points to >> its >> >> own >> >> >>> > > mount. Make sure you also change dfs.data.dir in >> conf/hdfs-site.xml >> >> >>> > > (hadoop stuff) so that it stops pointing there (you should have >> 6 >> >> >>> > > values there at the moment if you configured it correctly). >> Doing >> >> this >> >> >>> > > could result in data loss if you do it all at once so do the >> change >> >> >>> > > progressively. This also means that you may have to keep a >> seperate >> >> >>> > > HDFS configuration for those particular nodes. Please ask more >> >> >>> > > questions if it's not very clear. >> >> >>> > > >> >> >>> > > WRT to dedicating nodes to Zookeeeper, it makes sense only if >> you >> >> have >> >> >>> > > a lot of nodes but in your case, unless there's also CPU >> >> contention, >> >> >>> > > you're good with only a dedicated disk. >> >> >>> > > >> >> >>> > > Now for what you are trying to do, is this a massive import that >> >> goes >> >> >>> > > over many hours? How big is each row? I'd love to get a lot of >> >> details >> >> >>> > > to help you optimize your cluster for your particular problem. >> >> >>> > > >> >> >>> > > Thx, >> >> >>> > > >> >> >>> > > J-D >> >> >>> > > >> >> >>> > > On Wed, Nov 11, 2009 at 11:51 AM, Zhenyu Zhong < >> >> >>> [email protected]> >> >> >>> > > wrote: >> >> >>> > > > J-D, >> >> >>> > > > >> >> >>> > > > Thanks a lot for the insightful comments. >> >> >>> > > > We are running 5 zookeeper quorums now. On these 5 quorum >> >> servers, >> >> >>> we >> >> >>> > > also >> >> >>> > > > run datanode, tasktracker and regionserver. >> >> >>> > > > Now we have 6 disks per machine. >> >> >>> > > > If it is the disk IO saturation, is there anyway to configure >> >> >>> zookeeper >> >> >>> > > to >> >> >>> > > > write to a seperate disk which is not the same one used by >> HDFS? >> >> >>> > > > Or shall we run zookeeper quorums on different nodes other >> than >> >> >>> HDFS? >> >> >>> > > > >> >> >>> > > > Are HLogs generated by Regionservers? Would it alleviate the >> >> issue >> >> >>> if >> >> >>> > we >> >> >>> > > use >> >> >>> > > > the first 14 for regionserver, and the rest of 5 for >> >> zookeeper?(In >> >> >>> that >> >> >>> > > > sense, the node running zookeeper quorum won't run >> regionserver) >> >> >>> > > > >> >> >>> > > > I am very appreciated for your comments. >> >> >>> > > > >> >> >>> > > > Best, >> >> >>> > > > zhenyu >> >> >>> > > > >> >> >>> > > > >> >> >>> > > > >> >> >>> > > > On Wed, Nov 11, 2009 at 2:30 PM, Jean-Daniel Cryans < >> >> >>> > [email protected] >> >> >>> > > >wrote: >> >> >>> > > > >> >> >>> > > >> The problem seems to be that the IO is saturated. When you >> see >> >> >>> this: >> >> >>> > > >> >> >> >>> > > >> # >> >> >>> > > >> 2009-11-11 04:28:10,308 WARN >> >> >>> > > >> org.apache.hadoop.hbase.regionserver.HLog: IPC Server handler >> 9 >> >> on >> >> >>> > > >> 60021 took 6819ms appending an edit to hlog; editcount=370 >> >> >>> > > >> # >> >> >>> > > >> 2009-11-11 04:28:10,308 WARN >> >> >>> > > >> org.apache.hadoop.hbase.regionserver.HLog: IPC Server handler >> 2 >> >> on >> >> >>> > > >> 60021 took 4412ms appending an edit to hlog; editcount=371 >> >> >>> > > >> >> >> >>> > > >> It's a sign that HDFS is slow. I guess that the zookeeper >> quorum >> >> >>> > > >> members are on those same nodes using the same disks? If so, >> >> >>> there's >> >> >>> > > >> the problem because Zookeeper needs to be able to commit >> >> >>> transactions >> >> >>> > > >> very fast and if your disks are already hogged then they have >> to >> >> >>> wait >> >> >>> > > >> and you can get disconnections. Do you still have 19 quorum >> >> >>> members? >> >> >>> > > >> How many disks do you have per machine? >> >> >>> > > >> >> >> >>> > > >> Thx >> >> >>> > > >> >> >> >>> > > >> J-D >> >> >>> > > >> >> >> >>> > > >> On Wed, Nov 11, 2009 at 11:22 AM, Zhenyu Zhong < >> >> >>> > [email protected] >> >> >>> > > > >> >> >>> > > >> wrote: >> >> >>> > > >> > Please take a look at the following regionserver log. >> >> >>> > > >> > The error started to come out from 4:24am UTC to 4:29am >> UTC. >> >> >>> > > >> > >> >> >>> > > >> > http://pastebin.com/m3717d064 >> >> >>> > > >> > >> >> >>> > > >> > Thanks in advance. >> >> >>> > > >> > >> >> >>> > > >> > Best, >> >> >>> > > >> > zhenyu >> >> >>> > > >> > >> >> >>> > > >> > >> >> >>> > > >> > On Wed, Nov 11, 2009 at 2:01 PM, Jean-Daniel Cryans < >> >> >>> > > [email protected] >> >> >>> > > >> >wrote: >> >> >>> > > >> > >> >> >>> > > >> >> Could you paste a bigger log trace (interested to see at >> >> least >> >> >>> all >> >> >>> > > 300 >> >> >>> > > >> >> lines before that) in a pastebin? >> >> >>> > > >> >> >> >> >>> > > >> >> Thx, >> >> >>> > > >> >> >> >> >>> > > >> >> J-D >> >> >>> > > >> >> >> >> >>> > > >> >> On Wed, Nov 11, 2009 at 10:56 AM, Zhenyu Zhong < >> >> >>> > > [email protected] >> >> >>> > > >> > >> >> >>> > > >> >> wrote: >> >> >>> > > >> >> > J-D, >> >> >>> > > >> >> > >> >> >>> > > >> >> > Thanks for the reply. >> >> >>> > > >> >> > >> >> >>> > > >> >> > The problem I have is one or more of the regionservers >> >> >>> disconnect >> >> >>> > > from >> >> >>> > > >> >> the >> >> >>> > > >> >> > cluster. >> >> >>> > > >> >> > >> >> >>> > > >> >> > I checked the resource usage including memory Heap usage >> >> for >> >> >>> > > >> >> regionserver, >> >> >>> > > >> >> > and found nothing suspicious. The regionserver heap >> usage >> >> is >> >> >>> way >> >> >>> > > below >> >> >>> > > >> >> the >> >> >>> > > >> >> > maxHeap defined in jvm. >> >> >>> > > >> >> > >> >> >>> > > >> >> > So I checked zookeeper log, then I found the following >> >> >>> > > >> >> > BufferUnderflowException, eventually it leads to >> >> >>> SocketException: >> >> >>> > > >> >> Transport >> >> >>> > > >> >> > endpoint is not connected. >> >> >>> > > >> >> > >> >> >>> > > >> >> > >> >> >>> > > >> >> > I also checked the regionserver log. It looks like the >> >> >>> > regionserver >> >> >>> > > >> tried >> >> >>> > > >> >> to >> >> >>> > > >> >> > connect to other zookeeper quorum server. It connects >> >> >>> > successfully, >> >> >>> > > >> but >> >> >>> > > >> >> then >> >> >>> > > >> >> > it complained about SocketException: transport endpoint >> is >> >> not >> >> >>> > > >> connected. >> >> >>> > > >> >> > >> >> >>> > > >> >> > So both zookeeper log and regionserver log told me that >> the >> >> >>> > > transport >> >> >>> > > >> >> > endpoint is not connected. I don't know why. >> >> >>> > > >> >> > >> >> >>> > > >> >> > I would be very appreciated if you could point me to the >> >> right >> >> >>> > > >> direction. >> >> >>> > > >> >> > >> >> >>> > > >> >> > Best, >> >> >>> > > >> >> > >> >> >>> > > >> >> > >> >> >>> > > >> >> > Regionserver log: >> >> >>> > > >> >> > 2009-11-11 04:24:53,718 INFO >> >> org.apache.zookeeper.ClientCnxn: >> >> >>> > > >> Attempting >> >> >>> > > >> >> > connection to server superpyxis0018/ >> 192.168.100.132:2181 >> >> >>> > > >> >> > 2009-11-11 04:24:53,718 INFO >> >> org.apache.zookeeper.ClientCnxn: >> >> >>> > > Priming >> >> >>> > > >> >> > connection to java.nio.channels.SocketChannel[connected >> >> >>> local=/ >> >> >>> > > >> >> > 192.168.100.131:43065 remote=superpyxis0018/ >> >> >>> 192.168.100.132:2181 >> >> >>> > ] >> >> >>> > > >> >> > 2009-11-11 04:24:53,718 INFO >> >> org.apache.zookeeper.ClientCnxn: >> >> >>> > > Server >> >> >>> > > >> >> > connection successful >> >> >>> > > >> >> > 2009-11-11 04:24:53,768 INFO >> >> >>> > > >> >> > org.apache.hadoop.hbase.regionserver.HRegionServer: Got >> >> >>> ZooKeeper >> >> >>> > > >> event, >> >> >>> > > >> >> > state: Disconnected, type: None, path: null >> >> >>> > > >> >> > 2009-11-11 04:24:54,118 INFO >> >> org.apache.zookeeper.ClientCnxn: >> >> >>> > > >> Attempting >> >> >>> > > >> >> > connection to server superpyxis0017/ >> 192.168.100.131:2181 >> >> >>> > > >> >> > 2009-11-11 04:24:54,118 INFO >> >> org.apache.zookeeper.ClientCnxn: >> >> >>> > > Priming >> >> >>> > > >> >> > connection to java.nio.channels.SocketChannel[connected >> >> >>> local=/ >> >> >>> > > >> >> > 192.168.100.131:46903 remote=superpyxis0017/ >> >> >>> 192.168.100.131:2181 >> >> >>> > ] >> >> >>> > > >> >> > 2009-11-11 04:24:54,118 INFO >> >> org.apache.zookeeper.ClientCnxn: >> >> >>> > > Server >> >> >>> > > >> >> > connection successful >> >> >>> > > >> >> > 2009-11-11 04:24:54,118 WARN >> >> org.apache.zookeeper.ClientCnxn: >> >> >>> > > >> Exception >> >> >>> > > >> >> > closing session 0x24dfd90c810002 to >> >> >>> > > >> sun.nio.ch.selectionkeyi...@2dd78ccf >> >> >>> > > >> >> > java.io.IOException: Read error rc = -1 >> >> >>> > > >> java.nio.DirectByteBuffer[pos=0 >> >> >>> > > >> >> > lim=4 cap=4] >> >> >>> > > >> >> > at >> >> >>> > > >> >> > >> >> >>> > > >> >> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701) >> >> >>> > > >> >> > at >> >> >>> > > >> >> > >> >> >>> > >> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945) >> >> >>> > > >> >> > 2009-11-11 04:24:54,118 WARN >> >> org.apache.zookeeper.ClientCnxn: >> >> >>> > > Ignoring >> >> >>> > > >> >> > exception during shutdown input >> >> >>> > > >> >> > java.net.SocketException: Transport endpoint is not >> >> connected >> >> >>> > > >> >> > at sun.nio.ch.SocketChannelImpl.shutdown(Native >> >> Method) >> >> >>> > > >> >> > at >> >> >>> > > >> >> > >> >> >>> > > >> >> sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640) >> >> >>> > > >> >> > at >> >> >>> > > >> >> sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360) >> >> >>> > > >> >> > at >> >> >>> > > >> >> > >> >> >>> > > >> >> >> >>> > >> >> org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999) >> >> >>> > > >> >> > at >> >> >>> > > >> >> > >> >> >>> > >> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970) >> >> >>> > > >> >> > >> >> >>> > > >> >> > >> >> >>> > > >> >> > >> >> >>> > > >> >> > >> >> >>> > > >> >> > >> >> >>> > > >> >> > On Wed, Nov 11, 2009 at 1:40 PM, Jean-Daniel Cryans < >> >> >>> > > >> [email protected] >> >> >>> > > >> >> >wrote: >> >> >>> > > >> >> > >> >> >>> > > >> >> >> I'm really not sure exactly what the situation is with >> the >> >> >>> > > >> >> >> java.nio.BufferUnderflowException, but what do you mean >> by >> >> >>> > > >> >> >> "regionserver disconnection"? Could you give more >> details? >> >> >>> > > >> >> >> >> >> >>> > > >> >> >> Thx, >> >> >>> > > >> >> >> >> >> >>> > > >> >> >> J-D >> >> >>> > > >> >> >> >> >> >>> > > >> >> >> On Wed, Nov 11, 2009 at 10:34 AM, Zhenyu Zhong < >> >> >>> > > >> [email protected] >> >> >>> > > >> >> > >> >> >>> > > >> >> >> wrote: >> >> >>> > > >> >> >> > I searched around and found HBASE-1645 "HBase could >> be a >> >> >>> > little >> >> >>> > > >> more >> >> >>> > > >> >> >> dainty >> >> >>> > > >> >> >> > when zk is pulled out from under it" >> >> >>> > > >> >> >> > >> >> >>> > > >> >> >> > I am wondering whether my problem is related to this >> >> >>> > HBase-1645. >> >> >>> > > Or >> >> >>> > > >> >> >> probably >> >> >>> > > >> >> >> > has nothing to do with it? >> >> >>> > > >> >> >> > >> >> >>> > > >> >> >> > thanks >> >> >>> > > >> >> >> > >> >> >>> > > >> >> >> > >> >> >>> > > >> >> >> > On Wed, Nov 11, 2009 at 12:02 PM, Zhenyu Zhong < >> >> >>> > > >> >> [email protected] >> >> >>> > > >> >> >> >wrote: >> >> >>> > > >> >> >> > >> >> >>> > > >> >> >> >> Hi, >> >> >>> > > >> >> >> >> >> >> >>> > > >> >> >> >> I have been monitoring the system resources such as >> >> memory >> >> >>> > Heap >> >> >>> > > >> usage >> >> >>> > > >> >> >> for >> >> >>> > > >> >> >> >> Regionserver, load average, network bandwidth, etc. >> >> >>> > > >> >> >> >> >> >> >>> > > >> >> >> >> I experienced a disconnection last night. When I >> >> checked >> >> >>> the >> >> >>> > > >> memory >> >> >>> > > >> >> Heap >> >> >>> > > >> >> >> >> for Regionserver, it was 2336 which is way below the >> >> max >> >> >>> Heap >> >> >>> > > >> 4096. I >> >> >>> > > >> >> >> don't >> >> >>> > > >> >> >> >> think it is Regionserver problem. The requests that >> the >> >> >>> > > >> regionserver >> >> >>> > > >> >> got >> >> >>> > > >> >> >> was >> >> >>> > > >> >> >> >> about 126 which doesn't seem to be high. The load >> >> average >> >> >>> is >> >> >>> > > kind >> >> >>> > > >> of >> >> >>> > > >> >> >> high >> >> >>> > > >> >> >> >> >100. (but I saw disconnections while the load >> average >> >> is >> >> >>> > > around >> >> >>> > > >> 30). >> >> >>> > > >> >> >> The >> >> >>> > > >> >> >> >> bandwidth is not a problem. >> >> >>> > > >> >> >> >> >> >> >>> > > >> >> >> >> However, I found error messages from the zookeeper. >> >> Please >> >> >>> > see >> >> >>> > > the >> >> >>> > > >> >> >> >> following error messages. There are some >> >> >>> > > >> >> >> >> BufferUnderflowException error, IOException >> error,etc. >> >> >>> > > >> >> >> >> May I ask what might be the causes that generate >> these >> >> >>> error >> >> >>> > > >> >> messages? >> >> >>> > > >> >> >> >> >> >> >>> > > >> >> >> >> Many thanks >> >> >>> > > >> >> >> >> zhenyu >> >> >>> > > >> >> >> >> >> >> >>> > > >> >> >> >> >> >> >>> > > >> >> >> >> 2009-11-11 04:29:00,810 WARN >> >> >>> > > org.apache.zookeeper.server.Request: >> >> >>> > > >> >> >> Ignoring >> >> >>> > > >> >> >> >> exception during toString >> >> >>> > > >> >> >> >> java.nio.BufferUnderflowException >> >> >>> > > >> >> >> >> at >> >> >>> > java.nio.HeapByteBuffer.get(HeapByteBuffer.java:127) >> >> >>> > > >> >> >> >> at >> java.nio.ByteBuffer.get(ByteBuffer.java:675) >> >> >>> > > >> >> >> >> at >> >> >>> > > >> >> >> >> >> >>> org.apache.zookeeper.server.Request.toString(Request.java:199) >> >> >>> > > >> >> >> >> at >> java.lang.String.valueOf(String.java:2827) >> >> >>> > > >> >> >> >> at >> >> >>> > > java.lang.StringBuilder.append(StringBuilder.java:115) >> >> >>> > > >> >> >> >> at >> >> >>> > > >> >> >> >> >> >> >>> > > >> >> >> >> >> >>> > > >> >> >> >> >>> > > >> >> >> >>> > > >> >> >>> > >> >> >>> >> >> >> org.apache.zookeeper.server.quorum.CommitProcessor.processRequest(CommitProcessor.java:167) >> >> >>> > > >> >> >> >> at >> >> >>> > > >> >> >> >> >> >> >>> > > >> >> >> >> >> >>> > > >> >> >> >> >>> > > >> >> >> >>> > > >> >> >>> > >> >> >>> >> >> >> org.apache.zookeeper.server.quorum.FollowerRequestProcessor.run(FollowerRequestProcessor.java:68) >> >> >>> > > >> >> >> >> 2009-11-11 04:29:00,810 WARN >> >> >>> > > >> >> org.apache.zookeeper.server.NIOServerCnxn: >> >> >>> > > >> >> >> >> Exception causing close of session 0x424dfd908c50009 >> >> >>> > > >> >> >> >> due to java.io.IOException: Read error >> >> >>> > > >> >> >> >> 2009-11-11 04:29:00,810 DEBUG >> >> >>> > > >> >> org.apache.zookeeper.server.NIOServerCnxn: >> >> >>> > > >> >> >> >> IOException stack trace >> >> >>> > > >> >> >> >> java.io.IOException: Read error >> >> >>> > > >> >> >> >> at >> >> >>> > > >> >> >> >> >> >> >>> > > >> >> >> >> >>> > > >> >> org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:372) >> >> >>> > > >> >> >> >> at >> >> >>> > > >> >> >> >> >> >> >>> > > >> >> >> >> >> >>> > > >> >> >> >> >>> > > >> >> >> >>> > > >> >> >>> > >> >> >>> >> >> >> org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:221) >> >> >>> > > >> >> >> >> 2009-11-11 04:29:00,810 DEBUG >> >> >>> > > >> >> >> >> org.apache.zookeeper.server.quorum.CommitProcessor: >> >> >>> > Processing >> >> >>> > > >> >> request:: >> >> >>> > > >> >> >> >> sessionid:0x424dfd908c50009 type:setWatches >> >> >>> > > >> cxid:0xfffffffffffffff8 >> >> >>> > > >> >> >> >> zxid:0xfffffffffffffffe txntype:unknown n/a >> >> >>> > > >> >> >> >> 2009-11-11 04:29:00,810 WARN >> >> >>> > > org.apache.zookeeper.server.Request: >> >> >>> > > >> >> >> Ignoring >> >> >>> > > >> >> >> >> exception during toString >> >> >>> > > >> >> >> >> java.nio.BufferUnderflowException >> >> >>> > > >> >> >> >> at >> >> >>> > java.nio.HeapByteBuffer.get(HeapByteBuffer.java:127) >> >> >>> > > >> >> >> >> at >> java.nio.ByteBuffer.get(ByteBuffer.java:675) >> >> >>> > > >> >> >> >> at >> >> >>> > > >> >> >> >> >> >>> org.apache.zookeeper.server.Request.toString(Request.java:199) >> >> >>> > > >> >> >> >> at >> java.lang.String.valueOf(String.java:2827) >> >> >>> > > >> >> >> >> at >> >> >>> > > java.lang.StringBuilder.append(StringBuilder.java:115) >> >> >>> > > >> >> >> >> at >> >> >>> > > >> >> >> >> >> >> >>> > > >> >> >> >> >> >>> > > >> >> >> >> >>> > > >> >> >> >>> > > >> >> >>> > >> >> >>> >> >> >> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:74) >> >> >>> > > >> >> >> >> at >> >> >>> > > >> >> >> >> >> >> >>> > > >> >> >> >> >> >>> > > >> >> >> >> >>> > > >> >> >> >>> > > >> >> >>> > >> >> >>> >> >> >> org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73) >> >> >>> > > >> >> >> >> 2009-11-11 04:29:00,810 DEBUG >> >> >>> > > >> >> >> >> org.apache.zookeeper.server.FinalRequestProcessor: >> >> >>> Processing >> >> >>> > > >> >> request:: >> >> >>> > > >> >> >> >> sessionid:0x424dfd908c50009 type:setWatches >> >> >>> > > >> cxid:0xfffffffffffffff8 >> >> >>> > > >> >> >> >> zxid:0xfffffffffffffffe txntype:unknown n/a >> >> >>> > > >> >> >> >> 2009-11-11 04:29:00,810 WARN >> >> >>> > > org.apache.zookeeper.server.Request: >> >> >>> > > >> >> >> Ignoring >> >> >>> > > >> >> >> >> exception during toString >> >> >>> > > >> >> >> >> java.nio.BufferUnderflowException >> >> >>> > > >> >> >> >> at >> >> >>> > java.nio.HeapByteBuffer.get(HeapByteBuffer.java:127) >> >> >>> > > >> >> >> >> at >> java.nio.ByteBuffer.get(ByteBuffer.java:675) >> >> >>> > > >> >> >> >> at >> >> >>> > > >> >> >> >> >> >>> org.apache.zookeeper.server.Request.toString(Request.java:199) >> >> >>> > > >> >> >> >> at >> >> >>> > > >> >> >> >> >> >> >>> > > >> >> >> >>> >> org.apache.log4j.or.DefaultRenderer.doRender(DefaultRenderer.java:36) >> >> >>> > > >> >> >> >> at >> >> >>> > > >> >> >> >> >> >> >>> > > >> org.apache.log4j.or.RendererMap.findAndRender(RendererMap.java:80) >> >> >>> > > >> >> >> >> at >> >> >>> > > >> >> >> >> >> >> >>> > > >> >> >> >> >> >>> > > >> >> >> >> >>> > > >> >> >> >>> > > >> >> >>> > >> >> >>> >> >> >> org.apache.log4j.spi.LoggingEvent.getRenderedMessage(LoggingEvent.java:362) >> >> >>> > > >> >> >> >> at >> >> >>> > > >> >> >> >> >> >> >>> > > >> >> >> >> >> >>> > > >> >> >> >> >>> > > >> >> >> >>> > > >> >> >>> > >> >> >>> >> >> >> org.apache.log4j.helpers.PatternParser$BasicPatternConverter.convert(PatternParser.java:403) >> >> >>> > > >> >> >> >> at >> >> >>> > > >> >> >> >> >> >> >>> > > >> >> >> >> >> >>> > > >> >> >> >> >>> > > >> >> >> >>> > > >> >> >>> > >> >> >>> >> >> >> org.apache.log4j.helpers.PatternConverter.format(PatternConverter.java:65) >> >> >>> > > >> >> >> >> at >> >> >>> > > >> >> >> org.apache.log4j.PatternLayout.format(PatternLayout.java:502) >> >> >>> > > >> >> >> >> at >> >> >>> > > >> >> >> >> >> >> >>> > > >> org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:302) >> >> >>> > > >> >> >> >> at >> >> >>> > > >> >> >> >> >> >> >>> > > >> >> >> >> >> >>> > > >> >> >> >> >>> > > >> >> >> >>> > > >> >> >>> > >> >> >>> >> >> >> org.apache.log4j.DailyRollingFileAppender.subAppend(DailyRollingFileAppender.java:359) >> >> >>> > > >> >> >> >> at >> >> >>> > > >> >> >> >> >> >>> org.apache.log4j.WriterAppender.append(WriterAppender.java:160) >> >> >>> > > >> >> >> >> at >> >> >>> > > >> >> >> >> >> >> >>> > > >> >> >> >>> >> org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251) >> >> >>> > > >> >> >> >> at >> >> >>> > > >> >> >> >> >> >> >>> > > >> >> >> >> >> >>> > > >> >> >> >> >>> > > >> >> >> >>> > > >> >> >>> > >> >> >>> >> >> >> org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66) >> >> >>> > > >> >> >> >> at >> >> >>> > > >> org.apache.log4j.Category.callAppenders(Category.java:206) >> >> >>> > > >> >> >> >> at >> >> >>> > > org.apache.log4j.Category.forcedLog(Category.java:391) >> >> >>> > > >> >> >> >> at >> >> >>> org.apache.log4j.Category.debug(Category.java:260) >> >> >>> > > >> >> >> >> at >> >> >>> > > >> >> >> >> >> >> >>> > > >> >> >> >> >> >>> > > >> >> >> >> >>> > > >> >> >> >>> > > >> >> >>> > >> >> >>> >> >> >> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:147) >> >> >>> > > >> >> >> >> at >> >> >>> > > >> >> >> >> >> >> >>> > > >> >> >> >> >> >>> > > >> >> >> >> >>> > > >> >> >> >>> > > >> >> >>> > >> >> >>> >> >> >> org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73) >> >> >>> > > >> >> >> >> 2009-11-11 04:29:00,810 WARN >> >> >>> > > org.apache.zookeeper.server.Request: >> >> >>> > > >> >> >> Ignoring >> >> >>> > > >> >> >> >> exception during toString >> >> >>> > > >> >> >> >> sessionid:0x424dfd908c50009 type:setWatches >> >> >>> > > >> cxid:0xfffffffffffffff8 >> >> >>> > > >> >> >> >> zxid:0xfffffffffffffffe txntype:unknown n/a >> >> >>> > > >> >> >> >> 2009-11-11 04:29:00,810 INFO >> >> >>> > > >> >> org.apache.zookeeper.server.NIOServerCnxn: >> >> >>> > > >> >> >> >> closing session:0x424dfd908c50009 NIOServerCnxn: >> >> >>> > > >> >> >> >> java.nio.channels.SocketChannel[connected local=/ >> >> >>> > > >> >> 192.168.100.131:2181 >> >> >>> > > >> >> >> remote=/ >> >> >>> > > >> >> >> >> 192.168.100.131:35500] >> >> >>> > > >> >> >> >> 2009-11-11 04:29:00,810 DEBUG >> >> >>> > > >> >> org.apache.zookeeper.server.NIOServerCnxn: >> >> >>> > > >> >> >> >> ignoring exception during input shutdown >> >> >>> > > >> >> >> >> java.net.SocketException: Transport endpoint is not >> >> >>> connected >> >> >>> > > >> >> >> >> at >> sun.nio.ch.SocketChannelImpl.shutdown(Native >> >> >>> > Method) >> >> >>> > > >> >> >> >> at >> >> >>> > > >> >> >> >> >> >> >>> > > >> >> >> >> >>> > > >> >> sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640) >> >> >>> > > >> >> >> >> at >> >> >>> > > >> >> >> >> >> >>> sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360) >> >> >>> > > >> >> >> >> at >> >> >>> > > >> >> >> >> >> >> >>> > > >> >> >> >> >>> > > >> >> >>> >> org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:848) >> >> >>> > > >> >> >> >> at >> >> >>> > > >> >> >> >> >> >> >>> > > >> >> >> >> >>> > > >> >> org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:498) >> >> >>> > > >> >> >> >> at >> >> >>> > > >> >> >> >> >> >> >>> > > >> >> >> >> >> >>> > > >> >> >> >> >>> > > >> >> >> >>> > > >> >> >>> > >> >> >>> >> >> >> org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:221) >> >> >>> > > >> >> >> >> >> >> >>> > > >> >> >> > >> >> >>> > > >> >> >> >> >> >>> > > >> >> > >> >> >>> > > >> >> >> >> >>> > > >> > >> >> >>> > > >> >> >> >>> > > > >> >> >>> > > >> >> >>> > >> >> >>> >> >> >> >> >> >> >> >> > >> >> >> > >> >
