Will do. thanks zhenyu
On Thu, Nov 12, 2009 at 3:33 PM, stack <[email protected]> wrote: > Enable GC logging too on this next run (see hbase-env.sh). Lets try and > get > to the bottom of whats going on. > Thanks, > St.Ack > > On Thu, Nov 12, 2009 at 12:29 PM, Zhenyu Zhong <[email protected] > >wrote: > > > I can switch the boxes that run zookeeper with the ones that run > > regionservers. > > I will see the results later. > > > > FYI. The node does have the 10 minutes zookeeper.session.timeout value in > > place. > > > > thanks > > zhenyu > > > > On Thu, Nov 12, 2009 at 3:21 PM, stack <[email protected]> wrote: > > > > > On Thu, Nov 12, 2009 at 11:50 AM, Zhenyu Zhong < > [email protected] > > > >wrote: > > > > > > > > > > > In my cluster, half of the cluster have 2 disks 400GB each per > machine, > > > and > > > > half of the cluster have 6 disks per machine. Maybe we should run > > > > zookeeper > > > > on the machines with 2 disks and RS on machines with 6 disks? > > > > > > > > That would make most sense only in the below, it looks like the RS > that > > > had > > > issue had 4 disks? > > > > > > > > > > > > > BTW, the 10 minutes zookeeper.session.timeout has been set during the > > > > experiment. > > > > > > > > > > And for sure this node had it in place? > > > St.Ack > > > > > > > > > > > > > > > > > thanks > > > > zhenyu > > > > > > > > On Thu, Nov 12, 2009 at 2:08 PM, stack <[email protected]> wrote: > > > > > > > > > On Thu, Nov 12, 2009 at 8:40 AM, Zhenyu Zhong < > > [email protected] > > > > > >wrote: > > > > > > > > > > > > > > > > Though I experienced 2 regionserver disconnection this morning, > it > > > > looks > > > > > > better from the regionserver log. (please see the following log) > > > > > > > > > > > > > > > > > > http://pastebin.com/m496dbfae > > > > > > > > > > > > > > > > > > I checked diskIO for one of the regionserver(192.168.100.116) > > during > > > > the > > > > > > time it disconnected. > > > > > > > > > > > > Time: 03:04:58 AM > > > > > > Device: tps Blk_read/s Blk_wrtn/s Blk_read > > > Blk_wrtn > > > > > > sda 105.31 5458.83 19503.64 9043873239 > > > 32312473676 > > > > > > sda1 2.90 6.64 99.25 10993934 > > > 164433464 > > > > > > sda2 1.72 23.76 30.25 39365817 > > > 50124008 > > > > > > sda3 0.30 0.38 3.58 624930 > > > 5923000 > > > > > > sda4 100.39 5428.06 19370.56 8992888270 > > > 32091993204 > > > > > > > > > > > > > > > > > Is this high for you? 20k blocks/second would seem to be high but > > its > > > > one > > > > > disk only and its not being shared by zk anymore so shouldn't > matter? > > > > > > > > > > > > > > > > I also checked the zookeeper quorum server that the regionserver > > > tried > > > > to > > > > > > connect according to the log. However, I don't see > 192.168.100.116 > > in > > > > the > > > > > > client list of the zookeeper quorum server that the regionserver > > > tried > > > > to > > > > > > connect. > > > > > > Would that be a problem? > > > > > > > > > > > > > > > > > Is that because the ephemeral node for the regionserver had > > evaporated? > > > > > Lost > > > > > its lease w/ zk by the time you went to look? > > > > > > > > > > > > > > > > > > > > > > > > > > Thu Nov 12 15:04:52 UTC 2009 > > > > > > Zookeeper version: 3.2.1-808558, built on 08/27/2009 18:48 GMT > > > > > > Clients: > > > > > > /192.168.100.127:43045[1](queued=0,recved=26,sent=0) > > > > > > /192.168.100.131:39091[1](queued=0,recved=964,sent=0) > > > > > > /192.168.100.124:35961[1](queued=0,recved=3266,sent=0) > > > > > > /192.168.100.123:47935[1](queued=0,recved=233,sent=0) > > > > > > /192.168.100.125:46931[1](queued=0,recved=2,sent=0) > > > > > > /192.168.100.118:54924[1](queued=0,recved=295,sent=0) > > > > > > /192.168.100.118:41390[1](queued=0,recved=2290,sent=0) > > > > > > /192.168.100.136:42243[1](queued=0,recved=0,sent=0) > > > > > > > > > > > > Latency min/avg/max: 0/17/6333 > > > > > > Received: 47111 > > > > > > Sent: 0 > > > > > > Outstanding: 0 > > > > > > Zxid: 0x77000083f4 > > > > > > Mode: leader > > > > > > Node count: 23 > > > > > > > > > > > > > > > > > > > > > That 6 second maximum latency is pretty bad but should be well > within > > > the > > > > > zk > > > > > session timeout. > > > > > > > > > > So, problem is likely on the zk client side of the session; i.e. in > > the > > > > RS. > > > > > You could enable GC logging as J-D suggested to see if you have any > > big > > > > > pauses, pauses > zk session timeout. > > > > > > > > > > When the RS went down, it didn't look too heavily loaded: > > > > > > > > > > > > > > > 1. 2009-11-12 15:04:52,830 INFO > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of > > metrics: > > > > > request=1.5166667, regions=322, stores=657, storefiles=631, > > > > > storefileIndexSize=61, memstoreSize=1472, usedHeap=2819, > > > maxHeap=4079, > > > > > blockCacheSize=658110960, blockCacheFree=197395984, > > > > blockCacheCount=9903, > > > > > blockCacheHitRatio=99 > > > > > > > > > > Its serving a few reads? The number of store files seems fine. > Not > > > too > > > > > much memory used. > > > > > > > > > > Looking at the logs, I see the Lease Still Held exception. This > > > happens > > > > > when the RS does its regular report to the master but the master > > thinks > > > > the > > > > > RS has since restarted. It'll think this probably because it > noticed > > > > that > > > > > the RS's znode in zk had gone away and it considered the RS dead. > > > > > > > > > > Looking too at your logs I see this gap in the zk pinging: > > > > > > > > > > > > > > > 1. 2009-11-12 15:03:39,325 DEBUG org.apache.zookeeper.ClientCnxn: > > Got > > > > > ping response for sessionid:0x224e55436ad0004 after 0ms > > > > > 2. 2009-11-12 15:03:43,113 DEBUG org.apache.zookeeper.ClientCnxn: > > Got > > > > > ping response for sessionid:0x24e55436a0007d after 0ms > > > > > > > > > > Where in the lines above it, its reporting about every ten seconds, > > > here > > > > > there is a big gap. > > > > > > > > > > Do you have ganglia or something that will let you look more into > > what > > > > was > > > > > happening on this machine around this time? Is the machine OK? It > > > looks > > > > > lightly loaded and you have your cluster nicely laid out. > Something > > > odd > > > > is > > > > > going on. What about things like the write speed to disk? In the > > past > > > > > strange issues have been explained by incorrectly set BIOS which > made > > > > disks > > > > > run at 1/100th of their proper speed. > > > > > > > > > > St.Ack > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Best, > > > > > > zhenyu > > > > > > > > > > > > > > > > > > On Wed, Nov 11, 2009 at 3:58 PM, Zhenyu Zhong < > > > [email protected] > > > > > > >wrote: > > > > > > > > > > > > > Stack > > > > > > > > > > > > > > I am very appreciated for your comments. > > > > > > > I will use the zookeeper monitoring script on my cluster and > let > > it > > > > run > > > > > > > overnight to see the result. > > > > > > > I will follow up the thread when I get anything. > > > > > > > > > > > > > > thanks > > > > > > > zhenyu > > > > > > > > > > > > > > > > > > > > > On Wed, Nov 11, 2009 at 3:52 PM, stack <[email protected]> > wrote: > > > > > > > > > > > > > >> I see these in your log too: > > > > > > >> > > > > > > >> > > > > > > >> 1. 2009-11-11 04:27:19,018 DEBUG > > > org.apache.zookeeper.ClientCnxn: > > > > > Got > > > > > > >> ping response for sessionid:0x424dfd908c50009 after 4544ms > > > > > > >> 2. 2009-11-11 04:27:19,018 DEBUG > > > org.apache.zookeeper.ClientCnxn: > > > > > Got > > > > > > >> ping response for sessionid:0x24dfd90c810002 after 4548ms > > > > > > >> 3. 2009-11-11 04:27:43,960 DEBUG > > > org.apache.zookeeper.ClientCnxn: > > > > > Got > > > > > > >> ping response for sessionid:0x424dfd908c50009 after 9030ms > > > > > > >> 4. 2009-11-11 04:27:43,960 DEBUG > > > org.apache.zookeeper.ClientCnxn: > > > > > Got > > > > > > >> ping response for sessionid:0x24dfd90c810002 after 9030ms > > > > > > >> 5. 2009-11-11 04:27:48,830 DEBUG > > > org.apache.zookeeper.ClientCnxn: > > > > > Got > > > > > > >> ping response for sessionid:0x424dfd908c50009 after 0ms > > > > > > >> 6. 2009-11-11 04:27:48,830 DEBUG > > > org.apache.zookeeper.ClientCnxn: > > > > > Got > > > > > > >> ping response for sessionid:0x24dfd90c810002 after 0ms > > > > > > >> > > > > > > >> Which would indicate lag talking to zk too. > > > > > > >> > > > > > > >> Appending HLog also seems to get steadily worse. > > > > > > >> > > > > > > >> I also see these: > > > > > > >> > > > > > > >> > > > > > > >> 1. 2009-11-11 04:27:01,233 WARN > > org.apache.zookeeper.ClientCnxn: > > > > > > >> Exception closing session 0x24dfd90c810002 to > > > > > > >> sun.nio.ch.selectionkeyi...@5e16a7e3 > > > > > > >> 2. java.io.IOException: Read error rc = -1 > > > > > > >> java.nio.DirectByteBuffer[pos=0 lim=4 cap=4] > > > > > > >> 3. at > > > > > > >> > > > > > > org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701) > > > > > > >> 4. at > > > > > > >> > > > > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945) > > > > > > >> > > > > > > >> > > > > > > >> Were all 5 quorum members participating? Check by doing "echo > > > > stat|nc > > > > > > >> QUORUM_HOST port" for each member of quorum.... the below > might > > > > help: > > > > > > >> > > > > > > >> members="1,2,3,4,5" > > > > > > >> while true; do date; for i in $members; do echo $i; echo stat > | > > nc > > > > $i > > > > > > >> 2181; > > > > > > >> done; sleep 10; done > > > > > > >> > > > > > > >> We need to improve our zk_dump. It just reports quorum > > members... > > > > not > > > > > > if > > > > > > >> they are actually participating or not. The above will tell > you > > > if > > > > > dead > > > > > > >> or > > > > > > >> alive. > > > > > > >> > > > > > > >> Please report back after checking above and following the > > > > Jean-Daniel > > > > > > >> prescription. > > > > > > >> > > > > > > >> Thanks, > > > > > > >> St.Ack > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> On Wed, Nov 11, 2009 at 12:29 PM, Zhenyu Zhong < > > > > > [email protected] > > > > > > >> >wrote: > > > > > > >> > > > > > > >> > Thanks a lot for the help. > > > > > > >> > > > > > > > >> > I understand that the change of disk configurations should > be > > > done > > > > > > >> > progressively such that HDFS can balance and recover from a > > disk > > > > > going > > > > > > >> > away. > > > > > > >> > > > > > > > >> > The data we are trying to import into HBase is about > > 200million > > > > > rows, > > > > > > >> each > > > > > > >> > row is about hundreds of bytes. Basically, we read in a text > > > file > > > > > from > > > > > > >> > HDFS, > > > > > > >> > then for each row in the text file, we fetch data from > HBase > > > > table, > > > > > > we > > > > > > >> > compare the column value and update the latest value back > into > > > > > HBase. > > > > > > I > > > > > > >> was > > > > > > >> > suggested to use CheckAndPut. Anyway this is the common > > usecase > > > > on > > > > > my > > > > > > >> > side. > > > > > > >> > > > > > > > >> > > > > > > > >> > Many thanks > > > > > > >> > zhenyu > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > On Wed, Nov 11, 2009 at 3:03 PM, Jean-Daniel Cryans < > > > > > > >> [email protected] > > > > > > >> > >wrote: > > > > > > >> > > > > > > > >> > > 6 disks, 19 machines and you see those kind of issues? I'm > > > quite > > > > > > >> > > surprised and I want to learn more about what you trying > to > > > to. > > > > > > >> > > > > > > > > >> > > But first, if hbase is managing ZK for you then you can > > change > > > > the > > > > > > >> > > value of hbase.zookeeper.property.dataDir so that it > points > > to > > > > its > > > > > > own > > > > > > >> > > mount. Make sure you also change dfs.data.dir in > > > > > conf/hdfs-site.xml > > > > > > >> > > (hadoop stuff) so that it stops pointing there (you should > > > have > > > > 6 > > > > > > >> > > values there at the moment if you configured it > correctly). > > > > Doing > > > > > > this > > > > > > >> > > could result in data loss if you do it all at once so do > the > > > > > change > > > > > > >> > > progressively. This also means that you may have to keep a > > > > > seperate > > > > > > >> > > HDFS configuration for those particular nodes. Please ask > > more > > > > > > >> > > questions if it's not very clear. > > > > > > >> > > > > > > > > >> > > WRT to dedicating nodes to Zookeeeper, it makes sense only > > if > > > > you > > > > > > have > > > > > > >> > > a lot of nodes but in your case, unless there's also CPU > > > > > contention, > > > > > > >> > > you're good with only a dedicated disk. > > > > > > >> > > > > > > > > >> > > Now for what you are trying to do, is this a massive > import > > > that > > > > > > goes > > > > > > >> > > over many hours? How big is each row? I'd love to get a > lot > > of > > > > > > details > > > > > > >> > > to help you optimize your cluster for your particular > > problem. > > > > > > >> > > > > > > > > >> > > Thx, > > > > > > >> > > > > > > > > >> > > J-D > > > > > > >> > > > > > > > > >> > > On Wed, Nov 11, 2009 at 11:51 AM, Zhenyu Zhong < > > > > > > >> [email protected]> > > > > > > >> > > wrote: > > > > > > >> > > > J-D, > > > > > > >> > > > > > > > > > >> > > > Thanks a lot for the insightful comments. > > > > > > >> > > > We are running 5 zookeeper quorums now. On these 5 > quorum > > > > > servers, > > > > > > >> we > > > > > > >> > > also > > > > > > >> > > > run datanode, tasktracker and regionserver. > > > > > > >> > > > Now we have 6 disks per machine. > > > > > > >> > > > If it is the disk IO saturation, is there anyway to > > > configure > > > > > > >> zookeeper > > > > > > >> > > to > > > > > > >> > > > write to a seperate disk which is not the same one used > by > > > > HDFS? > > > > > > >> > > > Or shall we run zookeeper quorums on different nodes > other > > > > than > > > > > > >> HDFS? > > > > > > >> > > > > > > > > > >> > > > Are HLogs generated by Regionservers? Would it alleviate > > the > > > > > issue > > > > > > >> if > > > > > > >> > we > > > > > > >> > > use > > > > > > >> > > > the first 14 for regionserver, and the rest of 5 for > > > > > zookeeper?(In > > > > > > >> that > > > > > > >> > > > sense, the node running zookeeper quorum won't run > > > > regionserver) > > > > > > >> > > > > > > > > > >> > > > I am very appreciated for your comments. > > > > > > >> > > > > > > > > > >> > > > Best, > > > > > > >> > > > zhenyu > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > On Wed, Nov 11, 2009 at 2:30 PM, Jean-Daniel Cryans < > > > > > > >> > [email protected] > > > > > > >> > > >wrote: > > > > > > >> > > > > > > > > > >> > > >> The problem seems to be that the IO is saturated. When > > you > > > > see > > > > > > >> this: > > > > > > >> > > >> > > > > > > >> > > >> # > > > > > > >> > > >> 2009-11-11 04:28:10,308 WARN > > > > > > >> > > >> org.apache.hadoop.hbase.regionserver.HLog: IPC Server > > > handler > > > > 9 > > > > > > on > > > > > > >> > > >> 60021 took 6819ms appending an edit to hlog; > > editcount=370 > > > > > > >> > > >> # > > > > > > >> > > >> 2009-11-11 04:28:10,308 WARN > > > > > > >> > > >> org.apache.hadoop.hbase.regionserver.HLog: IPC Server > > > handler > > > > 2 > > > > > > on > > > > > > >> > > >> 60021 took 4412ms appending an edit to hlog; > > editcount=371 > > > > > > >> > > >> > > > > > > >> > > >> It's a sign that HDFS is slow. I guess that the > zookeeper > > > > > quorum > > > > > > >> > > >> members are on those same nodes using the same disks? > If > > > so, > > > > > > >> there's > > > > > > >> > > >> the problem because Zookeeper needs to be able to > commit > > > > > > >> transactions > > > > > > >> > > >> very fast and if your disks are already hogged then > they > > > have > > > > > to > > > > > > >> wait > > > > > > >> > > >> and you can get disconnections. Do you still have 19 > > quorum > > > > > > >> members? > > > > > > >> > > >> How many disks do you have per machine? > > > > > > >> > > >> > > > > > > >> > > >> Thx > > > > > > >> > > >> > > > > > > >> > > >> J-D > > > > > > >> > > >> > > > > > > >> > > >> On Wed, Nov 11, 2009 at 11:22 AM, Zhenyu Zhong < > > > > > > >> > [email protected] > > > > > > >> > > > > > > > > > >> > > >> wrote: > > > > > > >> > > >> > Please take a look at the following regionserver log. > > > > > > >> > > >> > The error started to come out from 4:24am UTC to > 4:29am > > > > UTC. > > > > > > >> > > >> > > > > > > > >> > > >> > http://pastebin.com/m3717d064 > > > > > > >> > > >> > > > > > > > >> > > >> > Thanks in advance. > > > > > > >> > > >> > > > > > > > >> > > >> > Best, > > > > > > >> > > >> > zhenyu > > > > > > >> > > >> > > > > > > > >> > > >> > > > > > > > >> > > >> > On Wed, Nov 11, 2009 at 2:01 PM, Jean-Daniel Cryans < > > > > > > >> > > [email protected] > > > > > > >> > > >> >wrote: > > > > > > >> > > >> > > > > > > > >> > > >> >> Could you paste a bigger log trace (interested to > see > > at > > > > > least > > > > > > >> all > > > > > > >> > > 300 > > > > > > >> > > >> >> lines before that) in a pastebin? > > > > > > >> > > >> >> > > > > > > >> > > >> >> Thx, > > > > > > >> > > >> >> > > > > > > >> > > >> >> J-D > > > > > > >> > > >> >> > > > > > > >> > > >> >> On Wed, Nov 11, 2009 at 10:56 AM, Zhenyu Zhong < > > > > > > >> > > [email protected] > > > > > > >> > > >> > > > > > > > >> > > >> >> wrote: > > > > > > >> > > >> >> > J-D, > > > > > > >> > > >> >> > > > > > > > >> > > >> >> > Thanks for the reply. > > > > > > >> > > >> >> > > > > > > > >> > > >> >> > The problem I have is one or more of the > > regionservers > > > > > > >> disconnect > > > > > > >> > > from > > > > > > >> > > >> >> the > > > > > > >> > > >> >> > cluster. > > > > > > >> > > >> >> > > > > > > > >> > > >> >> > I checked the resource usage including memory Heap > > > usage > > > > > for > > > > > > >> > > >> >> regionserver, > > > > > > >> > > >> >> > and found nothing suspicious. The regionserver > heap > > > > usage > > > > > is > > > > > > >> way > > > > > > >> > > below > > > > > > >> > > >> >> the > > > > > > >> > > >> >> > maxHeap defined in jvm. > > > > > > >> > > >> >> > > > > > > > >> > > >> >> > So I checked zookeeper log, then I found the > > following > > > > > > >> > > >> >> > BufferUnderflowException, eventually it leads to > > > > > > >> SocketException: > > > > > > >> > > >> >> Transport > > > > > > >> > > >> >> > endpoint is not connected. > > > > > > >> > > >> >> > > > > > > > >> > > >> >> > > > > > > > >> > > >> >> > I also checked the regionserver log. It looks like > > the > > > > > > >> > regionserver > > > > > > >> > > >> tried > > > > > > >> > > >> >> to > > > > > > >> > > >> >> > connect to other zookeeper quorum server. It > > connects > > > > > > >> > successfully, > > > > > > >> > > >> but > > > > > > >> > > >> >> then > > > > > > >> > > >> >> > it complained about SocketException: transport > > > endpoint > > > > is > > > > > > not > > > > > > >> > > >> connected. > > > > > > >> > > >> >> > > > > > > > >> > > >> >> > So both zookeeper log and regionserver log told me > > > that > > > > > the > > > > > > >> > > transport > > > > > > >> > > >> >> > endpoint is not connected. I don't know why. > > > > > > >> > > >> >> > > > > > > > >> > > >> >> > I would be very appreciated if you could point me > to > > > the > > > > > > right > > > > > > >> > > >> direction. > > > > > > >> > > >> >> > > > > > > > >> > > >> >> > Best, > > > > > > >> > > >> >> > > > > > > > >> > > >> >> > > > > > > > >> > > >> >> > Regionserver log: > > > > > > >> > > >> >> > 2009-11-11 04:24:53,718 INFO > > > > > > org.apache.zookeeper.ClientCnxn: > > > > > > >> > > >> Attempting > > > > > > >> > > >> >> > connection to server superpyxis0018/ > > > > 192.168.100.132:2181 > > > > > > >> > > >> >> > 2009-11-11 04:24:53,718 INFO > > > > > > org.apache.zookeeper.ClientCnxn: > > > > > > >> > > Priming > > > > > > >> > > >> >> > connection to > > > java.nio.channels.SocketChannel[connected > > > > > > >> local=/ > > > > > > >> > > >> >> > 192.168.100.131:43065 remote=superpyxis0018/ > > > > > > >> 192.168.100.132:2181 > > > > > > >> > ] > > > > > > >> > > >> >> > 2009-11-11 04:24:53,718 INFO > > > > > > org.apache.zookeeper.ClientCnxn: > > > > > > >> > > Server > > > > > > >> > > >> >> > connection successful > > > > > > >> > > >> >> > 2009-11-11 04:24:53,768 INFO > > > > > > >> > > >> >> > > org.apache.hadoop.hbase.regionserver.HRegionServer: > > > Got > > > > > > >> ZooKeeper > > > > > > >> > > >> event, > > > > > > >> > > >> >> > state: Disconnected, type: None, path: null > > > > > > >> > > >> >> > 2009-11-11 04:24:54,118 INFO > > > > > > org.apache.zookeeper.ClientCnxn: > > > > > > >> > > >> Attempting > > > > > > >> > > >> >> > connection to server superpyxis0017/ > > > > 192.168.100.131:2181 > > > > > > >> > > >> >> > 2009-11-11 04:24:54,118 INFO > > > > > > org.apache.zookeeper.ClientCnxn: > > > > > > >> > > Priming > > > > > > >> > > >> >> > connection to > > > java.nio.channels.SocketChannel[connected > > > > > > >> local=/ > > > > > > >> > > >> >> > 192.168.100.131:46903 remote=superpyxis0017/ > > > > > > >> 192.168.100.131:2181 > > > > > > >> > ] > > > > > > >> > > >> >> > 2009-11-11 04:24:54,118 INFO > > > > > > org.apache.zookeeper.ClientCnxn: > > > > > > >> > > Server > > > > > > >> > > >> >> > connection successful > > > > > > >> > > >> >> > 2009-11-11 04:24:54,118 WARN > > > > > > org.apache.zookeeper.ClientCnxn: > > > > > > >> > > >> Exception > > > > > > >> > > >> >> > closing session 0x24dfd90c810002 to > > > > > > >> > > >> sun.nio.ch.selectionkeyi...@2dd78ccf > > > > > > >> > > >> >> > java.io.IOException: Read error rc = -1 > > > > > > >> > > >> java.nio.DirectByteBuffer[pos=0 > > > > > > >> > > >> >> > lim=4 cap=4] > > > > > > >> > > >> >> > at > > > > > > >> > > >> >> > > > > > > > >> > > > > > > > > org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701) > > > > > > >> > > >> >> > at > > > > > > >> > > >> >> > > > > > > > >> > > > > > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945) > > > > > > >> > > >> >> > 2009-11-11 04:24:54,118 WARN > > > > > > org.apache.zookeeper.ClientCnxn: > > > > > > >> > > Ignoring > > > > > > >> > > >> >> > exception during shutdown input > > > > > > >> > > >> >> > java.net.SocketException: Transport endpoint is > not > > > > > > connected > > > > > > >> > > >> >> > at > > sun.nio.ch.SocketChannelImpl.shutdown(Native > > > > > > Method) > > > > > > >> > > >> >> > at > > > > > > >> > > >> >> > > > > > > > >> > > > > > > > > > > > sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640) > > > > > > >> > > >> >> > at > > > > > > >> > > >> > > > > sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360) > > > > > > >> > > >> >> > at > > > > > > >> > > >> >> > > > > > > > >> > > >> > > > > > > >> > > > > > > > > > > org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999) > > > > > > >> > > >> >> > at > > > > > > >> > > >> >> > > > > > > > >> > > > > > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970) > > > > > > >> > > >> >> > > > > > > > >> > > >> >> > > > > > > > >> > > >> >> > > > > > > > >> > > >> >> > > > > > > > >> > > >> >> > > > > > > > >> > > >> >> > On Wed, Nov 11, 2009 at 1:40 PM, Jean-Daniel > Cryans > > < > > > > > > >> > > >> [email protected] > > > > > > >> > > >> >> >wrote: > > > > > > >> > > >> >> > > > > > > > >> > > >> >> >> I'm really not sure exactly what the situation is > > > with > > > > > the > > > > > > >> > > >> >> >> java.nio.BufferUnderflowException, but what do > you > > > mean > > > > > by > > > > > > >> > > >> >> >> "regionserver disconnection"? Could you give more > > > > > details? > > > > > > >> > > >> >> >> > > > > > > >> > > >> >> >> Thx, > > > > > > >> > > >> >> >> > > > > > > >> > > >> >> >> J-D > > > > > > >> > > >> >> >> > > > > > > >> > > >> >> >> On Wed, Nov 11, 2009 at 10:34 AM, Zhenyu Zhong < > > > > > > >> > > >> [email protected] > > > > > > >> > > >> >> > > > > > > > >> > > >> >> >> wrote: > > > > > > >> > > >> >> >> > I searched around and found HBASE-1645 "HBase > > could > > > > be > > > > > a > > > > > > >> > little > > > > > > >> > > >> more > > > > > > >> > > >> >> >> dainty > > > > > > >> > > >> >> >> > when zk is pulled out from under it" > > > > > > >> > > >> >> >> > > > > > > > >> > > >> >> >> > I am wondering whether my problem is related to > > > this > > > > > > >> > HBase-1645. > > > > > > >> > > Or > > > > > > >> > > >> >> >> probably > > > > > > >> > > >> >> >> > has nothing to do with it? > > > > > > >> > > >> >> >> > > > > > > > >> > > >> >> >> > thanks > > > > > > >> > > >> >> >> > > > > > > > >> > > >> >> >> > > > > > > > >> > > >> >> >> > On Wed, Nov 11, 2009 at 12:02 PM, Zhenyu Zhong > < > > > > > > >> > > >> >> [email protected] > > > > > > >> > > >> >> >> >wrote: > > > > > > >> > > >> >> >> > > > > > > > >> > > >> >> >> >> Hi, > > > > > > >> > > >> >> >> >> > > > > > > >> > > >> >> >> >> I have been monitoring the system resources > such > > > as > > > > > > memory > > > > > > >> > Heap > > > > > > >> > > >> usage > > > > > > >> > > >> >> >> for > > > > > > >> > > >> >> >> >> Regionserver, load average, network bandwidth, > > > etc. > > > > > > >> > > >> >> >> >> > > > > > > >> > > >> >> >> >> I experienced a disconnection last night. When > I > > > > > checked > > > > > > >> the > > > > > > >> > > >> memory > > > > > > >> > > >> >> Heap > > > > > > >> > > >> >> >> >> for Regionserver, it was 2336 which is way > below > > > the > > > > > max > > > > > > >> Heap > > > > > > >> > > >> 4096. I > > > > > > >> > > >> >> >> don't > > > > > > >> > > >> >> >> >> think it is Regionserver problem. The requests > > > that > > > > > the > > > > > > >> > > >> regionserver > > > > > > >> > > >> >> got > > > > > > >> > > >> >> >> was > > > > > > >> > > >> >> >> >> about 126 which doesn't seem to be high. The > > load > > > > > > average > > > > > > >> is > > > > > > >> > > kind > > > > > > >> > > >> of > > > > > > >> > > >> >> >> high > > > > > > >> > > >> >> >> >> >100. (but I saw disconnections while the load > > > > average > > > > > > is > > > > > > >> > > around > > > > > > >> > > >> 30). > > > > > > >> > > >> >> >> The > > > > > > >> > > >> >> >> >> bandwidth is not a problem. > > > > > > >> > > >> >> >> >> > > > > > > >> > > >> >> >> >> However, I found error messages from the > > > zookeeper. > > > > > > Please > > > > > > >> > see > > > > > > >> > > the > > > > > > >> > > >> >> >> >> following error messages. There are some > > > > > > >> > > >> >> >> >> BufferUnderflowException error, IOException > > > > error,etc. > > > > > > >> > > >> >> >> >> May I ask what might be the causes that > generate > > > > these > > > > > > >> error > > > > > > >> > > >> >> messages? > > > > > > >> > > >> >> >> >> > > > > > > >> > > >> >> >> >> Many thanks > > > > > > >> > > >> >> >> >> zhenyu > > > > > > >> > > >> >> >> >> > > > > > > >> > > >> >> >> >> > > > > > > >> > > >> >> >> >> 2009-11-11 04:29:00,810 WARN > > > > > > >> > > org.apache.zookeeper.server.Request: > > > > > > >> > > >> >> >> Ignoring > > > > > > >> > > >> >> >> >> exception during toString > > > > > > >> > > >> >> >> >> java.nio.BufferUnderflowException > > > > > > >> > > >> >> >> >> at > > > > > > >> > java.nio.HeapByteBuffer.get(HeapByteBuffer.java:127) > > > > > > >> > > >> >> >> >> at > > > > > java.nio.ByteBuffer.get(ByteBuffer.java:675) > > > > > > >> > > >> >> >> >> at > > > > > > >> > > >> >> >> > > > > > > >> org.apache.zookeeper.server.Request.toString(Request.java:199) > > > > > > >> > > >> >> >> >> at > > > > java.lang.String.valueOf(String.java:2827) > > > > > > >> > > >> >> >> >> at > > > > > > >> > > java.lang.StringBuilder.append(StringBuilder.java:115) > > > > > > >> > > >> >> >> >> at > > > > > > >> > > >> >> >> >> > > > > > > >> > > >> >> >> > > > > > > >> > > >> >> > > > > > > >> > > >> > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > org.apache.zookeeper.server.quorum.CommitProcessor.processRequest(CommitProcessor.java:167) > > > > > > >> > > >> >> >> >> at > > > > > > >> > > >> >> >> >> > > > > > > >> > > >> >> >> > > > > > > >> > > >> >> > > > > > > >> > > >> > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > org.apache.zookeeper.server.quorum.FollowerRequestProcessor.run(FollowerRequestProcessor.java:68) > > > > > > >> > > >> >> >> >> 2009-11-11 04:29:00,810 WARN > > > > > > >> > > >> >> org.apache.zookeeper.server.NIOServerCnxn: > > > > > > >> > > >> >> >> >> Exception causing close of session > > > 0x424dfd908c50009 > > > > > > >> > > >> >> >> >> due to java.io.IOException: Read error > > > > > > >> > > >> >> >> >> 2009-11-11 04:29:00,810 DEBUG > > > > > > >> > > >> >> org.apache.zookeeper.server.NIOServerCnxn: > > > > > > >> > > >> >> >> >> IOException stack trace > > > > > > >> > > >> >> >> >> java.io.IOException: Read error > > > > > > >> > > >> >> >> >> at > > > > > > >> > > >> >> >> >> > > > > > > >> > > >> >> > > > > > > >> > > > > > > > > > > > org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:372) > > > > > > >> > > >> >> >> >> at > > > > > > >> > > >> >> >> >> > > > > > > >> > > >> >> >> > > > > > > >> > > >> >> > > > > > > >> > > >> > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:221) > > > > > > >> > > >> >> >> >> 2009-11-11 04:29:00,810 DEBUG > > > > > > >> > > >> >> >> >> > > > org.apache.zookeeper.server.quorum.CommitProcessor: > > > > > > >> > Processing > > > > > > >> > > >> >> request:: > > > > > > >> > > >> >> >> >> sessionid:0x424dfd908c50009 type:setWatches > > > > > > >> > > >> cxid:0xfffffffffffffff8 > > > > > > >> > > >> >> >> >> zxid:0xfffffffffffffffe txntype:unknown n/a > > > > > > >> > > >> >> >> >> 2009-11-11 04:29:00,810 WARN > > > > > > >> > > org.apache.zookeeper.server.Request: > > > > > > >> > > >> >> >> Ignoring > > > > > > >> > > >> >> >> >> exception during toString > > > > > > >> > > >> >> >> >> java.nio.BufferUnderflowException > > > > > > >> > > >> >> >> >> at > > > > > > >> > java.nio.HeapByteBuffer.get(HeapByteBuffer.java:127) > > > > > > >> > > >> >> >> >> at > > > > > java.nio.ByteBuffer.get(ByteBuffer.java:675) > > > > > > >> > > >> >> >> >> at > > > > > > >> > > >> >> >> > > > > > > >> org.apache.zookeeper.server.Request.toString(Request.java:199) > > > > > > >> > > >> >> >> >> at > > > > java.lang.String.valueOf(String.java:2827) > > > > > > >> > > >> >> >> >> at > > > > > > >> > > java.lang.StringBuilder.append(StringBuilder.java:115) > > > > > > >> > > >> >> >> >> at > > > > > > >> > > >> >> >> >> > > > > > > >> > > >> >> >> > > > > > > >> > > >> >> > > > > > > >> > > >> > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:74) > > > > > > >> > > >> >> >> >> at > > > > > > >> > > >> >> >> >> > > > > > > >> > > >> >> >> > > > > > > >> > > >> >> > > > > > > >> > > >> > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73) > > > > > > >> > > >> >> >> >> 2009-11-11 04:29:00,810 DEBUG > > > > > > >> > > >> >> >> >> > > org.apache.zookeeper.server.FinalRequestProcessor: > > > > > > >> Processing > > > > > > >> > > >> >> request:: > > > > > > >> > > >> >> >> >> sessionid:0x424dfd908c50009 type:setWatches > > > > > > >> > > >> cxid:0xfffffffffffffff8 > > > > > > >> > > >> >> >> >> zxid:0xfffffffffffffffe txntype:unknown n/a > > > > > > >> > > >> >> >> >> 2009-11-11 04:29:00,810 WARN > > > > > > >> > > org.apache.zookeeper.server.Request: > > > > > > >> > > >> >> >> Ignoring > > > > > > >> > > >> >> >> >> exception during toString > > > > > > >> > > >> >> >> >> java.nio.BufferUnderflowException > > > > > > >> > > >> >> >> >> at > > > > > > >> > java.nio.HeapByteBuffer.get(HeapByteBuffer.java:127) > > > > > > >> > > >> >> >> >> at > > > > > java.nio.ByteBuffer.get(ByteBuffer.java:675) > > > > > > >> > > >> >> >> >> at > > > > > > >> > > >> >> >> > > > > > > >> org.apache.zookeeper.server.Request.toString(Request.java:199) > > > > > > >> > > >> >> >> >> at > > > > > > >> > > >> >> >> >> > > > > > > >> > > >> > > > > > > >> > > > > org.apache.log4j.or.DefaultRenderer.doRender(DefaultRenderer.java:36) > > > > > > >> > > >> >> >> >> at > > > > > > >> > > >> >> >> >> > > > > > > >> > > > > > > org.apache.log4j.or.RendererMap.findAndRender(RendererMap.java:80) > > > > > > >> > > >> >> >> >> at > > > > > > >> > > >> >> >> >> > > > > > > >> > > >> >> >> > > > > > > >> > > >> >> > > > > > > >> > > >> > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > org.apache.log4j.spi.LoggingEvent.getRenderedMessage(LoggingEvent.java:362) > > > > > > >> > > >> >> >> >> at > > > > > > >> > > >> >> >> >> > > > > > > >> > > >> >> >> > > > > > > >> > > >> >> > > > > > > >> > > >> > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > org.apache.log4j.helpers.PatternParser$BasicPatternConverter.convert(PatternParser.java:403) > > > > > > >> > > >> >> >> >> at > > > > > > >> > > >> >> >> >> > > > > > > >> > > >> >> >> > > > > > > >> > > >> >> > > > > > > >> > > >> > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > org.apache.log4j.helpers.PatternConverter.format(PatternConverter.java:65) > > > > > > >> > > >> >> >> >> at > > > > > > >> > > >> >> > > > > > org.apache.log4j.PatternLayout.format(PatternLayout.java:502) > > > > > > >> > > >> >> >> >> at > > > > > > >> > > >> >> >> >> > > > > > > >> > > > > > > org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:302) > > > > > > >> > > >> >> >> >> at > > > > > > >> > > >> >> >> >> > > > > > > >> > > >> >> >> > > > > > > >> > > >> >> > > > > > > >> > > >> > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > org.apache.log4j.DailyRollingFileAppender.subAppend(DailyRollingFileAppender.java:359) > > > > > > >> > > >> >> >> >> at > > > > > > >> > > >> >> >> > > > > > > >> > org.apache.log4j.WriterAppender.append(WriterAppender.java:160) > > > > > > >> > > >> >> >> >> at > > > > > > >> > > >> >> >> >> > > > > > > >> > > >> > > > > > > >> > > > > org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251) > > > > > > >> > > >> >> >> >> at > > > > > > >> > > >> >> >> >> > > > > > > >> > > >> >> >> > > > > > > >> > > >> >> > > > > > > >> > > >> > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66) > > > > > > >> > > >> >> >> >> at > > > > > > >> > > >> > > org.apache.log4j.Category.callAppenders(Category.java:206) > > > > > > >> > > >> >> >> >> at > > > > > > >> > > org.apache.log4j.Category.forcedLog(Category.java:391) > > > > > > >> > > >> >> >> >> at > > > > > > >> org.apache.log4j.Category.debug(Category.java:260) > > > > > > >> > > >> >> >> >> at > > > > > > >> > > >> >> >> >> > > > > > > >> > > >> >> >> > > > > > > >> > > >> >> > > > > > > >> > > >> > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:147) > > > > > > >> > > >> >> >> >> at > > > > > > >> > > >> >> >> >> > > > > > > >> > > >> >> >> > > > > > > >> > > >> >> > > > > > > >> > > >> > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73) > > > > > > >> > > >> >> >> >> 2009-11-11 04:29:00,810 WARN > > > > > > >> > > org.apache.zookeeper.server.Request: > > > > > > >> > > >> >> >> Ignoring > > > > > > >> > > >> >> >> >> exception during toString > > > > > > >> > > >> >> >> >> sessionid:0x424dfd908c50009 type:setWatches > > > > > > >> > > >> cxid:0xfffffffffffffff8 > > > > > > >> > > >> >> >> >> zxid:0xfffffffffffffffe txntype:unknown n/a > > > > > > >> > > >> >> >> >> 2009-11-11 04:29:00,810 INFO > > > > > > >> > > >> >> org.apache.zookeeper.server.NIOServerCnxn: > > > > > > >> > > >> >> >> >> closing session:0x424dfd908c50009 > NIOServerCnxn: > > > > > > >> > > >> >> >> >> java.nio.channels.SocketChannel[connected > > local=/ > > > > > > >> > > >> >> 192.168.100.131:2181 > > > > > > >> > > >> >> >> remote=/ > > > > > > >> > > >> >> >> >> 192.168.100.131:35500] > > > > > > >> > > >> >> >> >> 2009-11-11 04:29:00,810 DEBUG > > > > > > >> > > >> >> org.apache.zookeeper.server.NIOServerCnxn: > > > > > > >> > > >> >> >> >> ignoring exception during input shutdown > > > > > > >> > > >> >> >> >> java.net.SocketException: Transport endpoint > is > > > not > > > > > > >> connected > > > > > > >> > > >> >> >> >> at > > > > > sun.nio.ch.SocketChannelImpl.shutdown(Native > > > > > > >> > Method) > > > > > > >> > > >> >> >> >> at > > > > > > >> > > >> >> >> >> > > > > > > >> > > >> >> > > > > > > >> > > > > > > > > > > > sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640) > > > > > > >> > > >> >> >> >> at > > > > > > >> > > >> >> >> > > > > > > >> sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360) > > > > > > >> > > >> >> >> >> at > > > > > > >> > > >> >> >> >> > > > > > > >> > > >> >> > > > > > > >> > > > > > > > > >> > > > > > > > org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:848) > > > > > > >> > > >> >> >> >> at > > > > > > >> > > >> >> >> >> > > > > > > >> > > >> >> > > > > > > >> > > > > > > > > > > > org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:498) > > > > > > >> > > >> >> >> >> at > > > > > > >> > > >> >> >> >> > > > > > > >> > > >> >> >> > > > > > > >> > > >> >> > > > > > > >> > > >> > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:221) > > > > > > >> > > >> >> >> >> > > > > > > >> > > >> >> >> > > > > > > > >> > > >> >> >> > > > > > > >> > > >> >> > > > > > > > >> > > >> >> > > > > > > >> > > >> > > > > > > > >> > > >> > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
