Stack I am very appreciated for your comments. I will use the zookeeper monitoring script on my cluster and let it run overnight to see the result. I will follow up the thread when I get anything.
thanks zhenyu On Wed, Nov 11, 2009 at 3:52 PM, stack <[email protected]> wrote: > I see these in your log too: > > > 1. 2009-11-11 04:27:19,018 DEBUG org.apache.zookeeper.ClientCnxn: Got > ping response for sessionid:0x424dfd908c50009 after 4544ms > 2. 2009-11-11 04:27:19,018 DEBUG org.apache.zookeeper.ClientCnxn: Got > ping response for sessionid:0x24dfd90c810002 after 4548ms > 3. 2009-11-11 04:27:43,960 DEBUG org.apache.zookeeper.ClientCnxn: Got > ping response for sessionid:0x424dfd908c50009 after 9030ms > 4. 2009-11-11 04:27:43,960 DEBUG org.apache.zookeeper.ClientCnxn: Got > ping response for sessionid:0x24dfd90c810002 after 9030ms > 5. 2009-11-11 04:27:48,830 DEBUG org.apache.zookeeper.ClientCnxn: Got > ping response for sessionid:0x424dfd908c50009 after 0ms > 6. 2009-11-11 04:27:48,830 DEBUG org.apache.zookeeper.ClientCnxn: Got > ping response for sessionid:0x24dfd90c810002 after 0ms > > Which would indicate lag talking to zk too. > > Appending HLog also seems to get steadily worse. > > I also see these: > > > 1. 2009-11-11 04:27:01,233 WARN org.apache.zookeeper.ClientCnxn: > Exception closing session 0x24dfd90c810002 to > sun.nio.ch.selectionkeyi...@5e16a7e3 > 2. java.io.IOException: Read error rc = -1 > java.nio.DirectByteBuffer[pos=0 lim=4 cap=4] > 3. at > org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701) > 4. at > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945) > > > Were all 5 quorum members participating? Check by doing "echo stat|nc > QUORUM_HOST port" for each member of quorum.... the below might help: > > members="1,2,3,4,5" > while true; do date; for i in $members; do echo $i; echo stat | nc $i 2181; > done; sleep 10; done > > We need to improve our zk_dump. It just reports quorum members... not if > they are actually participating or not. The above will tell you if dead or > alive. > > Please report back after checking above and following the Jean-Daniel > prescription. > > Thanks, > St.Ack > > > > On Wed, Nov 11, 2009 at 12:29 PM, Zhenyu Zhong <[email protected] > >wrote: > > > Thanks a lot for the help. > > > > I understand that the change of disk configurations should be done > > progressively such that HDFS can balance and recover from a disk going > > away. > > > > The data we are trying to import into HBase is about 200million rows, > each > > row is about hundreds of bytes. Basically, we read in a text file from > > HDFS, > > then for each row in the text file, we fetch data from HBase table, we > > compare the column value and update the latest value back into HBase. I > was > > suggested to use CheckAndPut. Anyway this is the common usecase on my > > side. > > > > > > Many thanks > > zhenyu > > > > > > > > On Wed, Nov 11, 2009 at 3:03 PM, Jean-Daniel Cryans <[email protected] > > >wrote: > > > > > 6 disks, 19 machines and you see those kind of issues? I'm quite > > > surprised and I want to learn more about what you trying to to. > > > > > > But first, if hbase is managing ZK for you then you can change the > > > value of hbase.zookeeper.property.dataDir so that it points to its own > > > mount. Make sure you also change dfs.data.dir in conf/hdfs-site.xml > > > (hadoop stuff) so that it stops pointing there (you should have 6 > > > values there at the moment if you configured it correctly). Doing this > > > could result in data loss if you do it all at once so do the change > > > progressively. This also means that you may have to keep a seperate > > > HDFS configuration for those particular nodes. Please ask more > > > questions if it's not very clear. > > > > > > WRT to dedicating nodes to Zookeeeper, it makes sense only if you have > > > a lot of nodes but in your case, unless there's also CPU contention, > > > you're good with only a dedicated disk. > > > > > > Now for what you are trying to do, is this a massive import that goes > > > over many hours? How big is each row? I'd love to get a lot of details > > > to help you optimize your cluster for your particular problem. > > > > > > Thx, > > > > > > J-D > > > > > > On Wed, Nov 11, 2009 at 11:51 AM, Zhenyu Zhong < > [email protected]> > > > wrote: > > > > J-D, > > > > > > > > Thanks a lot for the insightful comments. > > > > We are running 5 zookeeper quorums now. On these 5 quorum servers, we > > > also > > > > run datanode, tasktracker and regionserver. > > > > Now we have 6 disks per machine. > > > > If it is the disk IO saturation, is there anyway to configure > zookeeper > > > to > > > > write to a seperate disk which is not the same one used by HDFS? > > > > Or shall we run zookeeper quorums on different nodes other than HDFS? > > > > > > > > Are HLogs generated by Regionservers? Would it alleviate the issue if > > we > > > use > > > > the first 14 for regionserver, and the rest of 5 for zookeeper?(In > that > > > > sense, the node running zookeeper quorum won't run regionserver) > > > > > > > > I am very appreciated for your comments. > > > > > > > > Best, > > > > zhenyu > > > > > > > > > > > > > > > > On Wed, Nov 11, 2009 at 2:30 PM, Jean-Daniel Cryans < > > [email protected] > > > >wrote: > > > > > > > >> The problem seems to be that the IO is saturated. When you see this: > > > >> > > > >> # > > > >> 2009-11-11 04:28:10,308 WARN > > > >> org.apache.hadoop.hbase.regionserver.HLog: IPC Server handler 9 on > > > >> 60021 took 6819ms appending an edit to hlog; editcount=370 > > > >> # > > > >> 2009-11-11 04:28:10,308 WARN > > > >> org.apache.hadoop.hbase.regionserver.HLog: IPC Server handler 2 on > > > >> 60021 took 4412ms appending an edit to hlog; editcount=371 > > > >> > > > >> It's a sign that HDFS is slow. I guess that the zookeeper quorum > > > >> members are on those same nodes using the same disks? If so, there's > > > >> the problem because Zookeeper needs to be able to commit > transactions > > > >> very fast and if your disks are already hogged then they have to > wait > > > >> and you can get disconnections. Do you still have 19 quorum members? > > > >> How many disks do you have per machine? > > > >> > > > >> Thx > > > >> > > > >> J-D > > > >> > > > >> On Wed, Nov 11, 2009 at 11:22 AM, Zhenyu Zhong < > > [email protected] > > > > > > > >> wrote: > > > >> > Please take a look at the following regionserver log. > > > >> > The error started to come out from 4:24am UTC to 4:29am UTC. > > > >> > > > > >> > http://pastebin.com/m3717d064 > > > >> > > > > >> > Thanks in advance. > > > >> > > > > >> > Best, > > > >> > zhenyu > > > >> > > > > >> > > > > >> > On Wed, Nov 11, 2009 at 2:01 PM, Jean-Daniel Cryans < > > > [email protected] > > > >> >wrote: > > > >> > > > > >> >> Could you paste a bigger log trace (interested to see at least > all > > > 300 > > > >> >> lines before that) in a pastebin? > > > >> >> > > > >> >> Thx, > > > >> >> > > > >> >> J-D > > > >> >> > > > >> >> On Wed, Nov 11, 2009 at 10:56 AM, Zhenyu Zhong < > > > [email protected] > > > >> > > > > >> >> wrote: > > > >> >> > J-D, > > > >> >> > > > > >> >> > Thanks for the reply. > > > >> >> > > > > >> >> > The problem I have is one or more of the regionservers > disconnect > > > from > > > >> >> the > > > >> >> > cluster. > > > >> >> > > > > >> >> > I checked the resource usage including memory Heap usage for > > > >> >> regionserver, > > > >> >> > and found nothing suspicious. The regionserver heap usage is > way > > > below > > > >> >> the > > > >> >> > maxHeap defined in jvm. > > > >> >> > > > > >> >> > So I checked zookeeper log, then I found the following > > > >> >> > BufferUnderflowException, eventually it leads to > SocketException: > > > >> >> Transport > > > >> >> > endpoint is not connected. > > > >> >> > > > > >> >> > > > > >> >> > I also checked the regionserver log. It looks like the > > regionserver > > > >> tried > > > >> >> to > > > >> >> > connect to other zookeeper quorum server. It connects > > successfully, > > > >> but > > > >> >> then > > > >> >> > it complained about SocketException: transport endpoint is not > > > >> connected. > > > >> >> > > > > >> >> > So both zookeeper log and regionserver log told me that the > > > transport > > > >> >> > endpoint is not connected. I don't know why. > > > >> >> > > > > >> >> > I would be very appreciated if you could point me to the right > > > >> direction. > > > >> >> > > > > >> >> > Best, > > > >> >> > > > > >> >> > > > > >> >> > Regionserver log: > > > >> >> > 2009-11-11 04:24:53,718 INFO org.apache.zookeeper.ClientCnxn: > > > >> Attempting > > > >> >> > connection to server superpyxis0018/192.168.100.132:2181 > > > >> >> > 2009-11-11 04:24:53,718 INFO org.apache.zookeeper.ClientCnxn: > > > Priming > > > >> >> > connection to java.nio.channels.SocketChannel[connected local=/ > > > >> >> > 192.168.100.131:43065 remote=superpyxis0018/ > 192.168.100.132:2181 > > ] > > > >> >> > 2009-11-11 04:24:53,718 INFO org.apache.zookeeper.ClientCnxn: > > > Server > > > >> >> > connection successful > > > >> >> > 2009-11-11 04:24:53,768 INFO > > > >> >> > org.apache.hadoop.hbase.regionserver.HRegionServer: Got > ZooKeeper > > > >> event, > > > >> >> > state: Disconnected, type: None, path: null > > > >> >> > 2009-11-11 04:24:54,118 INFO org.apache.zookeeper.ClientCnxn: > > > >> Attempting > > > >> >> > connection to server superpyxis0017/192.168.100.131:2181 > > > >> >> > 2009-11-11 04:24:54,118 INFO org.apache.zookeeper.ClientCnxn: > > > Priming > > > >> >> > connection to java.nio.channels.SocketChannel[connected local=/ > > > >> >> > 192.168.100.131:46903 remote=superpyxis0017/ > 192.168.100.131:2181 > > ] > > > >> >> > 2009-11-11 04:24:54,118 INFO org.apache.zookeeper.ClientCnxn: > > > Server > > > >> >> > connection successful > > > >> >> > 2009-11-11 04:24:54,118 WARN org.apache.zookeeper.ClientCnxn: > > > >> Exception > > > >> >> > closing session 0x24dfd90c810002 to > > > >> sun.nio.ch.selectionkeyi...@2dd78ccf > > > >> >> > java.io.IOException: Read error rc = -1 > > > >> java.nio.DirectByteBuffer[pos=0 > > > >> >> > lim=4 cap=4] > > > >> >> > at > > > >> >> > > > > org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701) > > > >> >> > at > > > >> >> > > > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945) > > > >> >> > 2009-11-11 04:24:54,118 WARN org.apache.zookeeper.ClientCnxn: > > > Ignoring > > > >> >> > exception during shutdown input > > > >> >> > java.net.SocketException: Transport endpoint is not connected > > > >> >> > at sun.nio.ch.SocketChannelImpl.shutdown(Native Method) > > > >> >> > at > > > >> >> > > > > sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640) > > > >> >> > at > > > >> sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360) > > > >> >> > at > > > >> >> > > > > >> > > org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999) > > > >> >> > at > > > >> >> > > > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970) > > > >> >> > > > > >> >> > > > > >> >> > > > > >> >> > > > > >> >> > > > > >> >> > On Wed, Nov 11, 2009 at 1:40 PM, Jean-Daniel Cryans < > > > >> [email protected] > > > >> >> >wrote: > > > >> >> > > > > >> >> >> I'm really not sure exactly what the situation is with the > > > >> >> >> java.nio.BufferUnderflowException, but what do you mean by > > > >> >> >> "regionserver disconnection"? Could you give more details? > > > >> >> >> > > > >> >> >> Thx, > > > >> >> >> > > > >> >> >> J-D > > > >> >> >> > > > >> >> >> On Wed, Nov 11, 2009 at 10:34 AM, Zhenyu Zhong < > > > >> [email protected] > > > >> >> > > > > >> >> >> wrote: > > > >> >> >> > I searched around and found HBASE-1645 "HBase could be a > > little > > > >> more > > > >> >> >> dainty > > > >> >> >> > when zk is pulled out from under it" > > > >> >> >> > > > > >> >> >> > I am wondering whether my problem is related to this > > HBase-1645. > > > Or > > > >> >> >> probably > > > >> >> >> > has nothing to do with it? > > > >> >> >> > > > > >> >> >> > thanks > > > >> >> >> > > > > >> >> >> > > > > >> >> >> > On Wed, Nov 11, 2009 at 12:02 PM, Zhenyu Zhong < > > > >> >> [email protected] > > > >> >> >> >wrote: > > > >> >> >> > > > > >> >> >> >> Hi, > > > >> >> >> >> > > > >> >> >> >> I have been monitoring the system resources such as memory > > Heap > > > >> usage > > > >> >> >> for > > > >> >> >> >> Regionserver, load average, network bandwidth, etc. > > > >> >> >> >> > > > >> >> >> >> I experienced a disconnection last night. When I checked > the > > > >> memory > > > >> >> Heap > > > >> >> >> >> for Regionserver, it was 2336 which is way below the max > Heap > > > >> 4096. I > > > >> >> >> don't > > > >> >> >> >> think it is Regionserver problem. The requests that the > > > >> regionserver > > > >> >> got > > > >> >> >> was > > > >> >> >> >> about 126 which doesn't seem to be high. The load average > is > > > kind > > > >> of > > > >> >> >> high > > > >> >> >> >> >100. (but I saw disconnections while the load average is > > > around > > > >> 30). > > > >> >> >> The > > > >> >> >> >> bandwidth is not a problem. > > > >> >> >> >> > > > >> >> >> >> However, I found error messages from the zookeeper. Please > > see > > > the > > > >> >> >> >> following error messages. There are some > > > >> >> >> >> BufferUnderflowException error, IOException error,etc. > > > >> >> >> >> May I ask what might be the causes that generate these > error > > > >> >> messages? > > > >> >> >> >> > > > >> >> >> >> Many thanks > > > >> >> >> >> zhenyu > > > >> >> >> >> > > > >> >> >> >> > > > >> >> >> >> 2009-11-11 04:29:00,810 WARN > > > org.apache.zookeeper.server.Request: > > > >> >> >> Ignoring > > > >> >> >> >> exception during toString > > > >> >> >> >> java.nio.BufferUnderflowException > > > >> >> >> >> at > > java.nio.HeapByteBuffer.get(HeapByteBuffer.java:127) > > > >> >> >> >> at java.nio.ByteBuffer.get(ByteBuffer.java:675) > > > >> >> >> >> at > > > >> >> >> org.apache.zookeeper.server.Request.toString(Request.java:199) > > > >> >> >> >> at java.lang.String.valueOf(String.java:2827) > > > >> >> >> >> at > > > java.lang.StringBuilder.append(StringBuilder.java:115) > > > >> >> >> >> at > > > >> >> >> >> > > > >> >> >> > > > >> >> > > > >> > > > > > > org.apache.zookeeper.server.quorum.CommitProcessor.processRequest(CommitProcessor.java:167) > > > >> >> >> >> at > > > >> >> >> >> > > > >> >> >> > > > >> >> > > > >> > > > > > > org.apache.zookeeper.server.quorum.FollowerRequestProcessor.run(FollowerRequestProcessor.java:68) > > > >> >> >> >> 2009-11-11 04:29:00,810 WARN > > > >> >> org.apache.zookeeper.server.NIOServerCnxn: > > > >> >> >> >> Exception causing close of session 0x424dfd908c50009 > > > >> >> >> >> due to java.io.IOException: Read error > > > >> >> >> >> 2009-11-11 04:29:00,810 DEBUG > > > >> >> org.apache.zookeeper.server.NIOServerCnxn: > > > >> >> >> >> IOException stack trace > > > >> >> >> >> java.io.IOException: Read error > > > >> >> >> >> at > > > >> >> >> >> > > > >> >> > > > org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:372) > > > >> >> >> >> at > > > >> >> >> >> > > > >> >> >> > > > >> >> > > > >> > > > > > > org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:221) > > > >> >> >> >> 2009-11-11 04:29:00,810 DEBUG > > > >> >> >> >> org.apache.zookeeper.server.quorum.CommitProcessor: > > Processing > > > >> >> request:: > > > >> >> >> >> sessionid:0x424dfd908c50009 type:setWatches > > > >> cxid:0xfffffffffffffff8 > > > >> >> >> >> zxid:0xfffffffffffffffe txntype:unknown n/a > > > >> >> >> >> 2009-11-11 04:29:00,810 WARN > > > org.apache.zookeeper.server.Request: > > > >> >> >> Ignoring > > > >> >> >> >> exception during toString > > > >> >> >> >> java.nio.BufferUnderflowException > > > >> >> >> >> at > > java.nio.HeapByteBuffer.get(HeapByteBuffer.java:127) > > > >> >> >> >> at java.nio.ByteBuffer.get(ByteBuffer.java:675) > > > >> >> >> >> at > > > >> >> >> org.apache.zookeeper.server.Request.toString(Request.java:199) > > > >> >> >> >> at java.lang.String.valueOf(String.java:2827) > > > >> >> >> >> at > > > java.lang.StringBuilder.append(StringBuilder.java:115) > > > >> >> >> >> at > > > >> >> >> >> > > > >> >> >> > > > >> >> > > > >> > > > > > > org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:74) > > > >> >> >> >> at > > > >> >> >> >> > > > >> >> >> > > > >> >> > > > >> > > > > > > org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73) > > > >> >> >> >> 2009-11-11 04:29:00,810 DEBUG > > > >> >> >> >> org.apache.zookeeper.server.FinalRequestProcessor: > Processing > > > >> >> request:: > > > >> >> >> >> sessionid:0x424dfd908c50009 type:setWatches > > > >> cxid:0xfffffffffffffff8 > > > >> >> >> >> zxid:0xfffffffffffffffe txntype:unknown n/a > > > >> >> >> >> 2009-11-11 04:29:00,810 WARN > > > org.apache.zookeeper.server.Request: > > > >> >> >> Ignoring > > > >> >> >> >> exception during toString > > > >> >> >> >> java.nio.BufferUnderflowException > > > >> >> >> >> at > > java.nio.HeapByteBuffer.get(HeapByteBuffer.java:127) > > > >> >> >> >> at java.nio.ByteBuffer.get(ByteBuffer.java:675) > > > >> >> >> >> at > > > >> >> >> org.apache.zookeeper.server.Request.toString(Request.java:199) > > > >> >> >> >> at > > > >> >> >> >> > > > >> > org.apache.log4j.or.DefaultRenderer.doRender(DefaultRenderer.java:36) > > > >> >> >> >> at > > > >> >> >> >> > > > org.apache.log4j.or.RendererMap.findAndRender(RendererMap.java:80) > > > >> >> >> >> at > > > >> >> >> >> > > > >> >> >> > > > >> >> > > > >> > > > > > > org.apache.log4j.spi.LoggingEvent.getRenderedMessage(LoggingEvent.java:362) > > > >> >> >> >> at > > > >> >> >> >> > > > >> >> >> > > > >> >> > > > >> > > > > > > org.apache.log4j.helpers.PatternParser$BasicPatternConverter.convert(PatternParser.java:403) > > > >> >> >> >> at > > > >> >> >> >> > > > >> >> >> > > > >> >> > > > >> > > > > > > org.apache.log4j.helpers.PatternConverter.format(PatternConverter.java:65) > > > >> >> >> >> at > > > >> >> org.apache.log4j.PatternLayout.format(PatternLayout.java:502) > > > >> >> >> >> at > > > >> >> >> >> > > > org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:302) > > > >> >> >> >> at > > > >> >> >> >> > > > >> >> >> > > > >> >> > > > >> > > > > > > org.apache.log4j.DailyRollingFileAppender.subAppend(DailyRollingFileAppender.java:359) > > > >> >> >> >> at > > > >> >> >> > org.apache.log4j.WriterAppender.append(WriterAppender.java:160) > > > >> >> >> >> at > > > >> >> >> >> > > > >> > org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251) > > > >> >> >> >> at > > > >> >> >> >> > > > >> >> >> > > > >> >> > > > >> > > > > > > org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66) > > > >> >> >> >> at > > > >> org.apache.log4j.Category.callAppenders(Category.java:206) > > > >> >> >> >> at > > > org.apache.log4j.Category.forcedLog(Category.java:391) > > > >> >> >> >> at > org.apache.log4j.Category.debug(Category.java:260) > > > >> >> >> >> at > > > >> >> >> >> > > > >> >> >> > > > >> >> > > > >> > > > > > > org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:147) > > > >> >> >> >> at > > > >> >> >> >> > > > >> >> >> > > > >> >> > > > >> > > > > > > org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73) > > > >> >> >> >> 2009-11-11 04:29:00,810 WARN > > > org.apache.zookeeper.server.Request: > > > >> >> >> Ignoring > > > >> >> >> >> exception during toString > > > >> >> >> >> sessionid:0x424dfd908c50009 type:setWatches > > > >> cxid:0xfffffffffffffff8 > > > >> >> >> >> zxid:0xfffffffffffffffe txntype:unknown n/a > > > >> >> >> >> 2009-11-11 04:29:00,810 INFO > > > >> >> org.apache.zookeeper.server.NIOServerCnxn: > > > >> >> >> >> closing session:0x424dfd908c50009 NIOServerCnxn: > > > >> >> >> >> java.nio.channels.SocketChannel[connected local=/ > > > >> >> 192.168.100.131:2181 > > > >> >> >> remote=/ > > > >> >> >> >> 192.168.100.131:35500] > > > >> >> >> >> 2009-11-11 04:29:00,810 DEBUG > > > >> >> org.apache.zookeeper.server.NIOServerCnxn: > > > >> >> >> >> ignoring exception during input shutdown > > > >> >> >> >> java.net.SocketException: Transport endpoint is not > connected > > > >> >> >> >> at sun.nio.ch.SocketChannelImpl.shutdown(Native > > Method) > > > >> >> >> >> at > > > >> >> >> >> > > > >> >> > > > sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640) > > > >> >> >> >> at > > > >> >> >> sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360) > > > >> >> >> >> at > > > >> >> >> >> > > > >> >> > > > org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:848) > > > >> >> >> >> at > > > >> >> >> >> > > > >> >> > > > org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:498) > > > >> >> >> >> at > > > >> >> >> >> > > > >> >> >> > > > >> >> > > > >> > > > > > > org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:221) > > > >> >> >> >> > > > >> >> >> > > > > >> >> >> > > > >> >> > > > > >> >> > > > >> > > > > >> > > > > > > > > > >
