Hello,
Thank you for your suggestions.
A few days ago we found that our routing table had some problems; after
adjusting it, we are now sure that the bandwidth is OK.
We have also enabled LZO compression.
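(For reference, enabling LZO on the column families is done along these lines in the HBase shell; this is an illustrative sketch using our 'webpage' table's family names, not a transcript of our session:)

```
hbase> disable 'webpage'
hbase> alter 'webpage', {NAME => 'CF_CONTENT', COMPRESSION => 'LZO'}
hbase> alter 'webpage', {NAME => 'CF_INFORMATION', COMPRESSION => 'LZO'}
hbase> enable 'webpage'
```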
So we started the test program again, but after running normally for 23
hours, the master killed itself. Part of the log follows.
By the way, this time we inserted only 10 webpages per second.
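(The 10-per-second pacing in our test program is nothing special, just a fixed-interval throttle in front of each put, roughly like the sketch below. This is illustrative only; the class name and fields are made up and are not our actual test code:)

```java
// Illustrative fixed-rate throttle (hypothetical, simplified): acquire()
// blocks so that at most ratePerSec calls complete per second.
public class InsertThrottle {
    private final long intervalNanos; // minimum spacing between inserts
    private long nextAllowedNanos;    // earliest time the next insert may run

    public InsertThrottle(int ratePerSec) {
        this.intervalNanos = 1000000000L / ratePerSec;
        this.nextAllowedNanos = System.nanoTime();
    }

    // Blocks the calling thread until the next insert is due. The lock is
    // held while sleeping, which is fine for a single insert thread.
    public synchronized void acquire() {
        long now = System.nanoTime();
        long waitNanos = nextAllowedNanos - now;
        if (waitNanos > 0) {
            try {
                Thread.sleep(waitNanos / 1000000L, (int) (waitNanos % 1000000L));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop waiting, keep status
            }
        }
        nextAllowedNanos = Math.max(nextAllowedNanos, now) + intervalNanos;
    }
}
```

(With ratePerSec = 10 the insert loop calls acquire() before each put, so inserts are spaced at least 100 ms apart.)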
2009-08-14 13:36:31,840 INFO org.apache.hadoop.hbase.master.ServerManager: 4 region servers, 0 dead, average load 48.75
2009-08-14 13:36:32,016 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {server: 192.168.33.5:60020, regionname: .META.,,1, startKey: <>}
2009-08-14 13:36:32,076 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 192.168.33.6:60020, regionname: -ROOT-,,0, startKey: <>}
2009-08-14 13:36:32,084 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 192.168.33.6:60020, regionname: -ROOT-,,0, startKey: <>} complete
2009-08-14 13:36:32,316 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 193 row(s) of meta region {server: 192.168.33.5:60020, regionname: .META.,,1, startKey: <>} complete
2009-08-14 13:36:32,316 INFO org.apache.hadoop.hbase.master.BaseScanner: All
1 .META. region(s) scanned
2009-08-14 13:37:00,366 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x22313002be80001 to sun.nio.ch.selectionkeyi...@4a407c9f
java.io.IOException: Read error rc = -1 java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
        at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:653)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:897)
2009-08-14 13:37:00,881 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server ubuntu3/192.168.33.8:2222
2009-08-14 13:37:04,366 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x22313002be80000 to sun.nio.ch.selectionkeyi...@4ac6ee33
java.io.IOException: Read error rc = -1 java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
        at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:653)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:897)
2009-08-14 13:37:04,721 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server ubuntu2/192.168.33.9:2222
2009-08-14 13:37:08,872 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x22313002be80001 to sun.nio.ch.selectionkeyi...@2e93ebe0
java.io.IOException: TIMED OUT
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:858)
2009-08-14 13:37:08,873 WARN org.apache.zookeeper.ClientCnxn: Ignoring exception during shutdown output
java.net.SocketException: Transport endpoint is not connected
        at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
        at sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:651)
        at sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:368)
        at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:956)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:922)
2009-08-14 13:37:09,486 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server ubuntu2/192.168.33.9:2222
2009-08-14 13:37:12,712 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x22313002be80000 to sun.nio.ch.selectionkeyi...@7162d703
java.io.IOException: TIMED OUT
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:858)
2009-08-14 13:37:12,713 WARN org.apache.zookeeper.ClientCnxn: Ignoring exception during shutdown output
java.net.SocketException: Transport endpoint is not connected
        at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
        at sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:651)
        at sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:368)
        at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:956)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:922)
2009-08-14 13:37:13,032 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server ubuntu3/192.168.33.8:2222
2009-08-14 13:37:17,482 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x22313002be80001 to sun.nio.ch.selectionkeyi...@1012401d
java.io.IOException: TIMED OUT
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:858)
2009-08-14 13:37:17,483 WARN org.apache.zookeeper.ClientCnxn: Ignoring exception during shutdown output
java.net.SocketException: Transport endpoint is not connected
        at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
        at sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:651)
        at sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:368)
        at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:956)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:922)
2009-08-14 13:37:17,856 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server ubuntu7/192.168.33.6:2222
2009-08-14 13:37:19,445 INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/192.168.33.7:40923 remote=ubuntu7/192.168.33.6:2222]
2009-08-14 13:37:19,445 INFO org.apache.zookeeper.ClientCnxn: Server connection successful
2009-08-14 13:37:21,022 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x22313002be80000 to sun.nio.ch.selectionkeyi...@2e101b3a
java.io.IOException: TIMED OUT
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:858)
2009-08-14 13:37:21,023 WARN org.apache.zookeeper.ClientCnxn: Ignoring exception during shutdown output
java.net.SocketException: Transport endpoint is not connected
        at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
        at sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:651)
        at sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:368)
        at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:956)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:922)
2009-08-14 13:37:21,908 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server ubuntu7/192.168.33.6:2222
2009-08-14 13:37:21,908 INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/192.168.33.7:40926 remote=ubuntu7/192.168.33.6:2222]
2009-08-14 13:37:21,909 INFO org.apache.zookeeper.ClientCnxn: Server connection successful
2009-08-14 13:37:21,911 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x22313002be80000 to sun.nio.ch.selectionkeyi...@6bdfe124
java.io.IOException: Session Expired
        at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:548)
        at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:661)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:897)
2009-08-14 13:37:21,912 ERROR org.apache.hadoop.hbase.master.HMaster: Master lost its znode, killing itself now
Regards,
LvZheng
2009/8/6 Zheng Lv <[email protected]>
> Hello,
>   I adjusted the option "zookeeper.session.timeout" to 120000, and then
> restarted the HBase cluster and the test program. After running normally for
> 14 hours, one of the datanodes shut down. When I restarted Hadoop and HBase
> and checked the row count of the table 'webpage', I got 6625, while the test
> program log said there should be at least 885000. Too much data has been
> lost. The end of the datanode log on that server follows.
>
> 2009-08-06 04:28:32,214 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.33.9:45465, dest: /192.168.33.6:50010, bytes: 1214, op: HDFS_WRITE, cliID: DFSClient_1777493426, srvID: DS-1028185837-192.168.33.6-50010-1249268609430, blockid: blk_-402434507207277902_27468
> 2009-08-06 04:28:32,214 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_-402434507207277902_27468 terminating
> 2009-08-06 04:28:32,606 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.33.6:50010, dest: /192.168.33.5:44924, bytes: 446, op: HDFS_READ, cliID: DFSClient_-255011821, srvID: DS-1028185837-192.168.33.6-50010-1249268609430, blockid: blk_-2647720945992878390_27447
> 2009-08-06 04:28:32,612 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.33.6:50010, dest: /192.168.33.5:44925, bytes: 277022, op: HDFS_READ, cliID: DFSClient_-255011821, srvID: DS-1028185837-192.168.33.6-50010-1249268609430, blockid: blk_-2647720945992878390_27447
> 2009-08-06 04:28:32,770 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-5186903983646527212_27469 src: /192.168.33.5:44941 dest: /192.168.33.6:50010
> 2009-08-06 04:29:35,672 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_1888582734643135148_27447 1 Exception
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.33.6:35418 remote=/192.168.33.5:50010]
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>         at java.io.DataInputStream.readFully(DataInputStream.java:178)
>         at java.io.DataInputStream.readLong(DataInputStream.java:399)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:853)
>         at java.lang.Thread.run(Thread.java:619)
> 2009-08-06 04:29:35,673 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_1888582734643135148_27447 terminating
> 2009-08-06 04:29:35,683 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_1888582734643135148_27447 java.io.EOFException: while trying to read 65557 bytes
> 2009-08-06 04:29:35,689 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_1888582734643135148_27447 received exception java.io.EOFException: while trying to read 65557 bytes
> 2009-08-06 04:29:35,689 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.33.6:50010, storageID=DS-1028185837-192.168.33.6-50010-1249268609430, infoPort=50075, ipcPort=50020):DataXceiver
> java.io.EOFException: while trying to read 65557 bytes
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:264)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:308)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:372)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:524)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
>         at java.lang.Thread.run(Thread.java:619)
>
>
>
>
> *************************************
>
>
>
>
> And the following is part of the test program log.
>
> insertting 880000 webpages need 51920792 ms.
> insertting 881000 webpages need 51972741 ms.
> insertting 882000 webpages need 52024775 ms.
> 09/08/06 04:32:20 WARN zookeeper.ClientCnxn: Exception closing session 0x222e91bb6b90002 to sun.nio.ch.selectionkeyi...@527809c6
> java.io.IOException: TIMED OUT
>         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:858)
> 09/08/06 04:32:21 INFO zookeeper.ClientCnxn: Attempting connection to server ubuntu3/192.168.33.8:2222
> 09/08/06 04:32:21 INFO zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/192.168.33.7:52496 remote=ubuntu3/192.168.33.8:2222]
> 09/08/06 04:32:21 INFO zookeeper.ClientCnxn: Server connection successful
> insertting 883000 webpages need 52246380 ms.
> insertting 884000 webpages need 52298370 ms.
> insertting 885000 webpages need 52380479 ms.
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server Some server, retryOnlyOne=true, index=0, islastrow=true, tries=9, numtries=10, i=0, listsize=1, location=address: 192.168.33.5:60020, regioninfo: REGION => {NAME => 'webpage,http:\x2F\x2Fnews.163.com\x2F09\x2F0803\x2F01\x2F5FOO155J0001124J.html1249504151762_879696,1249504267420', STARTKEY => 'http:\x2F\x2Fnews.163.com\x2F09\x2F0803\x2F01\x2F5FOO155J0001124J.html1249504151762_879696', ENDKEY => '', ENCODED => 1607113409, TABLE => {{NAME => 'webpage', FAMILIES => [{NAME => 'CF_CONTENT', COMPRESSION => 'NONE', VERSIONS => '2', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'CF_INFORMATION', COMPRESSION => 'NONE', VERSIONS => '1', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}, region=webpage,http:\x2F\x2Fnews.163.com\x2F09\x2F0803\x2F01\x2F5FOO155J0001124J.html1249504151762_879696,1249504267420 for region webpage,http:\x2F\x2Fnews.163.com\x2F09\x2F0803\x2F01\x2F5FOO155J0001124J.html1249504151762_879696,1249504267420, row 'http:\x2F\x2Fnews.163.com\x2F09\x2F0803\x2F01\x2F5FOO155J0001124J.html1249504668723_885781', but failed after 10 attempts.
> Exceptions:
>
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1041)
>         at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:584)
>         at org.apache.hadoop.hbase.client.HTable.put(HTable.java:450)
>         at hbasetest.HBaseWebpage.insert(HBaseWebpage.java:82)
>         at hbasetest.InsertThread.run(InsertThread.java:26)
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server Some server, retryOnlyOne=true, index=0, islastrow=true, tries=9, numtries=10, i=0, listsize=1, location=address: 192.168.33.5:60020, regioninfo: REGION => {NAME => 'webpage,http:\x2F\x2Fnews.163.com\x2F09\x2F0803\x2F01\x2F5FOO155J0001124J.html1249504151762_879696,1249504267420', STARTKEY => 'http:\x2F\x2Fnews.163.com\x2F09\x2F0803\x2F01\x2F5FOO155J0001124J.html1249504151762_879696', ENDKEY => '', ENCODED => 1607113409, TABLE => {{NAME => 'webpage', FAMILIES => [{NAME => 'CF_CONTENT', COMPRESSION => 'NONE', VERSIONS => '2', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'CF_INFORMATION', COMPRESSION => 'NONE', VERSIONS => '1', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}, region=webpage,http:\x2F\x2Fnews.163.com\x2F09\x2F0803\x2F01\x2F5FOO155J0001124J.html1249504151762_879696,1249504267420 for region webpage,http:\x2F\x2Fnews.163.com\x2F09\x2F0803\x2F01\x2F5FOO155J0001124J.html1249504151762_879696,1249504267420, row 'http:\x2F\x2Fnews.163.com\x2F09\x2F0803\x2F01\x2F5FOO155J0001124J.html1249504754735_885782', but failed after 10 attempts.
> Exceptions:
>
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1041)
>         at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:584)
>         at org.apache.hadoop.hbase.client.HTable.put(HTable.java:450)
>         at hbasetest.HBaseWebpage.insert(HBaseWebpage.java:82)
>         at hbasetest.InsertThread.run(InsertThread.java:26)
> .
> .
> .
> .
> .
> .
> .
>
>
>
> Any suggestions?
> Thanks a lot,
> LvZheng
>
> 2009/8/5 Zheng Lv <[email protected]>
>
> Hi Stack,
>>   Thank you very much for your explanation.
>>   We just adjusted the value of the property "zookeeper.session.timeout"
>> to 120000, and we are observing the system now.
>>   "Are nodes running on same nodes as hbase?" --Do you mean we should
>> have several servers running exclusively for the ZooKeeper cluster? I'm
>> afraid we cannot spare that many servers. Any suggestions?
>>   We don't configure ZooKeeper in zoo.cfg, but in hbase-site.xml. The
>> ZooKeeper-related part of our hbase-site.xml follows.
>> <property>
>> <name>hbase.zookeeper.property.clientPort</name>
>> <value>2222</value>
>> </property>
>>
>> <property>
>> <name>hbase.zookeeper.quorum</name>
>> <value>ubuntu2,ubuntu3,ubuntu7,ubuntu9,ubuntu6</value>
>> </property>
>>
>> <property>
>> <name>zookeeper.session.timeout</name>
>> <value>120000</value>
>> </property>
>>
>> Thanks a lot,
>> LvZheng
>>
>>
>>
>