I am trying to run an update (inserts and modify operations) experiment on a
100 TB existing database..
The job is a map-only job that doesn't use TableOutputFormat (since mappers
cannot possibly know the split anyways, so I cant optimize on locality here of
course unless I customize it on my own.. I will be shortly doing that..)..
I have a program that generates Puts on existing rows and new inserts within
a specified range.. I use hadoop streaming to run the program as individual
tasks on specified ranges..
(A side question is if it is better to use TableOutputFormat with reduce jobs:
the cost of sort-and-shuffle operations seem to be the same in both the cases...
In my case, shuffle will happen anyways.. Sorts will happen in each region
server.. The maps are at coarser shard ranges: 6 maps per node as against 170
regions per node.. )
Existing setup:
180 nodes, 6 map tasks per node, 100 TB existing db, 10 TB of updates.. Logs
disabled.. 4 gig regions.. 12 MB write buffer size in each client.. Single
column family and single value of 28 KB for each row: no compression in the
expts (will soon be doing expts with compressed data)..
Results: Around 130 region servers crashed.. All of them had logs of the
following form: The first set of data isnt the root for crashes.. They seem to
be put errors because of a possible socket timeout.. Is that right?
The second set of logs below show the point where it crashed.. It looks like a
zk timeout.. Were the RS's too busy to ping the ZK quorum?
Thank you in advance :)
Vidhya
The logs:
2010-08-04 08:05:42,886 DEBUG
org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested
for region DocData,0000000149473989,1280881832403/571301273 because:
regionserver/63.250.207.82:60020.cacheFlusher2010-08-04 08:05:42,886 DEBUG
org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested
for region DocData,0000000148937993,1280881832423/196751683 because:
regionserver/63.250.207.82:60020.cacheFlusher2010-08-04 08:05:42,886 DEBUG
org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested
for region DocData,0000000147751972,1280881832465/1982268997 because:
regionserver/63.250.207.82:60020.cacheFlusher2010-08-04 08:05:42,900 WARN
org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call
put([...@661cc0d6, [Lorg.apache.hadoop.hbase.client.Put;@69b855f9) from
63.250.207.23:37600: output error
2010-08-04 08:05:42,901 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
Responder, call put([...@460ca540,
[Lorg.apache.hadoop.hbase.client.Put;@683530ad) from 98.136.191.232:59388:
output error
2010-08-04 08:05:42,901 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
Responder, call put([...@631e978e,
[Lorg.apache.hadoop.hbase.client.Put;@6d701e78) from 63.250.207.113:44020:
output error
2010-08-04 08:05:42,913 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
Responder, call put([...@685060b9,
[Lorg.apache.hadoop.hbase.client.Put;@61f39aaa) from 98.136.191.122:40316:
output error
2010-08-04 08:05:42,916 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 87 on 60020 caught: java.nio.channels.ClosedChannelException at
sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1125)
at
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:615)
at
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:679)
at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:943)
2010-08-04 08:05:42,916 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 86 on 60020 caught: java.nio.channels.ClosedChannelException at
sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1125)
at
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:615)
at
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:679)
at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:943)
2010-08-04 08:05:42,916 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 85 on 60020 caught: java.nio.channels.ClosedChannelException
at
sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
at
org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1125)
at
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:615)
at
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:679)
at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:943)
THE SECOND PART OF LOGS WHERE CRASHES OCCURRED...
2010-08-04 09:30:29,512 WARN
org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
org.apache.hadoop.hbase.Leases$LeaseStillHeldException at
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at
org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
at
org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
at
org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:549)
at java.lang.Thread.run(Thread.java:619)
2010-08-04 09:30:29,512 WARN org.apache.zookeeper.ClientCnxn: Exception closing
session 0x32a206703ca0616 to sun.nio.ch.selectionkeyi...@3d2292cd
java.io.IOException: TIMED OUT
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
2010-08-04 09:30:29,512 WARN org.apache.zookeeper.ClientCnxn: Exception closing
session 0x32a206703ca0014 to sun.nio.ch.selectionkeyi...@36f0b7f8
java.io.IOException: TIMED OUT
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
2010-08-04 09:30:29,685 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state:
Disconnected, type: None, path: null2010-08-04 09:30:29,695 DEBUG
org.apache.hadoop.hbase.client.HConnectionManager$ClientZKWatcher: Got
ZooKeeper event, state: Disconnected, type: None, path:
null
2010-08-04 09:30:29,897 INFO org.apache.zookeeper.ClientCnxn: Attempting
connection to server b3122695.yst.yahoo.net/98.136.191.163:2181
2010-08-04 09:30:29,898 INFO org.apache.zookeeper.ClientCnxn: Priming
connection to java.nio.channels.SocketChannel[connected
local=/63.250.207.82:40417 remo
te=b3122695.yst.yahoo.net/98.136.191.163:2181]
2010-08-04 09:30:29,898 INFO org.apache.zookeeper.ClientCnxn: Server connection
successful
2010-08-04 09:30:29,901 DEBUG
org.apache.hadoop.hbase.client.HConnectionManager$ClientZKWatcher: Got
ZooKeeper event, state: Expired, type: None, path: null
2010-08-04 09:30:29,901 WARN org.apache.zookeeper.ClientCnxn: Exception closing
session 0x32a206703ca0616 to sun.nio.ch.selectionkeyi...@7935c2a8
java.io.IOException: Session Expired
at
org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
2010-08-04 09:30:29,901 INFO org.apache.zookeeper.ZooKeeper: Closing session:
0x32a206703ca0616
2010-08-04 09:30:29,901 INFO org.apache.zookeeper.ClientCnxn: Closing
ClientCnxn for session: 0x32a206703ca0616
2010-08-04 09:30:29,948 INFO org.apache.zookeeper.ClientCnxn: Disconnecting
ClientCnxn for session: 0x32a206703ca0616
2010-08-04 09:30:29,949 INFO org.apache.zookeeper.ZooKeeper: Session:
0x32a206703ca0616 closed
2010-08-04 09:30:29,949 DEBUG
org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Closed connection with
ZooKeeper
2010-08-04 09:30:29,949 INFO org.apache.zookeeper.ClientCnxn: EventThread shut
down
2010-08-04 09:30:30,288 INFO org.apache.zookeeper.ClientCnxn: Attempting
connection to server b3122695.yst.yahoo.net/98.136.191.163:2181
2010-08-04 09:30:30,289 INFO org.apache.zookeeper.ClientCnxn: Priming
connection to java.nio.channels.SocketChannel[connected
local=/63.250.207.82:40418 remote=b3122695.yst.yahoo.net/98.136.191.163:2181]
2010-08-04 09:30:30,289 INFO org.apache.zookeeper.ClientCnxn: Server connection
successful
2010-08-04 09:30:30,291 WARN org.apache.zookeeper.ClientCnxn: Exception closing
session 0x32a206703ca0014 to sun.nio.ch.selectionkeyi...@3dc37e55
java.io.IOException: Session Expired
at
org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
2010-08-04 09:30:30,291 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state:
Expired, type: None, path: null
2010-08-04 09:30:30,291 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired
2010-08-04 09:30:30,292 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
request=0.0, regions=162, stores=162, storefiles=280, storefileIndexSize=58,
memstoreSize=1724, compactionQueueSize=63, usedHeap=1877, maxHeap=7999,
blockCacheSize=20643688, blockCacheFree=2495761816, blockCacheCount=0,
blockCacheHitRatio=0, fsReadLatency=0, fsWriteLatency=0, fsSyncLatency=0