I've seen this before: it's a GC pause that took more than a minute, long enough to time out the scanner leases and the ZooKeeper session.
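To confirm it, turn on GC logging in hbase-env.sh (add -verbose:gc -XX:+PrintGCDetails -Xloggc:/path/to/gc.log to HBASE_OPTS) and look for a single collection covering that minute around 11:32. You can also buy headroom by raising the ZooKeeper session timeout and the scanner lease period (zookeeper.session.timeout and hbase.regionserver.lease.period in hbase-site.xml, if I'm remembering the property names right), but the real fix is keeping pauses short. If you want an independent check from inside the process, a tiny probe thread like the one below will catch any stop-the-world pause, GC or otherwise. This is just a sketch I'm typing into the mail; the class name and thresholds are made up, it's not an HBase class:

// PauseProbe: run this as a daemon thread inside the region server JVM.
// It sleeps for a short, fixed interval; if the wall clock says far more
// time elapsed than the sleep asked for, the whole JVM was stopped, which
// with heaps this size is almost always a stop-the-world collection.
public class PauseProbe implements Runnable {
  private static final long INTERVAL_MS = 500;  // probe period
  private static final long WARN_MS = 5000;     // report pauses above this

  public void run() {
    long last = System.currentTimeMillis();
    while (true) {
      try {
        Thread.sleep(INTERVAL_MS);
      } catch (InterruptedException e) {
        return;
      }
      long now = System.currentTimeMillis();
      long stall = now - last - INTERVAL_MS;   // time beyond the requested sleep
      if (stall > WARN_MS) {
        System.err.println("JVM unresponsive for ~" + stall + "ms");
      }
      last = now;
    }
  }
}

If that prints a ~65000ms stall at the same instant the HLog warnings report 65190ms appends, the JVM was frozen and HDFS is off the hook. More on the offline split parents below the quote.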
J-D

On Mon, Jan 25, 2010 at 7:14 PM, Kannan Muthukkaruppan <kan...@facebook.com> wrote:
> I was doing some brute force testing - running one instance of
> PerformanceEvaluation (PE) for writes, and another instance for randomReads.
>
> One of the region servers went down after a while. [This is on 0.20.2.] The
> region server log had things like:
>
> 2010-01-25 11:33:39,416 WARN org.apache.hadoop.hbase.regionserver.HLog: IPC
> Server handler 34 on 60020 took 65190ms appending an edit to hlog;
> editcount=27878
> 2010-01-25 11:33:39,416 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
> Cache Stats: Sizes: Total=1815.4441MB (1903631104), Free=581.23096MB
> (609464832), Max=2396.675MB (2513095936), Counts: Blocks=28216,
> Access=10904689, Hit=9313612, Miss=1591077, Evictions=110, Evicted=414336,
> Ratios: Hit Ratio=85.40924191474915%, Miss Ratio=14.590759575366974%,
> Evicted/Run=3766.69091796875
> 2010-01-25 11:33:39,417 WARN org.apache.hadoop.hbase.regionserver.HLog: IPC
> Server handler 52 on 60020 took 65195ms appending an edit to hlog;
> editcount=27879
> 2010-01-25 11:33:39,417 WARN org.apache.hadoop.hbase.regionserver.HLog: IPC
> Server handler 75 on 60020 took 65195ms appending an edit to hlog;
> editcount=27880
> 2010-01-25 11:33:39,417 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
> -5220035164735014996 lease expired
>
> This was followed by:
> 2010-01-25 11:33:39,636 WARN org.apache.zookeeper.ClientCnxn: Exception
> closing session 0x12652c620520001 to sun.nio.ch.selectionkeyi...@788ab708
> java.io.IOException: TIMED OUT
>
> And also errors like:
> 2010-01-25 11:33:39,692 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> org.apache.hadoop.hbase.UnknownScannerException: Name: -5220035164735014996
>
> Eventually, the region server shut down:
> 2010-01-25 11:33:40,293 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
> handler 85 on 60020, call get([...@2eec23d2, row=0007555231, maxVersions=1,
> timeRange=[0,9223372036854775807), families={(family=info, columns={data}})
> from 10.131.1.186:51645: error: java.io.IOException: Server not running,
> aborting
>
> It looks like the region server got overloaded for some reason and was not
> able to finish the scan (I am assuming this is the periodic meta/root
> region scan initiated by the master).
>
> Any thoughts/ideas on the above?
>
> Subsequently, I restarted the region server and just started the PE in
> sequentialWrite mode. It seems to be off to a really slow start. For
> several regions, it is printing the following:
>
> 10/01/25 18:12:44 DEBUG client.HConnectionManager$TableServers:
> locateRegionInMeta attempt 0 of 10 failed; retrying after sleep of 1000
> because: region offline: TestTable,0031841132,1264121653303
> 10/01/25 18:12:44 DEBUG client.HConnectionManager$TableServers: Removed
> .META.,,1 for tableName=.META. from cache because of
> TestTable,0032000000,99999999999999
> 10/01/25 18:12:44 DEBUG client.HConnectionManager$TableServers: Cached
> location for .META.,,1 is 10.129.68.212:60020
> 10/01/25 18:12:45 DEBUG client.HConnectionManager$TableServers:
> locateRegionInMeta attempt 1 of 10 failed; retrying after sleep of 1000
> because: region offline: TestTable,0031841132,1264121653303
> 10/01/25 18:12:45 DEBUG client.HConnectionManager$TableServers: Removed
> .META.,,1 for tableName=.META. from cache because of
> TestTable,0032000000,99999999999999
> [..and so on..]
> 10/01/25 18:12:59 DEBUG client.HConnectionManager$TableServers: Cached
> location for .META.,,1 is 10.129.68.212:60020
> 10/01/25 18:13:07 DEBUG client.HConnectionManager$TableServers:
> locateRegionInMeta attempt 8 of 10 failed; retrying after sleep of 16000
> because: region offline: TestTable,0031841132,1264121653303
> 10/01/25 18:13:07 DEBUG client.HConnectionManager$TableServers: Removed
> .META.,,1 for tableName=.META. from cache because of
> TestTable,0032000000,99999999999999
>
> When I scan '.META.', I noticed that most of the regions for which I get
> the above error were in the middle of (?) a split. For example:
>
> TestTable,0031841132,1264121653303
>   column=info:regioninfo, timestamp=1264470099796, value=REGION => {NAME =>
>   'TestTable,0031841132,1264121653303', STARTKEY => '0031841132', ENDKEY =>
>   '0032057774', ENCODED => 644500792, OFFLINE => true, SPLIT => true,
>   TABLE => {{NAME => 'TestTable', FAMILIES => [{NAME => 'info', VERSIONS =>
>   '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536',
>   IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
> TestTable,0031841132,1264121653303
>   column=info:server, timestamp=1264121655148, value=10.129.68.214:60020
> TestTable,0031841132,1264121653303
>   column=info:serverstartcode, timestamp=1264121655148, value=1264109117245
> TestTable,0031841132,1264121653303
>   column=info:splitA, timestamp=1264470099796,
>   value=\000\n0031978284\000\000\000\001&g\016\357\277\275\024\"TestTable,0031841132,1264449401108
>   \000\n0031841132\000\000\000\005\tTestTable\000\000\000\000\000\002\000\000\000\aIS_ROOT
>   \000\000\000\005false\000\000\000\aIS_META\000\000\000\005false\000\000\000\001\a\004info
>   \000\000\000\a\000\000\000\vBLOOMFILTER\000\000\000\005false\000\000\000\vCOMPRESSION
>   \000\000\000\004NONE\000\000\000\bVERSIONS\000\000\000\0013\000\000\000\003TTL
>   \000\000\000\n2147483647\000\000\000\tBLOCKSIZE\000\000\000\00565536\000\000\000\tIN_MEMORY
>   \000\000\000\005false\000\000\000\nBLOCKCACHE\000\000\000\004true\357\277\275\337\274#
> TestTable,0031841132,1264121653303
>   column=info:splitB, timestamp=1264470099796,
>   value=\000\n0032057774\000\000\000\001&g\016\357\277\275\024\"TestTable,0031978284,1264449401108
>   \000\n0031978284\000\000\000\005\tTestTable\000\000\000\000\000\002\000\000\000\aIS_ROOT
>   \000\000\000\005false\000\000\000\aIS_META\000\000\000\005false\000\000\000\001\a\004info
>   \000\000\000\a\000\000\000\vBLOOMFILTER\000\000\000\005false\000\000\000\vCOMPRESSION
>   \000\000\000\004NONE\000\000\000\bVERSIONS\000\000\000\0013\000\000\000\003TTL
>   \000\000\000\n2147483647\000\000\000\tBLOCKSIZE\000\000\000\00565536\000\000\000\tIN_MEMORY
>   \000\000\000\005false\000\000\000\nBLOCKCACHE\000\000\000\004truej\357\277\275\312\262
>
> Any thoughts/ideas on what might be going on? Appreciate any input.
>
> Regards,
> Kannan
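Regarding the slow restart: the retries line up with what you found in .META., since the parent TestTable,0031841132,1264121653303 is marked OFFLINE => true, SPLIT => true and the client keeps landing on it until it learns about the daughters. If you want to count how many split parents are still sitting in .META. without eyeballing shell output, a small client program can do the same scan. Rough sketch against the 0.20 client API, typed from memory, so treat the details as approximate:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Writables;

// Rough sketch (class name is mine): scan .META. and print every region
// still flagged as an offline split parent.  These rows linger after a
// split until the master cleans them up, and clients that cached the
// parent have to retry before they find the daughters.
public class FindSplitParents {
  public static void main(String[] args) throws Exception {
    HTable meta = new HTable(new HBaseConfiguration(), ".META.");
    Scan scan = new Scan();
    scan.addColumn(Bytes.toBytes("info"), Bytes.toBytes("regioninfo"));
    ResultScanner results = meta.getScanner(scan);
    try {
      for (Result r : results) {
        byte[] raw = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("regioninfo"));
        if (raw == null) continue;
        HRegionInfo info = Writables.getHRegionInfo(raw);  // deserialize the cell
        if (info.isOffline() && info.isSplit()) {
          System.out.println("split parent: " + info.getRegionNameAsString());
        }
      }
    } finally {
      results.close();
    }
  }
}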