What J-D said but with ornamentation.  See below.

On Mon, Jan 25, 2010 at 7:14 PM, Kannan Muthukkaruppan
<kan...@facebook.com> wrote:
> I was doing some brute force testing - running one instance of 
> PerformanceEvaluation (PE) for writes, and another instance for randomReads.
>

What's the cluster you're hitting like?  That 5-node thingy?  What's the
hardware profile?


> One of the region servers went down after a while. [This is on 0.20.2]. The 
> region server log had things like:
>
> 2010-01-25 11:33:39,416 WARN org.apache.hadoop.hbase.regionserver.HLog: IPC 
> Server handler 34 on 60020 took 65190ms appending an edit to hlog; 
> editcount=27878

This is saying that it took 65 seconds to append to HDFS.  What was
going on at that time?  A fat GC in the regionserver or over in a
Datanode?  You can enable GC logging by uncommenting the relevant lines
in hbase-env.sh.  Feed the GC log to https://gchisto.dev.java.net/
(suggested by the zookeeper lads).  It's good for finding the long
pauses.  We should look at the logs from around the long GC pause.  It's
probably a failed promotion that brought on the stop-the-world
collection.  Or was your HDFS struggling?
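
From memory, the commented-out line in conf/hbase-env.sh looks about like
the below -- treat it as a sketch, the exact flags in your copy of 0.20.2
may differ:

  export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
      -XX:+PrintGCTimeStamps -Xloggc:$HBASE_HOME/logs/gc-hbase.log"

Restart the regionserver, rerun the load, and then feed the resulting
gc-hbase.log to GCHisto to pick out the multi-second pauses.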

I was going to say that the PE puts the worst kind of loading on the
hbase cache -- nothing sticks around -- but looking at your numbers
below, the cache seems to be working pretty well.

> 2010-01-25 11:33:39,416 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: 
> Cache Stats: Sizes: Total=1815.4441MB (1903631104), Free=581.23096MB 
> (609464832),\
>  Max=2396.675MB (2513095936), Counts: Blocks=28216, Access=10904689, 
> Hit=9313612, Miss=1591077, Evictions=110, Evicted=414336, Ratios: Hit 
> Ratio=85.40924191474\
> 915%, Miss Ratio=14.590759575366974%, Evicted/Run=3766.69091796875
> 2010-01-25 11:33:39,417 WARN org.apache.hadoop.hbase.regionserver.HLog: IPC 
> Server handler 52 on 60020 took 65195ms appending an edit to hlog; 
> editcount=27879
> 2010-01-25 11:33:39,417 WARN org.apache.hadoop.hbase.regionserver.HLog: IPC 
> Server handler 75 on 60020 took 65195ms appending an edit to hlog; 
> editcount=27880
> 2010-01-25 11:33:39,417 INFO 
> org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 
> -5220035164735014996 lease expired
>
> This was followed by:
> 2010-01-25 11:33:39,636 WARN org.apache.zookeeper.ClientCnxn: Exception 
> closing session 0x12652c620520001 to 
> sun.nio.ch.selectionkeyi...@788ab708
> java.io.IOException: TIMED OUT

This is the ZooKeeper session timeout.  The 65-second stall above is longer
than the session timeout (60 seconds by default, if I remember right), so zk
declared the session dead and the regionserver shut itself down.
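
For reference, the session timeout is set in hbase-site.xml; if I have the
property name right for 0.20 it is zookeeper.session.timeout (check your
hbase-default.xml for the exact name and default):

  <property>
    <name>zookeeper.session.timeout</name>
    <!-- milliseconds -->
    <value>60000</value>
  </property>

Raising it buys headroom, but better to chase down the pause itself.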

>
> And also errors like:
> 2010-01-25 11:33:39,692 ERROR 
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> org.apache.hadoop.hbase.UnknownScannerException: Name: -5220035164735014996
>

These will happen after the above.  The regionserver is on its way
down and has probably already emptied its list of outstanding scanners
(note the lease-expired message above), hence the UnknownScannerException.


> Subsequently, I restarted the region server and just started the PE in 
> sequentialWrite mode. It seems to be off to a really slow start. For several 
> regions, it is printing the following:
>
> 10/01/25 18:12:44 DEBUG client.HConnectionManager$TableServers: 
> locateRegionInMeta attempt 0 of 10 failed; retrying after sleep of 1000 
> because: region offline: TestTable,0031841132,1264121653303
> 10/01/25 18:12:44 DEBUG client.HConnectionManager$TableServers: Removed 
> .META.,,1 for tableName=.META. from cache because of 
> TestTable,0032000000,99999999999999
> 10/01/25 18:12:44 DEBUG client.HConnectionManager$TableServers: Cached 
> location for .META.,,1 is 10.129.68.212:60020
> 10/01/25 18:12:45 DEBUG client.HConnectionManager$TableServers: 
> locateRegionInMeta attempt 1 of 10 failed; retrying after sleep of 1000 
> because: region offline: TestTable,0031841132,1264121653303
> 10/01/25 18:12:45 DEBUG client.HConnectionManager$TableServers: Removed 
> .META.,,1 for tableName=.META. from cache because of 
> TestTable,0032000000,99999999999999
> [..and so on..]
> 10/01/25 18:12:59 DEBUG client.HConnectionManager$TableServers: Cached 
> location for .META.,,1 is 10.129.68.212:60020
> 10/01/25 18:13:07 DEBUG client.HConnectionManager$TableServers: 
> locateRegionInMeta attempt 8 of 10 failed; retrying after sleep of 16000 
> because: region offline: TestTable,0031841132,1264121653303
> 10/01/25 18:13:07 DEBUG client.HConnectionManager$TableServers: Removed 
> .META.,,1 for tableName=.META. from cache because of 
> TestTable,0032000000,99999999999999
>
> When I scan the '.META.', I noticed that most of the regions for which I get 
> the above error were in the middle of (?) a split. For example:
>
> TestTable,0031841132,126412 column=info:regioninfo, timestamp=1264470099796, 
> value=REGION => {NAME => 'TestTa
>  1653303                     ble,0031841132,1264121653303', STARTKEY => 
> '0031841132', ENDKEY => '0032057774',
>                             ENCODED => 644500792, OFFLINE => true, SPLIT => 
> true, TABLE => {{NAME => 'TestTab
>                             le', FAMILIES => [{NAME => 'info', VERSIONS => 
> '3', COMPRESSION => 'NONE', TTL =>
>                              '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 
> 'false', BLOCKCACHE => 'true'}]
>                             }}
>  TestTable,0031841132,126412 column=info:server, timestamp=1264121655148, 
> value=10.129.68.214:60020
>  1653303
>  TestTable,0031841132,126412 column=info:serverstartcode, 
> timestamp=1264121655148, value=1264109117245
>  1653303
>  TestTable,0031841132,126412 column=info:splitA, timestamp=1264470099796, 
> value=\000\n0031978284\000\000\000\0
>  1653303                     
> 01&g\016\357\277\275\024\"TestTable,0031841132,1264449401108\000\n0031841132\000\
>                             
> 000\000\005\tTestTable\000\000\000\000\000\002\000\000\000\aIS_ROOT\000\000\000\0
>                             
> 05false\000\000\000\aIS_META\000\000\000\005false\000\000\000\001\a\004info\000\0
>                             
> 00\000\a\000\000\000\vBLOOMFILTER\000\000\000\005false\000\000\000\vCOMPRESSION\0
>                             
> 00\000\000\004NONE\000\000\000\bVERSIONS\000\000\000\0013\000\000\000\003TTL\000\
>                             
> 000\000\n2147483647\000\000\000\tBLOCKSIZE\000\000\000\00565536\000\000\000\tIN_M
>                             
> EMORY\000\000\000\005false\000\000\000\nBLOCKCACHE\000\000\000\004true\357\277\27
>                             5\337\274#
>  TestTable,0031841132,126412 column=info:splitB, timestamp=1264470099796, 
> value=\000\n0032057774\000\000\000\0
>  1653303                     
> 01&g\016\357\277\275\024\"TestTable,0031978284,1264449401108\000\n0031978284\000\
>                             
> 000\000\005\tTestTable\000\000\000\000\000\002\000\000\000\aIS_ROOT\000\000\000\0
>                             
> 05false\000\000\000\aIS_META\000\000\000\005false\000\000\000\001\a\004info\000\0
>                             
> 00\000\a\000\000\000\vBLOOMFILTER\000\000\000\005false\000\000\000\vCOMPRESSION\0
>                             
> 00\000\000\004NONE\000\000\000\bVERSIONS\000\000\000\0013\000\000\000\003TTL\000\
>                             
> 000\000\n2147483647\000\000\000\tBLOCKSIZE\000\000\000\00565536\000\000\000\tIN_M
>                             
> EMORY\000\000\000\005false\000\000\000\nBLOCKCACHE\000\000\000\004truej\357\277\2
>                             75\312\262
>
> Any thoughts/ideas on what might be going on? Appreciate any input in helping 
> understand what might be going on.
>

There is no row after the above?  No new region in .META. with a startkey
of 0031841132?  Then it's a dropped edit of .META.  The parent region --
the one being split -- got the updates that added splitA and splitB and
marked it offline, but the new daughter regions didn't get inserted?  The
crash happened right at that point?
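
One quick check (shell syntax from memory for 0.20; the daughter names are
read off the splitA/splitB values in your scan above):

  hbase> scan '.META.', {STARTROW => 'TestTable,0031841132', LIMIT => 5}

If the split had gone all the way through, you would expect to see rows for
the daughters TestTable,0031841132,1264449401108 and
TestTable,0031978284,1264449401108 there alongside the offlined parent.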

It would be interesting to take a look at the regionserver logs from
that time.  Please post them if you have a moment so we can take a looksee.
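
If the full logs are unwieldy, something along these lines should pull out
the interesting parts first (filename pattern is a guess; adjust for your
install):

  grep -i -e 'split' -e '0031841132' logs/hbase-*-regionserver-*.log

i.e. everything the server logged about splits plus anything that mentions
the parent's start key.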

The above kind of thing is what the master rewrite is about: moving region
state transitions up into zk so that they are atomic across the cluster as
regions move through transitions, rather than, as things currently work, a
multi-row .META. update that might not all go through.

St.Ack

> Regards,
> Kannan
>
>
>
