Keith Turner created ACCUMULO-2677:
--------------------------------------

             Summary: Single node bottleneck during map reduce
                 Key: ACCUMULO-2677
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2677
             Project: Accumulo
          Issue Type: Improvement
    Affects Versions: 1.4.0
         Environment: 1.6.0-RC2, Hadoop 2.2.0, AWS 20 node cluster
            Reporter: Keith Turner
             Fix For: 1.7.0


While running the verification map reduce job as part of the continuous ingest 
test, I noticed the map phase was taking longer than expected.  I had run 24 
hours of ingest followed by verification.  There were 2048 tablets and ~32B 
entries.  The listscans command showed that a lot of mappers were reading from 
one node.  That single tserver was thrashing and had a much lower aggregate 
read rate than tservers that had only a few mappers reading (roughly 35KV/s 
vs 150KV/s).

Below is the output of listscans.

{noformat}
root@test160> listscans
 TABLET SERVER        | CLIENT               | AGE      | LAST     | STATE  | 
TYPE  | USER    | TABLE   | COLUMNS   | AUTHORIZATIONS      | TABLET    | 
ITERATORS  | ITERATOR OPTIONS
    ip-10-1-2-15:9997 |      10.1.2.14:35838 |    2m47s |      5ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;121e33;120f25 
|        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.28:42586 |    2h33m |    248ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;422d5;421e3d | 
       [] | {}
    ip-10-1-2-23:9997 |      10.1.2.18:60511 |    2h53m |    193ms | QUEUED 
|SINGLE |    root |      ci |        [] |                     |3;554b7;553c5e | 
       [] | {}
    ip-10-1-2-23:9997 |      10.1.2.13:40589 |    2h19m |    246ms | QUEUED 
|SINGLE |    root |      ci |        [] |                     |3;7f1e43e;7f0f3 
|        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.25:55164 |   56m18s |     73ms | QUEUED 
|SINGLE |    root |      ci |        [] |                     |3;1bf149b;1be238 
|        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.28:42618 |    2h26m |    263ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;555a83;554b7 | 
       [] | {}
    ip-10-1-2-23:9997 |      10.1.2.17:60869 |    1h47m |    131ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;42e206d;42d2f 
|        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.14:59576 |    4m31s |     71ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;225a77;224b6be 
|        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.27:35342 |     3h1m |    252ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;6587a;65789e | 
       [] | {}
    ip-10-1-2-23:9997 |      10.1.2.16:36073 |    2h31m |    131ms | QUEUED 
|SINGLE |    root |      ci |        [] |                     |3;41f107;41e1f | 
       [] | {}
    ip-10-1-2-23:9997 |      10.1.2.13:40526 |    2h29m |    350ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;423c6;422d5 |  
      [] | {}
    ip-10-1-2-23:9997 |      10.1.2.18:60560 |    2h37m |    344ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;424b6f;423c6 | 
       [] | {}
    ip-10-1-2-23:9997 |      10.1.2.29:45044 |    1h17m |    253ms | QUEUED 
|SINGLE |    root |      ci |        [] |                     |3;1d0048;1cf14 | 
       [] | {}
    ip-10-1-2-23:9997 |      10.1.2.20:48103 |    3h12m |    277ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     
|3;400f13;4000000000000004 |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.16:36053 |    2h33m |    230ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;28b4f;28a5e2 | 
       [] | {}
    ip-10-1-2-23:9997 |      10.1.2.15:39470 |    2h57m |    269ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;2787b;2778a8 | 
       [] | {}
    ip-10-1-2-23:9997 |      10.1.2.26:53819 |    3h27m |    449ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;430f37;43002ac 
|        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.17:32894 |     1h9m |     31ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;224b6be;223c5c 
|        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.16:36351 |   49m54s |    263ms | QUEUED 
|SINGLE |    root |      ci |        [] |                     |3;5eb4fc;5ea5e7a 
|        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.20:48227 |    2h46m |    116ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;5b4b7;5b3c68e 
|        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.15:39676 |    1h57m |    262ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;5e96d;5e87bd | 
       [] | {}
    ip-10-1-2-23:9997 |      10.1.2.14:58104 |    2h15m |    245ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;545a7f;544b7 | 
       [] | {}
    ip-10-1-2-23:9997 |      10.1.2.27:35745 |    1h52m |    231ms | QUEUED 
|SINGLE |    root |      ci |        [] |                     |3;417895c;41698 
|        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.22:40331 |    2h30m |    192ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;54c3d4f;54b4c 
|        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.17:60923 |    1h32m |    261ms | QUEUED 
|SINGLE |    root |      ci |        [] |                     |3;5f004fc;5ef13f 
|        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.28:42506 |    2h54m |    117ms | QUEUED 
|SINGLE |    root |      ci |        [] |                     |3;404b67;403c5 | 
       [] | {}
    ip-10-1-2-23:9997 |      10.1.2.22:40342 |    2h29m |     34ms | QUEUED 
|SINGLE |    root |      ci |        [] |                     |3;6bc3f;6bb4e5d 
|        [] | {}
    ip-10-1-2-26:9997 |      10.1.2.16:45905 | 21s291ms |  6s841ms |   IDLE 
|SINGLE |    root |      ci |        [] |                     |3;4396ab;43879c 
|        [] | {}
    ip-10-1-2-18:9997 |      10.1.2.26:48600 |     2m2s |      5ms |   IDLE 
|SINGLE |    root |      ci |        [] |                     |3;4b1e32;4b0f22 
|        [] | {}
    ip-10-1-2-20:9997 |      10.1.2.21:36546 |    2m18s |  7s920ms |   IDLE 
|SINGLE |    root |      ci |        [] |                     |3;601e91;600f83 
|        [] | {}
{noformat}

Below is the output ~20 min later.

{noformat}
root@test160> listscans
 TABLET SERVER        | CLIENT               | AGE      | LAST     | STATE  | 
TYPE  | USER    | TABLE   | COLUMNS   | AUTHORIZATIONS      | TABLET    | 
ITERATORS  | ITERATOR OPTIONS
    ip-10-1-2-15:9997 |      10.1.2.14:36125 |    3m10s |      3ms |   IDLE 
|SINGLE |    root |      ci |        [] |                     |3;1c4ba5a;1c3c9 
|        [] | {}
    ip-10-1-2-14:9997 |      10.1.2.16:35327 |     5m9s |      1ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;1b0f58c;1b004c 
|        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.28:42586 |    2h54m |    509ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;422d5;421e3d | 
       [] | {}
    ip-10-1-2-23:9997 |      10.1.2.13:40589 |    2h40m |    251ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;7f1e43e;7f0f3 
|        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.25:55164 |    1h17m |     26ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;1bf149b;1be238 
|        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.28:42618 |    2h47m |    455ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;555a83;554b7 | 
       [] | {}
    ip-10-1-2-23:9997 |      10.1.2.17:60869 |     2h8m |    352ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;42e206d;42d2f 
|        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.14:59576 |   25m43s |    112ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;225a77;224b6be 
|        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.27:35342 |    3h22m |    113ms | QUEUED 
|SINGLE |    root |      ci |        [] |                     |3;6587a;65789e | 
       [] | {}
    ip-10-1-2-23:9997 |      10.1.2.16:36073 |    2h52m |    299ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;41f107;41e1f | 
       [] | {}
    ip-10-1-2-23:9997 |      10.1.2.13:40526 |    2h50m |     71ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;423c6;422d5 |  
      [] | {}
    ip-10-1-2-23:9997 |      10.1.2.18:60560 |    2h58m |    160ms | QUEUED 
|SINGLE |    root |      ci |        [] |                     |3;424b6f;423c6 | 
       [] | {}
    ip-10-1-2-23:9997 |      10.1.2.29:45044 |    1h39m |    426ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;1d0048;1cf14 | 
       [] | {}
    ip-10-1-2-23:9997 |      10.1.2.16:36053 |    2h54m |    184ms | QUEUED 
|SINGLE |    root |      ci |        [] |                     |3;28b4f;28a5e2 | 
       [] | {}
    ip-10-1-2-23:9997 |      10.1.2.26:53819 |    3h48m |    263ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;430f37;43002ac 
|        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.17:32894 |    1h30m |    163ms | QUEUED 
|SINGLE |    root |      ci |        [] |                     |3;224b6be;223c5c 
|        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.16:36351 |    1h11m |    180ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;5eb4fc;5ea5e7a 
|        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.20:48227 |     3h7m |    317ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;5b4b7;5b3c68e 
|        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.15:39676 |    2h19m |    160ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;5e96d;5e87bd | 
       [] | {}
    ip-10-1-2-23:9997 |      10.1.2.14:58104 |    2h36m |    238ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;545a7f;544b7 | 
       [] | {}
    ip-10-1-2-23:9997 |      10.1.2.27:35745 |    2h13m |    162ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;417895c;41698 
|        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.22:40331 |    2h51m |     72ms | QUEUED 
|SINGLE |    root |      ci |        [] |                     |3;54c3d4f;54b4c 
|        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.17:60923 |    1h53m |     27ms | QUEUED 
|SINGLE |    root |      ci |        [] |                     |3;5f004fc;5ef13f 
|        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.28:42506 |    3h16m |    268ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;404b67;403c5 | 
       [] | {}
    ip-10-1-2-23:9997 |      10.1.2.22:40342 |    2h51m |    239ms | QUEUED 
|SINGLE |    root |      ci |        [] |                     |3;6bc3f;6bb4e5d 
|        [] | {}
    ip-10-1-2-29:9997 |      10.1.2.15:56044 |  4s505ms |      3ms |   IDLE 
|SINGLE |    root |      ci |        [] |                     |3;602da;601e91 | 
       [] | {}
    ip-10-1-2-16:9997 |      10.1.2.21:50234 | 51s534ms |      9ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;20f12;20e218f 
|        [] | {}
    ip-10-1-2-16:9997 |      10.1.2.26:50232 |    3m10s |      5ms |RUNNING 
|SINGLE |    root |      ci |        [] |                     |3;5e4b66;5e3c5 | 
       [] | {}
    ip-10-1-2-16:9997 |      10.1.2.21:50206 |    3m47s |    285ms |   IDLE 
|SINGLE |    root |      ci |        [] |                     |3;1ad31;1ac400b 
|        [] | {}
    ip-10-1-2-28:9997 |      10.1.2.18:38857 | 42s643ms |      6ms |   IDLE 
|SINGLE |    root |      ci |        [] |                     |3;2ef13;2ee229 | 
       [] | {}
    ip-10-1-2-20:9997 |      10.1.2.20:44062 |    1m23s |  4s928ms |   IDLE 
|SINGLE |    root |      ci |        [] |                     |3;6b4b7;6b3c669 
|        [] | {}
{noformat}
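
As a rough aid for reading dumps like the two above, the sketch below tallies scans per tablet server and per state from pasted listscans text. It is not part of Accumulo; the names ROW, tally_scans and busiest are made up for this illustration. It assumes rows are wrapped as they are in this message, with the tablet server, client, age, last and state columns all on the first physical line of each row.

```python
import re
from collections import Counter

# Matches only the first physical line of a wrapped listscans row:
#   tablet server | client | age | last | state
# Header and continuation lines do not start with host:port, so they are skipped.
ROW = re.compile(
    r"^\s*(?P<tserver>\S+:\d+)\s*\|\s*(?P<client>\S+:\d+)\s*\|"
    r"\s*(?P<age>\S+)\s*\|\s*(?P<last>\S+)\s*\|\s*(?P<state>[A-Z]+)\b"
)


def tally_scans(listscans_text):
    """Return {tserver: Counter({state: count})} for pasted listscans output."""
    per_server = {}
    for line in listscans_text.splitlines():
        m = ROW.match(line)
        if m:
            per_server.setdefault(m["tserver"], Counter())[m["state"]] += 1
    return per_server


def busiest(per_server):
    """Return (tserver, total_scans) for the server with the most scans."""
    name, states = max(per_server.items(), key=lambda kv: sum(kv[1].values()))
    return name, sum(states.values())


# Four rows copied from the first dump above, kept wrapped as they appear there.
SAMPLE = """\
 TABLET SERVER        | CLIENT               | AGE      | LAST     | STATE  |
    ip-10-1-2-15:9997 |      10.1.2.14:35838 |    2m47s |      5ms |RUNNING
|SINGLE |    root |      ci |        [] |                     |3;121e33;120f25
|        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.28:42586 |    2h33m |    248ms |RUNNING
|SINGLE |    root |      ci |        [] |                     |3;422d5;421e3d |
       [] | {}
    ip-10-1-2-23:9997 |      10.1.2.18:60511 |    2h53m |    193ms | QUEUED
|SINGLE |    root |      ci |        [] |                     |3;554b7;553c5e |
       [] | {}
    ip-10-1-2-26:9997 |      10.1.2.16:45905 | 21s291ms |  6s841ms |   IDLE
|SINGLE |    root |      ci |        [] |                     |3;4396ab;43879c
|        [] | {}
"""

if __name__ == "__main__":
    counts = tally_scans(SAMPLE)
    for tserver, states in sorted(counts.items()):
        print(tserver, dict(states))
    print("busiest:", busiest(counts))
```

Running tally_scans over a full dump gives a quick per-server view of how many scans are RUNNING, QUEUED or IDLE without counting rows by hand.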

I am not sure what caused things to get into this situation, but I have a 
theory.  While the mappers were running, a single AWS node was rebooted for 
some reason.  This would have caused tablets to migrate.  AccumuloInputFormat 
calculates its locality information up front; if tablets move, mappers will 
run where the tablets used to be.  So maybe a slightly higher than average 
number of tablets started being read from ip-10-1-2-23 as a result of the 
migration.  This caused those mappers to run slower, and over time more 
mappers read from ip-10-1-2-23 and things just snowballed.

Regardless of how this situation occurred, Accumulo should handle it better 
when it does occur.  If a single tablet server has a much higher than average 
number of clients attempting to read for long periods of time, then something 
should be done.  In this case decisions could not be based on the read rate, 
because this tserver had a much lower read rate than other tservers that had 
only 1 or 2 mappers reading.
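
One possible shape for such a check is sketched below. It is only an illustration of the idea, not a proposed Accumulo implementation: the function name, the ratio and min_consecutive parameters, and their default values are all assumptions. It looks at per-tserver scan counts rather than read rates, and it requires the imbalance to persist across several consecutive snapshots. The median is used as the baseline instead of the average (an assumption on my part) so that one overloaded server does not inflate the value it is compared against. Snapshots should include every tserver, including those with zero scans.

```python
from statistics import median


def overloaded_tservers(snapshots, ratio=4.0, min_consecutive=3):
    """Flag tservers with far more active scans than is typical, persistently.

    snapshots: list of {tserver: scan_count} dicts, oldest first.
    A tserver is flagged only if its count exceeded ratio * median in each of
    the last min_consecutive snapshots, so a brief spike is ignored.
    """
    if len(snapshots) < min_consecutive:
        return set()
    flagged = None
    for snap in snapshots[-min_consecutive:]:
        if not snap:
            return set()
        # Floor the baseline at 1 so mostly idle clusters do not flag everything.
        typical = max(median(snap.values()), 1)
        hot = {ts for ts, n in snap.items() if n > ratio * typical}
        flagged = hot if flagged is None else flagged & hot
    return flagged


if __name__ == "__main__":
    # Hypothetical counts, loosely shaped like the situation described above.
    history = [
        {"ts-a": 1, "ts-b": 2, "ts-c": 26, "ts-d": 1},
        {"ts-a": 2, "ts-b": 1, "ts-c": 24, "ts-d": 3},
        {"ts-a": 1, "ts-b": 3, "ts-c": 23, "ts-d": 1},
    ]
    print(overloaded_tservers(history))
```

What to do once a tserver is flagged (for example, migrating some of its tablets) is a separate question that this sketch does not address.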



--
This message was sent by Atlassian JIRA
(v6.2#6252)
