[ 
https://issues.apache.org/jira/browse/ACCUMULO-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser updated ACCUMULO-2677:
---------------------------------
    Fix Version/s:     (was: 1.7.0)
                   1.8.0

> Single node bottle neck during map reduce
> -----------------------------------------
>
>                 Key: ACCUMULO-2677
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2677
>             Project: Accumulo
>          Issue Type: Improvement
>    Affects Versions: 1.4.0
>         Environment: 1.6.0-RC2, Hadoop 2.2.0, AWS 20 node cluster
>            Reporter: Keith Turner
>             Fix For: 1.8.0
>
>
> While running the verification MapReduce job as part of the continuous
> ingest test, I noticed the map phase was taking longer than expected.  I had
> run 24 hours of ingest followed by verification.  There were 2048 tablets and
> ~32B entries.  The listscans command showed that many mappers were reading
> from one node.  That single tserver was thrashing and had a much lower
> aggregate read rate than tservers that had only a few mappers reading
> (~35KV/s vs 150KV/s).
> Below is the output of listscans:
> {noformat}
> root@test160> listscans
>  TABLET SERVER        | CLIENT               | AGE      | LAST     | STATE  | TYPE  | USER    | TABLE   | COLUMNS   | AUTHORIZATIONS      | TABLET    | ITERATORS  | ITERATOR OPTIONS
>     ip-10-1-2-15:9997 |      10.1.2.14:35838 |    2m47s |      5ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;121e33;120f25 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.28:42586 |    2h33m |    248ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;422d5;421e3d |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.18:60511 |    2h53m |    193ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;554b7;553c5e |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.13:40589 |    2h19m |    246ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;7f1e43e;7f0f3 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.25:55164 |   56m18s |     73ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;1bf149b;1be238 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.28:42618 |    2h26m |    263ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;555a83;554b7 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.17:60869 |    1h47m |    131ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;42e206d;42d2f |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.14:59576 |    4m31s |     71ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;225a77;224b6be |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.27:35342 |     3h1m |    252ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;6587a;65789e |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.16:36073 |    2h31m |    131ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;41f107;41e1f |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.13:40526 |    2h29m |    350ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;423c6;422d5 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.18:60560 |    2h37m |    344ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;424b6f;423c6 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.29:45044 |    1h17m |    253ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;1d0048;1cf14 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.20:48103 |    3h12m |    277ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;400f13;4000000000000004 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.16:36053 |    2h33m |    230ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;28b4f;28a5e2 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.15:39470 |    2h57m |    269ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;2787b;2778a8 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.26:53819 |    3h27m |    449ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;430f37;43002ac |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.17:32894 |     1h9m |     31ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;224b6be;223c5c |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.16:36351 |   49m54s |    263ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;5eb4fc;5ea5e7a |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.20:48227 |    2h46m |    116ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;5b4b7;5b3c68e |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.15:39676 |    1h57m |    262ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;5e96d;5e87bd |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.14:58104 |    2h15m |    245ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;545a7f;544b7 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.27:35745 |    1h52m |    231ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;417895c;41698 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.22:40331 |    2h30m |    192ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;54c3d4f;54b4c |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.17:60923 |    1h32m |    261ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;5f004fc;5ef13f |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.28:42506 |    2h54m |    117ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;404b67;403c5 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.22:40342 |    2h29m |     34ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;6bc3f;6bb4e5d |        [] | {}
>     ip-10-1-2-26:9997 |      10.1.2.16:45905 | 21s291ms |  6s841ms |   IDLE |SINGLE |    root |      ci |        [] |                     |3;4396ab;43879c |        [] | {}
>     ip-10-1-2-18:9997 |      10.1.2.26:48600 |     2m2s |      5ms |   IDLE |SINGLE |    root |      ci |        [] |                     |3;4b1e32;4b0f22 |        [] | {}
>     ip-10-1-2-20:9997 |      10.1.2.21:36546 |    2m18s |  7s920ms |   IDLE |SINGLE |    root |      ci |        [] |                     |3;601e91;600f83 |        [] | {}
> {noformat}
> Below is the output ~20 min later.
> {noformat}
> root@test160> listscans
>  TABLET SERVER        | CLIENT               | AGE      | LAST     | STATE  | TYPE  | USER    | TABLE   | COLUMNS   | AUTHORIZATIONS      | TABLET    | ITERATORS  | ITERATOR OPTIONS
>     ip-10-1-2-15:9997 |      10.1.2.14:36125 |    3m10s |      3ms |   IDLE |SINGLE |    root |      ci |        [] |                     |3;1c4ba5a;1c3c9 |        [] | {}
>     ip-10-1-2-14:9997 |      10.1.2.16:35327 |     5m9s |      1ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;1b0f58c;1b004c |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.28:42586 |    2h54m |    509ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;422d5;421e3d |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.13:40589 |    2h40m |    251ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;7f1e43e;7f0f3 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.25:55164 |    1h17m |     26ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;1bf149b;1be238 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.28:42618 |    2h47m |    455ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;555a83;554b7 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.17:60869 |     2h8m |    352ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;42e206d;42d2f |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.14:59576 |   25m43s |    112ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;225a77;224b6be |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.27:35342 |    3h22m |    113ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;6587a;65789e |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.16:36073 |    2h52m |    299ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;41f107;41e1f |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.13:40526 |    2h50m |     71ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;423c6;422d5 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.18:60560 |    2h58m |    160ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;424b6f;423c6 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.29:45044 |    1h39m |    426ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;1d0048;1cf14 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.16:36053 |    2h54m |    184ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;28b4f;28a5e2 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.26:53819 |    3h48m |    263ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;430f37;43002ac |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.17:32894 |    1h30m |    163ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;224b6be;223c5c |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.16:36351 |    1h11m |    180ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;5eb4fc;5ea5e7a |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.20:48227 |     3h7m |    317ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;5b4b7;5b3c68e |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.15:39676 |    2h19m |    160ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;5e96d;5e87bd |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.14:58104 |    2h36m |    238ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;545a7f;544b7 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.27:35745 |    2h13m |    162ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;417895c;41698 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.22:40331 |    2h51m |     72ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;54c3d4f;54b4c |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.17:60923 |    1h53m |     27ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;5f004fc;5ef13f |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.28:42506 |    3h16m |    268ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;404b67;403c5 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.22:40342 |    2h51m |    239ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;6bc3f;6bb4e5d |        [] | {}
>     ip-10-1-2-29:9997 |      10.1.2.15:56044 |  4s505ms |      3ms |   IDLE |SINGLE |    root |      ci |        [] |                     |3;602da;601e91 |        [] | {}
>     ip-10-1-2-16:9997 |      10.1.2.21:50234 | 51s534ms |      9ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;20f12;20e218f |        [] | {}
>     ip-10-1-2-16:9997 |      10.1.2.26:50232 |    3m10s |      5ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;5e4b66;5e3c5 |        [] | {}
>     ip-10-1-2-16:9997 |      10.1.2.21:50206 |    3m47s |    285ms |   IDLE |SINGLE |    root |      ci |        [] |                     |3;1ad31;1ac400b |        [] | {}
>     ip-10-1-2-28:9997 |      10.1.2.18:38857 | 42s643ms |      6ms |   IDLE |SINGLE |    root |      ci |        [] |                     |3;2ef13;2ee229 |        [] | {}
>     ip-10-1-2-20:9997 |      10.1.2.20:44062 |    1m23s |  4s928ms |   IDLE |SINGLE |    root |      ci |        [] |                     |3;6b4b7;6b3c669 |        [] | {}
> {noformat}
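
Listings like the two above are easier to compare once the sessions are tallied per tablet server and state. The following is a minimal Python sketch, not part of Accumulo: it assumes each listscans row has been joined onto a single pipe-delimited line, and the names SAMPLE and tally_scans are made up for illustration. The sample rows are copied from the first listing and cut down to their first five columns.

```python
from collections import Counter, defaultdict

# Rows copied from the first listscans output, truncated to five columns.
# Assumption: every row sits on one line (no mail-client wrapping).
SAMPLE = """\
 TABLET SERVER        | CLIENT               | AGE      | LAST     | STATE
    ip-10-1-2-15:9997 |      10.1.2.14:35838 |    2m47s |      5ms |RUNNING
    ip-10-1-2-23:9997 |      10.1.2.28:42586 |    2h33m |    248ms |RUNNING
    ip-10-1-2-23:9997 |      10.1.2.18:60511 |    2h53m |    193ms | QUEUED
    ip-10-1-2-23:9997 |      10.1.2.13:40589 |    2h19m |    246ms | QUEUED
    ip-10-1-2-26:9997 |      10.1.2.16:45905 | 21s291ms |  6s841ms |   IDLE
"""


def tally_scans(text):
    """Map each tablet server to a Counter of scan states seen for it."""
    per_server = defaultdict(Counter)
    for line in text.splitlines():
        # Drop any leading mail-quote marker, then split the columns.
        cells = [c.strip() for c in line.lstrip("> ").split("|")]
        # Skip the shell prompt, blank lines, and the header row.
        if len(cells) < 5 or cells[0] == "TABLET SERVER":
            continue
        server, state = cells[0], cells[4]
        per_server[server][state] += 1
    return per_server


for server, states in sorted(tally_scans(SAMPLE).items()):
    print(server, sum(states.values()), dict(states))
```

Running the same tally over both full listings would show how the session count on ip-10-1-2-23 changes over the ~20 minutes between them.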
> I am not sure what caused things to get into this situation, but I have a
> theory.  While the mappers were running, a single AWS node was rebooted for
> some reason.  This would have caused tablets to migrate.  AccumuloInputFormat
> calculates its locality information up front; if tablets move, mappers will
> run where the tablets used to be.  So maybe a slightly higher than average
> number of mappers started reading from ip-10-1-2-23 as a result of the
> migration.  This caused those mappers to run slower, and over time more
> mappers read from ip-10-1-2-23 and things just snowballed.
> Regardless of how this situation occurred, Accumulo should handle it better
> when it does occur.  If a single tablet server has a much higher than average
> number of clients attempting to read for long periods of time, then something
> should be done.  In this case decisions could not be made based on the read
> rate, because this tserver had a much lower read rate than other tservers
> that had only 1 or 2 mappers reading.
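
One way to express the condition described above (far more long-lived readers on one tserver than the average) is a check on session counts and ages instead of read rate. The sketch below is only an illustration of that idea in Python, not a proposed Accumulo change. The names age_to_ms and overloaded_servers, the 30-minute age cutoff, and the 3x factor are all assumptions chosen for the example. The sample sessions are (tablet server, AGE) pairs taken from the first listscans output.

```python
import re
from statistics import mean

# Milliseconds per unit as printed in the listscans AGE column.
_UNIT_MS = {"h": 3_600_000, "m": 60_000, "s": 1_000, "ms": 1}


def age_to_ms(age):
    """Convert a duration such as '2h33m' or '21s291ms' to milliseconds."""
    # "ms" is listed first so it is not mistaken for minutes.
    parts = re.findall(r"(\d+)(ms|h|m|s)", age)
    return sum(int(n) * _UNIT_MS[unit] for n, unit in parts)


def overloaded_servers(sessions, min_age_ms=30 * 60_000, factor=3.0):
    """Return servers whose long-lived session count exceeds factor * mean.

    sessions: iterable of (tablet_server, age_string) pairs.
    The mean is taken over the servers that appear in `sessions`, so servers
    with no sessions at all must be added by the caller to count toward it.
    """
    counts = {}
    for server, age in sessions:
        counts.setdefault(server, 0)
        if age_to_ms(age) >= min_age_ms:
            counts[server] += 1
    if not counts:
        return []
    avg = mean(counts.values())
    return sorted(s for s, n in counts.items() if n > 0 and n > factor * avg)


SESSIONS = [
    ("ip-10-1-2-15:9997", "2m47s"),
    ("ip-10-1-2-23:9997", "2h33m"),
    ("ip-10-1-2-23:9997", "2h53m"),
    ("ip-10-1-2-23:9997", "2h19m"),
    ("ip-10-1-2-23:9997", "56m18s"),
    ("ip-10-1-2-26:9997", "21s291ms"),
    ("ip-10-1-2-18:9997", "2m2s"),
    ("ip-10-1-2-20:9997", "2m18s"),
]

print(overloaded_servers(SESSIONS))  # ['ip-10-1-2-23:9997']
```

Because the rule looks only at how many sessions have been open for a long time, it still fires when the overloaded tserver's read rate is lower than that of its peers, which is the case reported here. What the system should then do about a flagged server is left open.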



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
