[
https://issues.apache.org/jira/browse/ACCUMULO-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Josh Elser updated ACCUMULO-2677:
---------------------------------
Fix Version/s: (was: 1.7.0)
1.8.0
> Single node bottle neck during map reduce
> -----------------------------------------
>
> Key: ACCUMULO-2677
> URL: https://issues.apache.org/jira/browse/ACCUMULO-2677
> Project: Accumulo
> Issue Type: Improvement
> Affects Versions: 1.4.0
> Environment: 1.6.0-RC2, Hadoop 2.2.0, AWS 20 node cluster
> Reporter: Keith Turner
> Fix For: 1.8.0
>
>
> While running the verification map reduce job as part of the continuous
> ingest test, I noticed the map phase was taking longer than expected. I had
> run 24 hours of ingest and then verification. There were 2048 tablets and
> ~32B entries. The listscans command showed that many mappers were reading
> from one node. That single tserver was thrashing and had a much lower
> aggregate read rate than tservers that had only a few mappers reading
> (roughly 35KV/s vs 150KV/s).
> Below is the output of listscans:
> {noformat}
> root@test160> listscans
> TABLET SERVER | CLIENT | AGE | LAST | STATE |
> TYPE | USER | TABLE | COLUMNS | AUTHORIZATIONS | TABLET |
> ITERATORS | ITERATOR OPTIONS
> ip-10-1-2-15:9997 | 10.1.2.14:35838 | 2m47s | 5ms |RUNNING
> |SINGLE | root | ci | [] |
> |3;121e33;120f25 | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.28:42586 | 2h33m | 248ms |RUNNING
> |SINGLE | root | ci | [] | |3;422d5;421e3d
> | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.18:60511 | 2h53m | 193ms | QUEUED
> |SINGLE | root | ci | [] | |3;554b7;553c5e
> | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.13:40589 | 2h19m | 246ms | QUEUED
> |SINGLE | root | ci | [] |
> |3;7f1e43e;7f0f3 | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.25:55164 | 56m18s | 73ms | QUEUED
> |SINGLE | root | ci | [] |
> |3;1bf149b;1be238 | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.28:42618 | 2h26m | 263ms |RUNNING
> |SINGLE | root | ci | [] | |3;555a83;554b7
> | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.17:60869 | 1h47m | 131ms |RUNNING
> |SINGLE | root | ci | [] |
> |3;42e206d;42d2f | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.14:59576 | 4m31s | 71ms |RUNNING
> |SINGLE | root | ci | [] |
> |3;225a77;224b6be | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.27:35342 | 3h1m | 252ms |RUNNING
> |SINGLE | root | ci | [] | |3;6587a;65789e
> | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.16:36073 | 2h31m | 131ms | QUEUED
> |SINGLE | root | ci | [] | |3;41f107;41e1f
> | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.13:40526 | 2h29m | 350ms |RUNNING
> |SINGLE | root | ci | [] | |3;423c6;422d5
> | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.18:60560 | 2h37m | 344ms |RUNNING
> |SINGLE | root | ci | [] | |3;424b6f;423c6
> | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.29:45044 | 1h17m | 253ms | QUEUED
> |SINGLE | root | ci | [] | |3;1d0048;1cf14
> | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.20:48103 | 3h12m | 277ms |RUNNING
> |SINGLE | root | ci | [] |
> |3;400f13;4000000000000004 | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.16:36053 | 2h33m | 230ms |RUNNING
> |SINGLE | root | ci | [] | |3;28b4f;28a5e2
> | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.15:39470 | 2h57m | 269ms |RUNNING
> |SINGLE | root | ci | [] | |3;2787b;2778a8
> | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.26:53819 | 3h27m | 449ms |RUNNING
> |SINGLE | root | ci | [] |
> |3;430f37;43002ac | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.17:32894 | 1h9m | 31ms |RUNNING
> |SINGLE | root | ci | [] |
> |3;224b6be;223c5c | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.16:36351 | 49m54s | 263ms | QUEUED
> |SINGLE | root | ci | [] |
> |3;5eb4fc;5ea5e7a | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.20:48227 | 2h46m | 116ms |RUNNING
> |SINGLE | root | ci | [] |
> |3;5b4b7;5b3c68e | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.15:39676 | 1h57m | 262ms |RUNNING
> |SINGLE | root | ci | [] | |3;5e96d;5e87bd
> | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.14:58104 | 2h15m | 245ms |RUNNING
> |SINGLE | root | ci | [] | |3;545a7f;544b7
> | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.27:35745 | 1h52m | 231ms | QUEUED
> |SINGLE | root | ci | [] |
> |3;417895c;41698 | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.22:40331 | 2h30m | 192ms |RUNNING
> |SINGLE | root | ci | [] |
> |3;54c3d4f;54b4c | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.17:60923 | 1h32m | 261ms | QUEUED
> |SINGLE | root | ci | [] |
> |3;5f004fc;5ef13f | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.28:42506 | 2h54m | 117ms | QUEUED
> |SINGLE | root | ci | [] | |3;404b67;403c5
> | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.22:40342 | 2h29m | 34ms | QUEUED
> |SINGLE | root | ci | [] |
> |3;6bc3f;6bb4e5d | [] | {}
> ip-10-1-2-26:9997 | 10.1.2.16:45905 | 21s291ms | 6s841ms | IDLE
> |SINGLE | root | ci | [] |
> |3;4396ab;43879c | [] | {}
> ip-10-1-2-18:9997 | 10.1.2.26:48600 | 2m2s | 5ms | IDLE
> |SINGLE | root | ci | [] |
> |3;4b1e32;4b0f22 | [] | {}
> ip-10-1-2-20:9997 | 10.1.2.21:36546 | 2m18s | 7s920ms | IDLE
> |SINGLE | root | ci | [] |
> |3;601e91;600f83 | [] | {}
> {noformat}
> Below is the output ~20 min later.
> {noformat}
> root@test160> listscans
> TABLET SERVER | CLIENT | AGE | LAST | STATE |
> TYPE | USER | TABLE | COLUMNS | AUTHORIZATIONS | TABLET |
> ITERATORS | ITERATOR OPTIONS
> ip-10-1-2-15:9997 | 10.1.2.14:36125 | 3m10s | 3ms | IDLE
> |SINGLE | root | ci | [] |
> |3;1c4ba5a;1c3c9 | [] | {}
> ip-10-1-2-14:9997 | 10.1.2.16:35327 | 5m9s | 1ms |RUNNING
> |SINGLE | root | ci | [] |
> |3;1b0f58c;1b004c | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.28:42586 | 2h54m | 509ms |RUNNING
> |SINGLE | root | ci | [] | |3;422d5;421e3d
> | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.13:40589 | 2h40m | 251ms |RUNNING
> |SINGLE | root | ci | [] |
> |3;7f1e43e;7f0f3 | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.25:55164 | 1h17m | 26ms |RUNNING
> |SINGLE | root | ci | [] |
> |3;1bf149b;1be238 | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.28:42618 | 2h47m | 455ms |RUNNING
> |SINGLE | root | ci | [] | |3;555a83;554b7
> | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.17:60869 | 2h8m | 352ms |RUNNING
> |SINGLE | root | ci | [] |
> |3;42e206d;42d2f | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.14:59576 | 25m43s | 112ms |RUNNING
> |SINGLE | root | ci | [] |
> |3;225a77;224b6be | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.27:35342 | 3h22m | 113ms | QUEUED
> |SINGLE | root | ci | [] | |3;6587a;65789e
> | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.16:36073 | 2h52m | 299ms |RUNNING
> |SINGLE | root | ci | [] | |3;41f107;41e1f
> | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.13:40526 | 2h50m | 71ms |RUNNING
> |SINGLE | root | ci | [] | |3;423c6;422d5
> | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.18:60560 | 2h58m | 160ms | QUEUED
> |SINGLE | root | ci | [] | |3;424b6f;423c6
> | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.29:45044 | 1h39m | 426ms |RUNNING
> |SINGLE | root | ci | [] | |3;1d0048;1cf14
> | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.16:36053 | 2h54m | 184ms | QUEUED
> |SINGLE | root | ci | [] | |3;28b4f;28a5e2
> | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.26:53819 | 3h48m | 263ms |RUNNING
> |SINGLE | root | ci | [] |
> |3;430f37;43002ac | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.17:32894 | 1h30m | 163ms | QUEUED
> |SINGLE | root | ci | [] |
> |3;224b6be;223c5c | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.16:36351 | 1h11m | 180ms |RUNNING
> |SINGLE | root | ci | [] |
> |3;5eb4fc;5ea5e7a | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.20:48227 | 3h7m | 317ms |RUNNING
> |SINGLE | root | ci | [] |
> |3;5b4b7;5b3c68e | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.15:39676 | 2h19m | 160ms |RUNNING
> |SINGLE | root | ci | [] | |3;5e96d;5e87bd
> | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.14:58104 | 2h36m | 238ms |RUNNING
> |SINGLE | root | ci | [] | |3;545a7f;544b7
> | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.27:35745 | 2h13m | 162ms |RUNNING
> |SINGLE | root | ci | [] |
> |3;417895c;41698 | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.22:40331 | 2h51m | 72ms | QUEUED
> |SINGLE | root | ci | [] |
> |3;54c3d4f;54b4c | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.17:60923 | 1h53m | 27ms | QUEUED
> |SINGLE | root | ci | [] |
> |3;5f004fc;5ef13f | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.28:42506 | 3h16m | 268ms |RUNNING
> |SINGLE | root | ci | [] | |3;404b67;403c5
> | [] | {}
> ip-10-1-2-23:9997 | 10.1.2.22:40342 | 2h51m | 239ms | QUEUED
> |SINGLE | root | ci | [] |
> |3;6bc3f;6bb4e5d | [] | {}
> ip-10-1-2-29:9997 | 10.1.2.15:56044 | 4s505ms | 3ms | IDLE
> |SINGLE | root | ci | [] | |3;602da;601e91
> | [] | {}
> ip-10-1-2-16:9997 | 10.1.2.21:50234 | 51s534ms | 9ms |RUNNING
> |SINGLE | root | ci | [] |
> |3;20f12;20e218f | [] | {}
> ip-10-1-2-16:9997 | 10.1.2.26:50232 | 3m10s | 5ms |RUNNING
> |SINGLE | root | ci | [] | |3;5e4b66;5e3c5
> | [] | {}
> ip-10-1-2-16:9997 | 10.1.2.21:50206 | 3m47s | 285ms | IDLE
> |SINGLE | root | ci | [] |
> |3;1ad31;1ac400b | [] | {}
> ip-10-1-2-28:9997 | 10.1.2.18:38857 | 42s643ms | 6ms | IDLE
> |SINGLE | root | ci | [] | |3;2ef13;2ee229
> | [] | {}
> ip-10-1-2-20:9997 | 10.1.2.20:44062 | 1m23s | 4s928ms | IDLE
> |SINGLE | root | ci | [] |
> |3;6b4b7;6b3c669 | [] | {}
> {noformat}
> I am not sure what caused things to get into this situation, but I have a
> theory. While the mappers were running, a single AWS node was rebooted for
> some reason. This would have caused tablets to migrate. AccumuloInputFormat
> calculates its locality information up front, so if tablets move, mappers
> will run where the tablets used to be. So maybe a slightly higher than
> average number of mappers started reading from ip-10-1-2-23 as a result of
> the migration. This caused those mappers to run slower, and over time more
> mappers read from ip-10-1-2-23 and things just snowballed.
> Regardless of how this situation occurred, Accumulo should handle it better
> when it does occur. If a single tablet server has a much higher than average
> number of clients attempting to read for long periods of time, then
> something should be done. In this case decisions could not be based on the
> read rate, because this tserver had a much lower read rate than other
> tservers that had only 1 or 2 mappers reading.
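
As a rough illustration of the check suggested above, the sketch below flags
tablet servers whose number of scanning clients is far above what is typical,
without looking at read rates. It is not Accumulo code: the function name,
the `factor` and `min_clients` thresholds, and the use of the median instead
of the mean are all assumptions made for this example. The counts are tallied
from the first listscans output (26 scans on ip-10-1-2-23, one each on four
other servers); servers with no active scans do not appear in listscans and
are left out here.

```python
from statistics import median


def overloaded_tservers(scan_counts, factor=3.0, min_clients=4):
    """Return tablet servers with far more scanning clients than is typical.

    scan_counts maps a tablet server name to its number of active scans.
    factor and min_clients are hypothetical thresholds, not Accumulo settings.
    """
    if not scan_counts:
        return []
    # The median is less distorted by one overloaded server than the mean.
    typical = max(median(scan_counts.values()), 1)
    return sorted(
        server
        for server, count in scan_counts.items()
        if count >= min_clients and count > factor * typical
    )


# Scans per tablet server, counted from the first listscans output.
first_listing = {
    "ip-10-1-2-23": 26,
    "ip-10-1-2-15": 1,
    "ip-10-1-2-26": 1,
    "ip-10-1-2-18": 1,
    "ip-10-1-2-20": 1,
}

print(overloaded_tservers(first_listing))  # ['ip-10-1-2-23']
```

To cover the "for long periods of time" part, a caller would presumably take
such a sample repeatedly and act only if the same server is flagged in
several consecutive samples.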