clintropolis opened a new issue #5882: Coordinator load queue imbalance
URL: https://github.com/apache/incubator-druid/issues/5882
 
 
   Based on behavior observed on a coordinator in a test cluster, I believe an 
unintended consequence of #5532, which changed coordinator segment assignment 
so that historical nodes are no longer continuously told to load a segment 
until it becomes available, is that a primary assignment can now get stuck in 
one node's deep load queue while other nodes have capacity, leading to longer 
than necessary segment unavailability and blocking realtime handoff. I think 
this was in fact an issue before that fix and may explain some of the 
needlessly long load queues encountered with the coordinator from time to time, 
but it is now more apparent than it was previously. The aggravating factor is 
that nothing is ever removed from a load queue, so this needs to be taken into 
account _somehow_: the state of the cluster can change between the run that 
decides to place a segment in a particular load queue and subsequent runs.
   
   Consider a canary-style deployment to update machine images in a cloud 
provider, where a new historical node is provisioned and observed, and, if all 
is well, the remaining historical nodes are also replaced. If the coordinator 
runs at a point where only a single historical is announced, the fix in #5532 
results in that one node being assigned ALL unavailable segments, while 
historicals that appear later hang around with near-idle load queues, because 
each segment is already 'being loaded' somewhere. This drags out the time it 
takes to reach full availability and causes a large cluster imbalance (that 
does eventually right itself), as the sketch below illustrates.
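   
   To make the failure mode concrete, here is a minimal, self-contained sketch 
of the "assign once and never revisit" behavior described above, driven through 
the canary scenario. The class and method names (`Server`, `assignIfNeeded`, 
`LoadQueueImbalanceSketch`) are hypothetical stand-ins, not Druid's actual 
coordinator classes:
   ```java
   import java.util.ArrayList;
   import java.util.Collections;
   import java.util.Comparator;
   import java.util.HashSet;
   import java.util.List;
   import java.util.Set;
   
   public class LoadQueueImbalanceSketch
   {
     static class Server
     {
       final String name;
       final Set<String> served = new HashSet<>();
       final Set<String> loadQueue = new HashSet<>();
   
       Server(String name)
       {
         this.name = name;
       }
     }
   
     // Assign a segment only if no server is already serving it or has it queued.
     static void assignIfNeeded(String segmentId, List<Server> servers)
     {
       for (Server server : servers) {
         if (server.served.contains(segmentId) || server.loadQueue.contains(segmentId)) {
           // Already "being loaded" somewhere, so this run does nothing, even if
           // the chosen queue is very deep and other servers are now nearly idle.
           return;
         }
       }
       // Otherwise pick the emptiest queue (a stand-in for the real cost balancer)
       // and enqueue the load; nothing ever takes it back out of that queue.
       Server target = Collections.min(servers, Comparator.comparingInt((Server s) -> s.loadQueue.size()));
       target.loadQueue.add(segmentId);
     }
   
     public static void main(String[] args)
     {
       Server lone = new Server("historical-1");
       List<Server> cluster = new ArrayList<>(List.of(lone));
   
       // Coordinator run while only a single historical is announced:
       // every unavailable segment lands in its load queue.
       for (int i = 0; i < 20_000; i++) {
         assignIfNeeded("segment-" + i, cluster);
       }
   
       // Replacement historicals announce afterwards, but subsequent runs skip
       // all of those segments because each one is already queued somewhere.
       cluster.add(new Server("historical-2"));
       cluster.add(new Server("historical-3"));
       for (int i = 0; i < 20_000; i++) {
         assignIfNeeded("segment-" + i, cluster);
       }
   
       // Prints historical-1 queued=20000 and the new nodes queued=0, mirroring
       // the 12,681-deep queue on one node in the logs below.
       for (Server server : cluster) {
         System.out.println(server.name + " queued=" + server.loadQueue.size());
       }
     }
   }
   ```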
   
   ##### relevant log snippet
   ```
   18516004-2018-06-13T23:44:54,626 INFO [Coordinator-Exec--0] 
io.druid.server.coordinator.helper.DruidCoordinatorBalancer - [_default_tier]: 
Segments Moved: [44] Segments Let Alone: [0]
   18516179-2018-06-13T23:44:54,626 INFO [Coordinator-Exec--0] 
io.druid.server.coordinator.helper.DruidCoordinatorLogger - [_default_tier] : 
Assigned 2 segments among 5 servers
   18516344-2018-06-13T23:44:54,626 INFO [Coordinator-Exec--0] 
io.druid.server.coordinator.helper.DruidCoordinatorLogger - [_default_tier] : 
Dropped 0 segments among 5 servers
   18516508-2018-06-13T23:44:54,626 INFO [Coordinator-Exec--0] 
io.druid.server.coordinator.helper.DruidCoordinatorLogger - [_default_tier] : 
Moved 44 segment(s)
   18516657-2018-06-13T23:44:54,626 INFO [Coordinator-Exec--0] 
io.druid.server.coordinator.helper.DruidCoordinatorLogger - [_default_tier] : 
Let alone 0 segment(s)
   18516809:2018-06-13T23:44:54,626 INFO [Coordinator-Exec--0] 
io.druid.server.coordinator.helper.DruidCoordinatorLogger - Load Queues:
   18516933-2018-06-13T23:44:54,626 INFO [Coordinator-Exec--0] 
io.druid.server.coordinator.helper.DruidCoordinatorLogger - 
Server[ip-172-31-7-66.ec2.internal:8283, historical, _default_tier] has 36 left 
to load, 0 left to drop, 991,994,042 bytes queued, 48,377,883,776 bytes served.
   18517204-2018-06-13T23:44:54,626 INFO [Coordinator-Exec--0] 
io.druid.server.coordinator.helper.DruidCoordinatorLogger - 
Server[ip-172-31-12-20.ec2.internal:8283, historical, _default_tier] has 1 left 
to load, 0 left to drop, 14,313 bytes queued, 92,762,210,971 bytes served.
   18517470-2018-06-13T23:44:54,627 INFO [Coordinator-Exec--0] 
io.druid.server.coordinator.helper.DruidCoordinatorLogger - 
Server[ip-172-31-8-78.ec2.internal:8283, historical, _default_tier] has 5 left 
to load, 0 left to drop, 1,378,762,871 bytes queued, 99,371,506,890 bytes 
served.
   18517742-2018-06-13T23:44:54,627 INFO [Coordinator-Exec--0] 
io.druid.server.coordinator.helper.DruidCoordinatorLogger - 
Server[ip-172-31-3-9.ec2.internal:8283, historical, _default_tier] has 1 left 
to load, 0 left to drop, 1,925,142 bytes queued, 101,943,239,778 bytes served.
   18518010-2018-06-13T23:44:54,627 INFO [Coordinator-Exec--0] 
io.druid.server.coordinator.helper.DruidCoordinatorLogger - 
Server[ip-172-31-11-223.ec2.internal:8283, historical, _default_tier] has 1 
left to load, 0 left to drop, 43,619,569,242 bytes queued, 117,176,111,771 
bytes served.
   
   
   28201415-2018-06-14T00:43:00,627 INFO [Coordinator-Exec--0] 
io.druid.server.coordinator.helper.DruidCoordinatorBalancer - [_default_tier]: 
One or fewer servers found.  Cannot balance.
   28201590-2018-06-14T00:43:00,627 INFO [Coordinator-Exec--0] 
io.druid.server.coordinator.helper.DruidCoordinatorLogger - [_default_tier] : 
Assigned 20692 segments among 1 servers
   28201759:2018-06-14T00:43:00,627 INFO [Coordinator-Exec--0] 
io.druid.server.coordinator.helper.DruidCoordinatorLogger - Load Queues:
   28201883-2018-06-14T00:43:00,628 INFO [Coordinator-Exec--0] 
io.druid.server.coordinator.helper.DruidCoordinatorLogger - 
Server[ip-172-31-7-66.ec2.internal:8283, historical, _default_tier] has 12,681 
left to load, 0 left to drop, 119,588,843,441 bytes queued, 48,611,264,826 
bytes served.
   
   28315379-2018-06-14T00:43:42,678 INFO [Coordinator-Exec--0] 
io.druid.server.coordinator.helper.DruidCoordinatorBalancer - [_default_tier]: 
Segments Moved: [50] Segments Let Alone: [0]
   28315554-2018-06-14T00:43:42,678 INFO [Coordinator-Exec--0] 
io.druid.server.coordinator.helper.DruidCoordinatorLogger - [_default_tier] : 
Assigned 124 segments among 4 servers
   28315721-2018-06-14T00:43:42,678 INFO [Coordinator-Exec--0] 
io.druid.server.coordinator.helper.DruidCoordinatorLogger - [_default_tier] : 
Moved 50 segment(s)
   28315870-2018-06-14T00:43:42,678 INFO [Coordinator-Exec--0] 
io.druid.server.coordinator.helper.DruidCoordinatorLogger - [_default_tier] : 
Let alone 0 segment(s)
   28316022:2018-06-14T00:43:42,678 INFO [Coordinator-Exec--0] 
io.druid.server.coordinator.helper.DruidCoordinatorLogger - Load Queues:
   28316146-2018-06-14T00:43:42,679 INFO [Coordinator-Exec--0] 
io.druid.server.coordinator.helper.DruidCoordinatorLogger - 
Server[ip-172-31-4-113.ec2.internal:8283, historical, _default_tier] has 53 
left to load, 0 left to drop, 375,477,157 bytes queued, 0 bytes served.
   28316405-2018-06-14T00:43:42,679 INFO [Coordinator-Exec--0] 
io.druid.server.coordinator.helper.DruidCoordinatorLogger - 
Server[ip-172-31-10-72.ec2.internal:8283, historical, _default_tier] has 46 
left to load, 0 left to drop, 1,195,547,611 bytes queued, 0 bytes served.
   28316666-2018-06-14T00:43:42,679 INFO [Coordinator-Exec--0] 
io.druid.server.coordinator.helper.DruidCoordinatorLogger - 
Server[ip-172-31-1-8.ec2.internal:8283, historical, _default_tier] has 47 left 
to load, 0 left to drop, 1,193,785,951 bytes queued, 0 bytes served.
   28316925-2018-06-14T00:43:42,679 INFO [Coordinator-Exec--0] 
io.druid.server.coordinator.helper.DruidCoordinatorLogger - 
Server[ip-172-31-7-66.ec2.internal:8283, historical, _default_tier] has 12,615 
left to load, 0 left to drop, 119,201,301,693 bytes queued, 111,250,288,596 
bytes served.
   ```
   ##### under replicated segment metrics
   <img width="1444" alt="screen shot 2018-06-14 at 3 16 10 am" 
src="https://user-images.githubusercontent.com/1577461/41437654-e979d21e-6fd9-11e8-8447-1c62b4885bed.png";>
   
   ##### load queue metrics
   <img width="1451" alt="screen shot 2018-06-14 at 3 21 34 am" 
src="https://user-images.githubusercontent.com/1577461/41437699-06c0a3c0-6fda-11e8-84fd-331e94baa19d.png";>
   
   I'm currently experimenting with a couple of potential fixes and will follow 
up with a PR as soon as possible.
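   
   Purely as an illustration of one possible direction (not necessarily what 
the eventual PR will do), the coordinator could treat queued-but-unstarted 
loads as revisitable, for example by draining disproportionately deep queues so 
that the next run re-places those segments against the current set of 
historicals. Reusing the hypothetical `Server` model from the sketch above 
(`drainOverloadedQueues` and `maxQueueRatio` are made-up names):
   ```java
   // Illustrative only: drop queued-but-unstarted loads from any server whose
   // queue is far deeper than the tier average, so the next coordinator run
   // re-assigns those segments against the cluster as it currently looks.
   // Repeated coordinator runs converge toward a balanced assignment.
   static void drainOverloadedQueues(List<Server> servers, double maxQueueRatio)
   {
     double avgQueued = servers.stream()
                               .mapToInt(s -> s.loadQueue.size())
                               .average()
                               .orElse(0);
     double limit = Math.max(1.0, avgQueued * maxQueueRatio);
     for (Server server : servers) {
       java.util.Iterator<String> it = server.loadQueue.iterator();
       while (server.loadQueue.size() > limit && it.hasNext()) {
         it.next();
         it.remove(); // back to "unassigned"; the next run will place it again
       }
     }
   }
   ```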
