Hey folks,

I recently discovered that when we take historicals down for maintenance, we get pretty significant query latency spikes across that node's tier.
These spikes seem to be related to contention from ZKCoordinator threads unzipping segments from deep storage to replace those from the stopped historical. The default value of druid.segmentCache.numLoadingThreads is the number of cores on the host. I haven't done any detailed profiling to be sure, but intuitively a lower default seems safer to avoid contending with query workloads; at least, setting it to a much lower value looks to have fixed our problem.

Maybe I've only noticed it because there's something unique in our setup, so I'm curious whether anyone else has experienced something similar. It would also be interesting to hear from anyone who can confirm that they don't see a latency impact when nodes are taken down with this config at its default value.

(This is all on 0.16.1, by the way; I haven't tried to replicate it on a newer version yet.)

Best regards,
Dylan
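P.S. For anyone who wants to compare, the override is a one-liner in the historical's runtime.properties. The value 2 below is just what worked for us on this cluster, not a recommendation; tune it against your own core count and query load:

    # Cap segment-loading threads so downloads/unzips don't starve query threads
    # (default is the number of cores on the host)
    druid.segmentCache.numLoadingThreads=2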