David Rorke created IMPALA-10481:
------------------------------------
Summary: Lack of TServer affinity in remote Kudu scans results in
bad OS buffer cache behavior on tablet servers
Key: IMPALA-10481
URL: https://issues.apache.org/jira/browse/IMPALA-10481
Project: IMPALA
Issue Type: Bug
Components: Backend
Affects Versions: Impala 4.0
Reporter: David Rorke
Remote Kudu scans can take many iterations against the same scan range before
achieving good performance if the OS buffer cache is initially cold on the
tablet servers. The slow warmup of the buffer cache is exacerbated by the fact
that remote scans in the default Impala config choose a tablet server at random
from the replica candidates. The Kudu client supports a LEADER_ONLY option
that provides hard affinity to the leader replica, and Impala allows this to be
configured using the --pick_only_leaders_for_tests option, but this is
currently considered a testing only option and by default Impala will connect
to a random replica.
The following is a series of iterations of TPC-DS query 33 (times in seconds),
against a freshly started Kudu cluster, in 3 configurations (1) local reads,
with Impala running on Kudu cluster, (2) remote reads from separate Impala
cluster with default config, (3) remote reads with
pick_only_leaders_for_tests=true (LEADER_ONLY affinity)
||Config||Iteration 1||Iter 2||Iter 3||Iter 4||Iter 5||Iter 6||Iter 7||Iter
8||Iter 9||
|Local|111.4|14.6| | | | | | | |
|Remote (default config)|110.8|56.9|49.9|43.3|37.3|44.0|20.0|28.9|14.9|
|Remote (LEADER_ONLY)|120.1|16.2|15.7|14.2| | | | | |
With pick_only_leaders_for_tests, the remote performance improves quickly,
approaching local performance on the second iteration and warming up fully by
iteration 4. In the default config it takes 9 iterations of the query before
we see the same performance.
Running similar experiments after explicitly dropping the buffer cache on the
tablet servers confirmed that this slow warmup is caused by poor buffer cache
hit rates until the cache is fully warm.
I suspect that slow warmup isn't the only consequence of this. Caching a given
tablet in the buffer cache on multiple tablet servers increases the overall
buffer cache footprint and will increase tserver memory pressure under load.
We should consider setting the LEADER_ONLY option by default for remote Kudu
reads. The only concern would be that this might result in worse load
balancing and hotspots, in which case Kudu might need to implement some
additional connection option that provides a better mix of affinity and load
balancing.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]