David Rorke created IMPALA-10481:
------------------------------------

             Summary: Lack of TServer affinity in remote Kudu scans results in 
bad OS buffer cache behavior on tablet servers
                 Key: IMPALA-10481
                 URL: https://issues.apache.org/jira/browse/IMPALA-10481
             Project: IMPALA
          Issue Type: Bug
          Components: Backend
    Affects Versions: Impala 4.0
            Reporter: David Rorke


Remote Kudu scans can take many iterations against the same scan range before 
achieving good performance if the OS buffer cache is initially cold on the 
tablet servers.  The slow warmup of the buffer cache is exacerbated by the fact 
that remote scans in the default Impala config choose a tablet server at random 
from the replica candidates.  The Kudu client supports a LEADER_ONLY option 
that provides hard affinity to the leader replica, and Impala allows this to be 
configured using the --pick_only_leaders_for_tests option, but this is 
currently considered a testing only option and by default Impala will connect 
to a random replica.

The following is a series of iterations of TPC-DS query 33 (times in seconds), 
against a freshly started Kudu cluster, in 3 configurations (1) local reads, 
with Impala running on Kudu cluster, (2) remote reads from separate Impala 
cluster with default config, (3) remote reads with 
pick_only_leaders_for_tests=true (LEADER_ONLY affinity)

 
||Config||Iteration 1||Iter 2||Iter 3||Iter 4||Iter 5||Iter 6||Iter 7||Iter 
8||Iter 9||
|Local|111.4|14.6| | | | | | | |
|Remote (default config)|110.8|56.9|49.9|43.3|37.3|44.0|20.0|28.9|14.9|
|Remote (LEADER_ONLY)|120.1|16.2|15.7|14.2| | | | | |

With pick_only_leaders_for_tests, the remote performance improves quickly, 
approaching local performance on the second iteration and warming up fully by 
iteration 4.   In the default config it takes 9 iterations of the query before 
we see the same performance.

Running similar experiments after explicitly dropping the buffer cache on the 
tablet servers confirmed that this slow warmup is caused by poor buffer cache 
hit rates until the cache is fully warm.

I suspect that slow warmup isn't the only consequence of this.  Caching a given 
tablet in the buffer cache on multiple tablet servers increases the overall 
buffer cache footprint and will increase tserver memory pressure under load.

We should consider setting the LEADER_ONLY option by default for remote Kudu 
reads.  The only concern would be that this might result in worse load 
balancing and hotspots, in which case Kudu might need to implement some 
additional connection option that provides a better mix of affinity and load 
balancing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to