Lars Volker created IMPALA-6834:
-----------------------------------

             Summary: Enforce consistent, pseudo-random replica order during 
local, non-random scheduling
                 Key: IMPALA-6834
                 URL: https://issues.apache.org/jira/browse/IMPALA-6834
             Project: IMPALA
          Issue Type: Improvement
          Components: Backend, Frontend
    Affects Versions: Impala 2.12.0
            Reporter: Lars Volker


When scheduling local, non-cached reads, the scheduler breaks ties between 
otherwise equivalent replicas by picking the first one in the list of 
candidates as reported by the frontend 
([scheduler.h:375|https://github.com/apache/impala/blob/master/be/src/scheduling/scheduler.h#L375]).
 We do this to optimize the use of OS buffer caches. We also provide the 
random_replica query option to improve cases where the default behavior leads 
to CPU hotspots.

The frontend merely passes the replicas of a block to the backend in the order 
they are returned from HDFS. This can result in poor load distribution for 
scans with many small scan ranges,  depending on the order in which HDFS 
returns block locations.

To alleviate this we should add another level of pseudo-random shuffling. The 
main idea is to change the order in which replicas are picked, but to keep the 
order consistent across backends. 

For example, we can compute a hash for each replica over (replica address, 
THdfsFileSplit::file_name, THdfsFileSplit::offset) and use this as a key to 
find the min_element in the list of candidates 
[scheduler.cc:L772|https://github.com/apache/impala/blob/master/be/src/scheduling/scheduler.cc#L772].
 Then we use this element instead of the first one.

We need to make sure that we run sufficiently comprehensive performance tests 
on this change.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to