[
https://issues.apache.org/jira/browse/IMPALA-7928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709460#comment-16709460
]
Philip Zeyliger commented on IMPALA-7928:
-----------------------------------------
We're roughly proposing to change the following code so that, instead of picking
the least-used executor from the heap, it picks the least-used of, say, five
executors chosen by five distinct hashes of the filename. That should limit
skew (though we've not worked out a model for that) while letting the file
handle cache scale linearly with cluster size.
{code}
const IpAddr* Scheduler::AssignmentCtx::SelectRemoteExecutor() {
  const IpAddr* candidate_ip;
  if (HasUnusedExecutors()) {
    // Pick next unused executor.
    candidate_ip = GetNextUnusedExecutorAndIncrement();
  } else {
    // Pick next executor from assignment_heap. All executors must have been inserted
    // into the heap at this point.
    DCHECK_GT(executors_config_.NumBackends(), 0);
    DCHECK_EQ(executors_config_.NumBackends(), assignment_heap_.size());
    candidate_ip = &(assignment_heap_.top().ip);
  }
  DCHECK(candidate_ip != nullptr);
  return candidate_ip;
}
{code}
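For illustration only, here is a minimal standalone sketch of the hash-based selection described above. It is not the Impala scheduler code: ExecutorLoad, SeededHash, and SelectRemoteExecutorByHash are hypothetical names, the hash is a seeded FNV-1a variant picked just because it is deterministic, and the function returns an index rather than a const IpAddr*. It also assumes the executor list is in the same order on every call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-executor state kept by the scheduler.
struct ExecutorLoad {
  std::string ip;
  int64_t assigned_bytes = 0;
};

// Seeded FNV-1a variant. Assumption: any deterministic hash family would do;
// this one is used only to keep the sketch self-contained.
uint64_t SeededHash(const std::string& s, uint64_t seed) {
  uint64_t h = 14695981039346656037ULL ^ (seed * 0x9E3779B97F4A7C15ULL);
  for (unsigned char c : s) {
    h ^= c;
    h *= 1099511628211ULL;
  }
  return h;
}

// Returns the index of the least-loaded executor among the candidates picked
// by `num_candidates` distinct hashes of `filename`. Two hashes may land on
// the same executor, so there can be fewer distinct candidates than hashes.
// Ties go to the candidate produced by the lowest seed.
size_t SelectRemoteExecutorByHash(const std::vector<ExecutorLoad>& executors,
                                  const std::string& filename,
                                  int num_candidates = 5) {
  assert(!executors.empty());
  assert(num_candidates > 0);
  size_t best = SeededHash(filename, 0) % executors.size();
  for (int i = 1; i < num_candidates; ++i) {
    size_t idx = SeededHash(filename, static_cast<uint64_t>(i)) % executors.size();
    if (executors[idx].assigned_bytes < executors[best].assigned_bytes) best = idx;
  }
  return best;
}
```

With this shape, repeated scans of the same file can only land on one of at most five executors, so only those nodes would need to cache its handle. How much skew that leaves is the part we have not modeled.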
> Investigate consistent placement of remote scan ranges
> ------------------------------------------------------
>
> Key: IMPALA-7928
> URL: https://issues.apache.org/jira/browse/IMPALA-7928
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 3.2.0
> Reporter: Joe McDonnell
> Priority: Major
>
> With the file handle cache, it is useful for repeated scans of the same file
> to go to the same node, as that node will already have a file handle cached.
> When scheduling remote ranges, the scheduler introduces randomness that can
> spread reads across all of the nodes. Repeated executions of queries on the
> same set of files will not schedule the remote reads on the same nodes. This
> causes a large amount of duplication across file handle caches on different
> nodes. This reduces the efficiency of the cache significantly.
> It may be useful for the scheduler to introduce some determinism in
> scheduling remote reads to take advantage of the file handle cache. This is a
> variation on the well-known tradeoff between skew and locality.
>