Csaba Ringhofer created IMPALA-11561:
----------------------------------------
Summary: Improve intra-node scheduling of scan ranges
Key: IMPALA-11561
URL: https://issues.apache.org/jira/browse/IMPALA-11561
Project: IMPALA
Issue Type: Improvement
Components: Backend
Reporter: Csaba Ringhofer
This ticket is a follow-up to IMPALA-11539 /
https://gerrit.cloudera.org/#/c/18929/ , as several improvement ideas came up
during the review.
The commit above changes intra-node scan range scheduling in the mt_dop != 0
case to process scan ranges in descending order of size, which reduces skew
among fragment instances. Previously the order was random, except that files in
the HDFS cache were handled before files not in the HDFS cache.
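To illustrate why largest-first ordering reduces skew, here is a minimal
simulation. It is not Impala code: it assumes that instances pull the next
range from a shared queue as soon as they are free and that cost is
proportional to range size. The function name and the sizes are made up for
the example.

```python
import heapq


def makespan(sizes, num_instances):
    """Finish time of the slowest instance when ranges are taken in order.

    Assumption: each instance takes the next range as soon as it is free,
    and processing time is proportional to range size.
    """
    finish_times = [0] * num_instances
    heapq.heapify(finish_times)
    for size in sizes:
        # The instance that becomes free first picks up the next range.
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + size)
    return max(finish_times)


sizes = [10, 20, 30, 40, 100]  # hypothetical scan range sizes
print(makespan(sizes, 2))                        # 140: big range starts last
print(makespan(sorted(sizes, reverse=True), 2))  # 100: both instances end together
```

With the large range scheduled last, one instance finishes at 60 while the
other runs until 140; with descending order both finish at 100.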
The following ideas came up:
1. Take caching into account and process scan ranges with more cached bytes /
file handles first. This way we could avoid evicting these from the cache
during scanning.
2. Take disk id into account and try to process files from different disks in
parallel.
3. Estimate CPU cost with something more sophisticated than scan size, and
order by that estimate.
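The three ideas could be combined roughly as in the sketch below. This is only
an illustration under assumptions, not a proposed implementation: the
ScanRange fields (cached_bytes, disk_id, est_cpu_cost) and the order_ranges
function are hypothetical names, and how to weigh the criteria against each
other is an open question.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import zip_longest


@dataclass(frozen=True)
class ScanRange:  # hypothetical record, not an Impala type
    path: str
    cached_bytes: int
    disk_id: int
    est_cpu_cost: float


def order_ranges(ranges):
    # Ideas 1 and 3: most cached bytes first, then highest estimated cost.
    ranked = sorted(ranges, key=lambda r: (-r.cached_bytes, -r.est_cpu_cost))
    # Idea 2: group by disk, keeping the ranking within each disk, then take
    # one range per disk in turn so consecutive ranges hit different disks.
    per_disk = defaultdict(list)
    for r in ranked:
        per_disk[r.disk_id].append(r)
    ordered = []
    for group in zip_longest(*per_disk.values()):
        ordered.extend(r for r in group if r is not None)
    return ordered


ranges = [
    ScanRange("a", cached_bytes=0, disk_id=0, est_cpu_cost=5.0),
    ScanRange("b", cached_bytes=0, disk_id=0, est_cpu_cost=3.0),
    ScanRange("c", cached_bytes=10, disk_id=1, est_cpu_cost=1.0),
    ScanRange("d", cached_bytes=0, disk_id=1, est_cpu_cost=4.0),
]
print([r.path for r in order_ranges(ranges)])  # ['c', 'a', 'd', 'b']
```

Note that interleaving by disk relaxes the global ranking, so the two goals
trade off against each other.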