Tim Armstrong created IMPALA-10347:
--------------------------------------
Summary: Explore approaches to optimizing queries that will likely
be short-circuited by limits
Key: IMPALA-10347
URL: https://issues.apache.org/jira/browse/IMPALA-10347
Project: IMPALA
Issue Type: Improvement
Components: Distributed Exec
Reporter: Tim Armstrong
Based on discussion with [~amansinha], there are opportunities beyond
IMPALA-10314 to optimize queries that have a limit and are *unlikely* to scan
many files.
The problem is that we do all the work to generate scan ranges and schedule
them upfront, which adds a lot of overhead if only a small number of files
actually need to be processed.
A few ideas we had:
* Parallelize and/or otherwise optimize the scan range generation
* Speculatively execute the query on a subset of files and then cancel and
retry if we hit the limit
* Incrementally generate scan ranges and assign them to executors so that scan
range generation and execution can be overlapped. This is the most general
solution but also has a lot of knock-on implications for other subsystems, like
cardinality/memory estimation, scheduling, query execution, query coordination,
etc.
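
As a rough illustration of the third idea, here is a minimal single-process
sketch in Python. It is not Impala code: `generate_scan_ranges`,
`run_with_limit`, the `batch_size` parameter, and the representation of a file
as a (name, rows) pair are all hypothetical names made up for this example. Scan
ranges are produced lazily, one small batch at a time, and generation stops as
soon as the limit is satisfied. The sketch runs sequentially, so it shows only
the "generate on demand" half; real overlap of generation and execution would
need concurrency and the coordination changes mentioned above.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

# Hypothetical stand-in: a "file" is (name, rows) and maps to one scan range.
File = Tuple[str, List[int]]


def generate_scan_ranges(files: Iterable[File], counter: List[int]) -> Iterator[File]:
    """Lazily yield one scan range per file, counting how many were generated."""
    for f in files:
        counter[0] += 1
        yield f


def run_with_limit(files: Iterable[File], limit: int,
                   batch_size: int = 2) -> Tuple[List[int], int]:
    """Return (rows, ranges_generated), generating scan ranges only on demand."""
    counter = [0]
    ranges = generate_scan_ranges(files, counter)
    rows: List[int] = []
    while len(rows) < limit:
        # Pull the next small batch of scan ranges instead of all of them upfront.
        batch = list(islice(ranges, batch_size))
        if not batch:
            break  # every file has been scanned; the limit was never reached
        for _name, file_rows in batch:
            rows.extend(file_rows)
            if len(rows) >= limit:
                break  # limit satisfied; skip the rest of this batch
    return rows[:limit], counter[0]


if __name__ == "__main__":
    files = [(f"f{i}", [i * 10 + j for j in range(3)]) for i in range(100)]
    rows, generated = run_with_limit(files, limit=4)
    print(rows, generated)
```

With 100 three-row files and a limit of 4, only the first batch of scan ranges
is ever generated. A smaller `batch_size` wastes less generation work but would
mean more scheduling round trips in a real system.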
--
This message was sent by Atlassian Jira
(v8.3.4#803005)