[
https://issues.apache.org/jira/browse/IMPALA-12029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726875#comment-17726875
]
ASF subversion and git services commented on IMPALA-12029:
----------------------------------------------------------
Commit dbddb0844713677cd5165c55fe21ef46238d3e24 in impala's branch
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=dbddb0844 ]
IMPALA-12120: Limit output writer parallelism based on write volume
The new processing cost-based planner changes (IMPALA-11604,
IMPALA-12091) will impact output writer parallelism for insert queries,
with the potential for more small files if the processing cost-based
planning results in too many writer fragments. It can further exacerbate
a problem introduced by MT_DOP (see IMPALA-8125).
The MAX_FS_WRITERS query option can help mitigate this. But even when
MAX_FS_WRITERS is not set, the default planning should avoid excessive
writer parallelism for both partitioned and unpartitioned inserts.
This patch implements such a limit when using the cost-based planner. It
limits the number of writer fragments such that each writer fragment
writes at least 256MB of rows. This patch also allows CTAS (a kind of
DDL query) to be eligible for auto-scaling.
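As a rough illustration of the limit described above, here is a minimal Python sketch, not the actual planner code. The function name, its parameters, and the treatment of a MAX_FS_WRITERS value of 0 as "not set" are assumptions made for this example; only the 256MB-per-writer threshold comes from the commit message.

```python
# Hypothetical sketch of the writer limit; names are illustrative only.
MIN_WRITE_BYTES_PER_WRITER = 256 * 1024 * 1024  # 256MB, per the commit message


def limit_writer_fragments(planned_writers: int,
                           estimated_write_bytes: int,
                           max_fs_writers: int = 0) -> int:
    """Cap the writer count so each writer handles at least 256MB.

    planned_writers: parallelism the cost-based planner would otherwise pick.
    estimated_write_bytes: estimated total volume of rows to write.
    max_fs_writers: assumed stand-in for MAX_FS_WRITERS; 0 means "not set".
    """
    if planned_writers < 1:
        raise ValueError("planned_writers must be at least 1")
    # Floor division keeps every writer at or above the threshold;
    # a small insert still gets exactly one writer.
    by_volume = max(1, estimated_write_bytes // MIN_WRITE_BYTES_PER_WRITER)
    limit = min(planned_writers, by_volume)
    if max_fs_writers > 0:
        limit = min(limit, max_fs_writers)
    return limit
```

Under these assumptions, an insert estimated at 1GB with 16 planned writers would be reduced to 4 writers, and one estimated at 100MB would use a single writer.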
This patch also removes comments about NUM_SCANNER_THREADS added by
IMPALA-12029, since they no longer apply after IMPALA-12091.
Testing:
- Add test cases in test_query_cpu_count_divisor_default
- Add test_processing_cost_writer_limit in test_insert.py
- Pass test_insert.py::TestInsertHdfsWriterLimit
- Pass test_executor_groups.py
Change-Id: I289c6ffcd6d7b225179cc9fb2f926390325a27e0
Reviewed-on: http://gerrit.cloudera.org:8080/19880
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Query can be under parallelized in multi executor group set setup
> -----------------------------------------------------------------
>
> Key: IMPALA-12029
> URL: https://issues.apache.org/jira/browse/IMPALA-12029
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Affects Versions: Impala 4.3.0
> Reporter: Riza Suminto
> Assignee: Riza Suminto
> Priority: Critical
> Fix For: Impala 4.3.0
>
>
> In a multiple executor group set setup, the Frontend tries to match a query
> with the smallest executor group set that can fit the memory and CPU
> requirements of the compiled query. There are kinds of queries whose compiled
> plan fits any executor group set but does not necessarily deliver the best
> performance there. An example is Impala's COMPUTE STATS query. It does a
> full table scan and aggregates the stats, and has a fairly simple query plan
> shape, but can benefit from higher scan parallelism.
> The Planner needs to give additional feedback to the Frontend that the query
> might be under-parallelized in the current executor group. The Frontend can
> then judge whether to assign the compiled plan to the current executor group
> anyway, or step up to the next larger executor group and increase
> parallelism.
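
The selection behaviour described in the issue can be sketched as follows. This is a simplified Python illustration, not Impala's Frontend code: the GroupSet and CompiledPlan types, the under_parallelized flag, the ordering by CPU cores, and the fallback to the largest group set are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class GroupSet:  # hypothetical model of an executor group set
    name: str
    mem_limit_bytes: int
    cpu_cores: int


@dataclass(frozen=True)
class CompiledPlan:  # hypothetical summary of a compiled plan
    mem_required_bytes: int
    cpu_required: int
    under_parallelized: bool  # assumed form of the planner's feedback


def choose_group_set(group_sets: Sequence[GroupSet],
                     compile_for: Callable[[GroupSet], CompiledPlan]) -> GroupSet:
    """Pick the smallest group set that fits and is not under-parallelized."""
    ordered = sorted(group_sets, key=lambda g: (g.cpu_cores, g.mem_limit_bytes))
    for index, group_set in enumerate(ordered):
        plan = compile_for(group_set)
        fits = (plan.mem_required_bytes <= group_set.mem_limit_bytes
                and plan.cpu_required <= group_set.cpu_cores)
        # Assumption: the largest group set is accepted as a last resort.
        if index == len(ordered) - 1:
            return group_set
        if fits and not plan.under_parallelized:
            return group_set
        # Otherwise step up to the next larger group set and recompile.
    raise ValueError("no executor group sets configured")
```

In this sketch a COMPUTE STATS-like plan that fits the small group set but is flagged as under-parallelized moves on to the next larger set instead of being assigned immediately.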
--
This message was sent by Atlassian Jira
(v8.20.10#820010)