[
https://issues.apache.org/jira/browse/IMPALA-8081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17716828#comment-17716828
]
Riza Suminto commented on IMPALA-8081:
--------------------------------------
[~baggio000] I'm currently working on similar Jira like this one: IMPALA-12091.
However, it only applies on different parallelism mode
(COMPUTE_PROCESSING_COST=true).
> Avoid over-parallelizing queries when there are small input splits
> ------------------------------------------------------------------
>
> Key: IMPALA-8081
> URL: https://issues.apache.org/jira/browse/IMPALA-8081
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Reporter: Janaki Lahorani
> Assignee: Yida Wu
> Priority: Major
> Labels: multithreading
>
> Currently we maximise parallelism given the number of input splits available.
> This is often a good decision, unless there are very many small input splits,
> particularly small files. We could avoid this pathological behaviour by
> having a minimum threshold of input bytes per instance (this is still pretty
> crude, since file input bytes only correlates loosely with the amount of work
> required).
> An example:
> {noformat}
> [localhost.EXAMPLE.COM:21050] default> show files in functional.alltypes;
> Query: show files in functional.alltypes
> +-------------------------------------------------------------------------------+---------+--------------------+
> | Path
> | Size | Partition |
> +-------------------------------------------------------------------------------+---------+--------------------+
> |
> hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=1/090101.txt
> | 19.95KB | year=2009/month=1 |
> |
> hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=2/090201.txt
> | 18.12KB | year=2009/month=2 |
> |
> hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=3/090301.txt
> | 20.06KB | year=2009/month=3 |
> |
> hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=4/090401.txt
> | 19.61KB | year=2009/month=4 |
> |
> hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=5/090501.txt
> | 20.36KB | year=2009/month=5 |
> |
> hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=6/090601.txt
> | 19.71KB | year=2009/month=6 |
> |
> hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=7/090701.txt
> | 20.36KB | year=2009/month=7 |
> |
> hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=8/090801.txt
> | 20.36KB | year=2009/month=8 |
> |
> hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=9/090901.txt
> | 19.71KB | year=2009/month=9 |
> |
> hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=10/091001.txt
> | 20.36KB | year=2009/month=10 |
> |
> hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=11/091101.txt
> | 19.71KB | year=2009/month=11 |
> |
> hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2009/month=12/091201.txt
> | 20.36KB | year=2009/month=12 |
> |
> hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=1/100101.txt
> | 20.36KB | year=2010/month=1 |
> |
> hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=2/100201.txt
> | 18.39KB | year=2010/month=2 |
> |
> hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=3/100301.txt
> | 20.36KB | year=2010/month=3 |
> |
> hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=4/100401.txt
> | 19.71KB | year=2010/month=4 |
> |
> hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=5/100501.txt
> | 20.36KB | year=2010/month=5 |
> |
> hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=6/100601.txt
> | 19.71KB | year=2010/month=6 |
> |
> hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=7/100701.txt
> | 20.36KB | year=2010/month=7 |
> |
> hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=8/100801.txt
> | 20.36KB | year=2010/month=8 |
> |
> hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=9/100901.txt
> | 19.71KB | year=2010/month=9 |
> |
> hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=10/101001.txt
> | 20.36KB | year=2010/month=10 |
> |
> hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=11/101101.txt
> | 19.71KB | year=2010/month=11 |
> |
> hdfs://172.19.0.1:20500/test-warehouse/alltypes/year=2010/month=12/101201.txt
> | 20.36KB | year=2010/month=12 |
> +-------------------------------------------------------------------------------+---------+--------------------+
> [localhost:21000] default> set mt_dop=8; select count(*) from
> functional.alltypes; summary;
> MT_DOP set to 8
> Query: select count(*) from functional.alltypes
> Query submitted at: 2020-10-30 17:47:26 (Coordinator:
> http://tarmstrong-box:25000)
> Query progress can be monitored at:
> http://tarmstrong-box:25000/query_plan?query_id=d242927a09d39968:3ae0ecec00000000
> +----------+
> | count(*) |
> +----------+
> | 7300 |
> +----------+
> Fetched 1 row(s) in 0.19s
> +---------------------+--------+----------+----------+-------+------------+-----------+---------------+---------------------+
> | Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows |
> Peak Mem | Est. Peak Mem | Detail |
> +---------------------+--------+----------+----------+-------+------------+-----------+---------------+---------------------+
> | F01:ROOT | 1 | 242.85us | 242.85us | | | 0
> B | 0 B | |
> | 03:AGGREGATE | 1 | 2.51ms | 2.51ms | 1 | 1 |
> 16.00 KB | 10.00 MB | FINALIZE |
> | 02:EXCHANGE | 1 | 616.31us | 616.31us | 24 | 1 |
> 240.00 KB | 16.00 KB | UNPARTITIONED |
> | F00:EXCHANGE SENDER | 24 | 1.34ms | 2.66ms | | |
> 16.00 KB | 0 B | |
> | 01:AGGREGATE | 24 | 1.52ms | 2.13ms | 24 | 1 |
> 16.00 KB | 10.00 MB | |
> | 00:SCAN HDFS | 24 | 32.76ms | 35.75ms | 7.30K | 7.30K |
> 32.00 KB | 16.00 MB | functional.alltypes |
> +---------------------+--------+----------+----------+-------+------------+-----------+---------------+---------------------+
> {noformat}
> In this example, we create 8 instances per impala daemon to scan a tiny
> amount of data each. We would be better off, typically, in creating fewer
> instances to avoid the overhead.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]