[jira] [Updated] (KYLIN-5571) It takes too much time to calculate the data size during pushing down queries, which will lead to the queries un-stoppable.

Guangyuan Feng (Jira) Thu, 08 Jun 2023 01:35:06 -0700


     [ 
https://issues.apache.org/jira/browse/KYLIN-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Guangyuan Feng updated KYLIN-5571:
----------------------------------
    Description: 
When a query is pushed down, KE tries to calculate the size of the data 
involved in order to set the number of Spark partitions. If there are too many 
files on HDFS, this calculation takes a long time to complete.

To improve this, the following changes will be made:
 # Use a limited thread pool to calculate the data size
 # Add a timeout for the calculation, so that the query can be stopped as soon 
as possible
 # Add new properties:
{color:#4c9aff}_kylin.query.pushdown.auto-set-shuffle-partitions-multiple=3_{color}, 
the default Spark partition number
{color:#4c9aff}_kylin.query.pushdown.auto-set-shuffle-partitions-timeout=30_{color}, 
the maximum time, 30 seconds by default, allowed for calculating the data size 
used to adjust the Spark partition number

After these changes, we can expect the query to complete within a bounded time.
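
For illustration only, the first two items above (a limited thread pool plus a 
timeout) could look roughly like the sketch below. This is not the actual patch: 
the class name DataSizeEstimator, the method totalSize, the sizeOf callback and 
the -1 return value are all hypothetical, and a real implementation would query 
HDFS instead of calling a plain function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.ToLongFunction;

// Hypothetical helper, not taken from the Kylin code base.
class DataSizeEstimator {

    /**
     * Sums the size of every path using at most `threads` worker threads.
     * Returns -1 if the work does not finish within `timeoutSeconds`
     * or if any size lookup fails, so the caller can fall back to a default.
     */
    static long totalSize(List<String> paths, ToLongFunction<String> sizeOf,
                          int threads, long timeoutSeconds) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (String path : paths) {
                tasks.add(() -> sizeOf.applyAsLong(path));
            }
            // invokeAll with a timeout cancels every task that has not
            // completed when the deadline passes.
            List<Future<Long>> results =
                    pool.invokeAll(tasks, timeoutSeconds, TimeUnit.SECONDS);
            long total = 0L;
            for (Future<Long> result : results) {
                if (result.isCancelled()) {
                    return -1L; // deadline reached before this task finished
                }
                total += result.get();
            }
            return total;
        } catch (ExecutionException e) {
            return -1L; // a size lookup threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return -1L;
        } finally {
            pool.shutdownNow(); // never leave worker threads behind
        }
    }
}
```

A caller would treat -1 as "size unknown" and use the default partition number 
instead of waiting, which is what keeps the overall query time bounded.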

  was:
When a query is pushed down, KE tries to calculate the size of the data 
involved in order to set the number of Spark partitions. If there are too many 
files on HDFS, this calculation takes a long time to complete.

To improve this, the following changes will be made:
 # Use a limited thread pool to calculate the data size
 # Add a timeout for the calculation, so that the query can be stopped as soon 
as possible
 # Add new properties:
{_}kylin.query.pushdown.auto-set-shuffle-partitions-multiple=3{_}, the default 
Spark partition number
_getAutoShufflePartitionTimeOut=30_, the maximum time, 30 seconds by default, 
allowed for calculating the data size used to adjust the Spark partition number

After these changes, we can expect the query to complete within a bounded time.


> It takes too much time to calculate the data size during pushing down 
> queries, which will lead to the queries un-stoppable. 
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KYLIN-5571
>                 URL: https://issues.apache.org/jira/browse/KYLIN-5571
>             Project: Kylin
>          Issue Type: Improvement
>          Components: Query Engine
>    Affects Versions: 5.0-alpha
>            Reporter: Guangyuan Feng
>            Assignee: Guangyuan Feng
>            Priority: Major
>             Fix For: 5.0-alpha
>
>
> When a query is pushed down, KE tries to calculate the size of the data 
> involved in order to set the number of Spark partitions. If there are too 
> many files on HDFS, this calculation takes a long time to complete.
> To improve this, the following changes will be made:
>  # Use a limited thread pool to calculate the data size
>  # Add a timeout for the calculation, so that the query can be stopped as 
> soon as possible
>  # Add new properties:
> {color:#4c9aff}_kylin.query.pushdown.auto-set-shuffle-partitions-multiple=3_{color}, 
> the default Spark partition number
> {color:#4c9aff}_kylin.query.pushdown.auto-set-shuffle-partitions-timeout=30_{color}, 
> the maximum time, 30 seconds by default, allowed for calculating the data 
> size used to adjust the Spark partition number
> After these changes, we can expect the query to complete within a bounded time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
