[
https://issues.apache.org/jira/browse/KYLIN-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Guangyuan Feng updated KYLIN-5571:
----------------------------------
Description:
When pushing down a query, KE tries to calculate the size of the data involved
in order to set the number of Spark partitions. If there are too many files on
HDFS, this calculation takes a long time to complete.
To improve this, the following changes will be made:
# Use a size-limited thread pool to calculate the data size
# Add a timeout to the calculation, so that the query can be stopped as soon as
possible
# Add new properties:
{color:#4c9aff}_kylin.query.pushdown.auto-set-shuffle-partitions-multiple=3_{color},
the default Spark partition number
{color:#4c9aff}_kylin.query.pushdown.auto-set-shuffle-partitions-timeout=30_{color},
the maximum time, 30 seconds by default, allowed for calculating the data size
used to adjust the Spark partition number
After these changes, we can expect the query to complete within a bounded
duration.
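The first two items above can be illustrated with a minimal Java sketch. It is
not the actual KE implementation: the class name BoundedSizeEstimator, the
methods totalSize and partitions, and the parameters poolSize, bytesPerPartition
and defaultPartitions are hypothetical names chosen for illustration. It only
shows the general technique, assuming each file-size lookup can be wrapped in a
Callable: run the lookups on a fixed-size pool, give the whole batch one
deadline, and fall back to a default partition number when the deadline passes.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class BoundedSizeEstimator {

    /**
     * Sums the sizes returned by the given tasks using at most poolSize threads.
     * Returns -1 if any task failed or the batch did not finish within timeoutSeconds.
     */
    public static long totalSize(List<Callable<Long>> sizeTasks, int poolSize, long timeoutSeconds) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            // invokeAll blocks until all tasks finish or the timeout expires;
            // tasks that have not completed by then are cancelled.
            List<Future<Long>> futures = pool.invokeAll(sizeTasks, timeoutSeconds, TimeUnit.SECONDS);
            long total = 0L;
            for (Future<Long> future : futures) {
                if (future.isCancelled()) {
                    return -1L; // timed out: the size is unknown
                }
                total += future.get();
            }
            return total;
        } catch (ExecutionException e) {
            return -1L; // a size lookup failed: treat the size as unknown
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return -1L;
        } finally {
            pool.shutdownNow(); // interrupt any worker that is still running
        }
    }

    /** Derives a partition count from the data size, or uses the default when the size is unknown. */
    public static int partitions(long totalBytes, long bytesPerPartition, int defaultPartitions) {
        if (totalBytes < 0) {
            return defaultPartitions;
        }
        long needed = (totalBytes + bytesPerPartition - 1) / bytesPerPartition; // ceiling division
        return (int) Math.min(Integer.MAX_VALUE, Math.max(1L, needed));
    }
}
```

In this sketch the timeoutSeconds argument plays the role of the 30-second
default described above, so the time spent on size calculation is bounded no
matter how many files there are.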
was:
During pushing down the query, KE will try to calculate the included data size
to set Spark partitions, but if there were too many files on HDFS, it will take
a lot of time to complete.
So in order to improve this situation, the following things will be done:
# Using a limited thread pool to calculate the data size
# Add timeout for the calculation, so as to stop the query as soon as possible
# Add new properties:
{_}kylin.query.pushdown.auto-set-shuffle-partitions-multiple=3{_},the default
Spark partition num
_getAutoShufflePartitionTimeOut=30,_ the maximum timeout, 30 seconds by
default, to calculate the data size in order to adjust the Spark partition num
After these changes, we can expected the query complete in a fixed duration.
> Calculating the data size when pushing down queries takes too much time,
> which makes the queries unstoppable.
> ----------------------------------------------------------------------------------------------------------------------------
>
> Key: KYLIN-5571
> URL: https://issues.apache.org/jira/browse/KYLIN-5571
> Project: Kylin
> Issue Type: Improvement
> Components: Query Engine
> Affects Versions: 5.0-alpha
> Reporter: Guangyuan Feng
> Assignee: Guangyuan Feng
> Priority: Major
> Fix For: 5.0-alpha
>
>
> When pushing down a query, KE tries to calculate the size of the data
> involved in order to set the number of Spark partitions. If there are too
> many files on HDFS, this calculation takes a long time to complete.
> To improve this, the following changes will be made:
> # Use a size-limited thread pool to calculate the data size
> # Add a timeout to the calculation, so that the query can be stopped as soon
> as possible
> # Add new properties:
> {color:#4c9aff}_kylin.query.pushdown.auto-set-shuffle-partitions-multiple=3_{color},
> the default Spark partition number
> {color:#4c9aff}_kylin.query.pushdown.auto-set-shuffle-partitions-timeout=30_{color},
> the maximum time, 30 seconds by default, allowed for calculating the data
> size used to adjust the Spark partition number
> After these changes, we can expect the query to complete within a bounded
> duration.