[
https://issues.apache.org/jira/browse/KYLIN-6025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17928302#comment-17928302
]
Guoliang Sun commented on KYLIN-6025:
-------------------------------------
h3. Dev Design
Before writing to the internal table, sort the data by the partition column
(`partition col`) so that rows belonging to the same partition are routed to
the same task wherever possible. This reduces the number of tasks that write
to any single partition.
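
To illustrate why sorting helps, here is a minimal simulation, not Kylin code. It
assumes each task receives a contiguous slice of the input and writes one output
per partition it touches; the function name `writers_per_partition` and the `dt`
column are hypothetical.

```python
from collections import defaultdict


def writers_per_partition(rows, num_tasks, sort_first):
    """Count how many tasks write to each partition.

    Assumption: tasks receive contiguous slices of the input, which
    stands in for however the engine actually distributes rows.
    """
    if sort_first:
        # Sort by the (hypothetical) partition column "dt" before splitting.
        rows = sorted(rows, key=lambda r: r["dt"])
    chunk = -(-len(rows) // num_tasks)  # ceiling division
    writers = defaultdict(set)
    for task_id in range(num_tasks):
        for row in rows[task_id * chunk:(task_id + 1) * chunk]:
            writers[row["dt"]].add(task_id)
    return {partition: len(tasks) for partition, tasks in writers.items()}


# Twelve rows whose partition values arrive interleaved.
days = ["2024-01-01", "2024-01-02", "2024-01-03"]
rows = [{"dt": d, "v": i} for i, d in enumerate(days * 4)]

unsorted_counts = writers_per_partition(rows, num_tasks=3, sort_first=False)
sorted_counts = writers_per_partition(rows, num_tasks=3, sort_first=True)
```

In this toy input every partition is written by all three tasks without the
sort, and by exactly one task with it. Real data will rarely split this cleanly,
which is why the design says "wherever possible".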
Add a configuration switch that controls whether this sort is performed:
`kylin.internal-table.sort-by-partition.enabled`, with a default value of
`true`. It can be set at both the system level and the project level.
Additionally, provide a table-level configuration, `sortByPartition`, which
takes precedence over the system-level and project-level settings. It can only
be set via the API, by specifying `tbl_properties` in the request when creating
or updating an internal table.
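
The precedence described above could be resolved roughly as follows. This is a
sketch under stated assumptions, not the actual implementation: the function
`resolve_sort_by_partition` and the dict-based inputs are hypothetical, and it
assumes a project-level value overrides the system-level one, which the comment
does not spell out.

```python
CONFIG_KEY = "kylin.internal-table.sort-by-partition.enabled"
TABLE_KEY = "sortByPartition"


def _as_bool(value):
    # Accept booleans or strings such as "true" / "FALSE".
    return str(value).strip().lower() == "true"


def resolve_sort_by_partition(system_conf, project_conf, tbl_properties):
    """Return the effective setting: table > project > system > default."""
    if TABLE_KEY in tbl_properties:
        return _as_bool(tbl_properties[TABLE_KEY])
    if CONFIG_KEY in project_conf:  # assumed to override the system level
        return _as_bool(project_conf[CONFIG_KEY])
    if CONFIG_KEY in system_conf:
        return _as_bool(system_conf[CONFIG_KEY])
    return True  # default value stated in the design


# Nothing configured anywhere: the default applies.
default_case = resolve_sort_by_partition({}, {}, {})

# The project disables sorting, but the table re-enables it.
table_wins = resolve_sort_by_partition(
    {CONFIG_KEY: "true"},
    {CONFIG_KEY: "false"},
    {TABLE_KEY: "true"},
)

# The project disables sorting and the table says nothing.
project_wins = resolve_sort_by_partition({CONFIG_KEY: "true"}, {CONFIG_KEY: "false"}, {})
```

Here `tbl_properties` stands for the key-value map passed in the create or
update request; its exact shape in the real API may differ.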
> Support file merging within partitions for internal tables
> ----------------------------------------------------------
>
> Key: KYLIN-6025
> URL: https://issues.apache.org/jira/browse/KYLIN-6025
> Project: Kylin
> Issue Type: New Feature
> Affects Versions: 5.0.0
> Reporter: Guoliang Sun
> Priority: Major
>
> When multiple tasks write to the same internal table partition during the
> build phase, the data is written into multiple subdirectories, which can
> easily lead to an excessive number of files and increase HDFS pressure. A
> reasonable merging mechanism is needed.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)