[
https://issues.apache.org/jira/browse/KUDU-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900644#comment-16900644
]
Xu Yao commented on KUDU-2917:
------------------------------
Hmm, maybe we could also solve the long-tail problem by recording the original
size of the data in the CFile. KUDU-2917 can then be treated as a separate feature. :)
> Split a tablet into primary key ranges by number of rows
> --------------------------------------------------------
>
> Key: KUDU-2917
> URL: https://issues.apache.org/jira/browse/KUDU-2917
> Project: Kudu
> Issue Type: Improvement
> Reporter: Xu Yao
> Assignee: Xu Yao
> Priority: Major
>
> Since we implemented
> [KUDU-2437|https://issues.apache.org/jira/browse/KUDU-2437] and
> [KUDU-2670|https://issues.apache.org/jira/browse/KUDU-2670], a Spark job
> can read data inside a tablet in parallel. However, we found in practice
> that splitting key ranges by size can leave some Spark tasks as long tails.
> (Some tasks read more data than others even though the data size of each
> KeyRange is basically the same.)
> I think this issue is caused by column-wise encoding and compression.
> For example, suppose we store 1000 rows of data column-wise. If most of the
> values in these columns are the same, less storage space is required. If
> instead the values are mostly different, more storage is needed. So I think
> splitting the primary key range by the number of rows might be a good choice.
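
A toy sketch of the effect described above, not Kudu code. It assumes
run-length encoding as a stand-in for Kudu's real column encodings and uses
the number of runs as a rough proxy for on-disk size. The helper names
(rle_runs, rows_per_range_by_size) are hypothetical.

```python
def rle_runs(values):
    """Run-length encode a column: return a list of [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


def rows_per_range_by_size(runs, num_ranges):
    """Cut the encoded column into ranges of roughly equal encoded size
    (approximated here as the number of runs) and return the number of
    rows that falls into each range."""
    target = len(runs) / num_ranges
    counts, rows, size = [], 0, 0
    for _, length in runs:
        rows += length
        size += 1
        # Close the current range once it reaches the target encoded size,
        # leaving the final range to absorb whatever remains.
        if size >= target and len(counts) < num_ranges - 1:
            counts.append(rows)
            rows, size = 0, 0
    counts.append(rows)
    return counts


# 9000 rows sharing one value (encodes to a single run), followed by
# 1000 rows with distinct values (encodes to 1000 runs).
column = ["same"] * 9000 + [str(i) for i in range(1000)]
runs = rle_runs(column)

by_size = rows_per_range_by_size(runs, 2)
by_rows = [len(column) // 2, len(column) - len(column) // 2]

print("rows per range, split by encoded size:", by_size)
print("rows per range, split by row count:   ", by_rows)
```

In this toy model the two size-based ranges have nearly the same encoded
size, yet one of them holds almost all of the rows, which is the long-tail
task. Splitting by row count gives each task the same number of rows.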