[
https://issues.apache.org/jira/browse/KUDU-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xu Yao updated KUDU-2917:
-------------------------
Description:
Since we implemented
[KUDU-2437|https://issues.apache.org/jira/browse/KUDU-2437] and
[KUDU-2670|https://issues.apache.org/jira/browse/KUDU-2670], Spark jobs can
read data inside a tablet in parallel. However, we found in actual use that
splitting the key range by data size can lead to long-tail Spark tasks: some
tasks read many more rows than others even though the on-disk size of each
KeyRange is roughly the same.
I think this is caused by columnar encoding and compression. For example,
suppose we store 1000 rows column-wise. If a column holds mostly repeated
values it needs very little storage, whereas if the values are mostly distinct
it needs much more. So the same number of bytes can cover very different
numbers of rows, and splitting the primary key range by number of rows might
be a better choice.
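To illustrate the point above, here is a minimal sketch (not Kudu code; plain run-length encoding stands in for Kudu's columnar encodings) showing how two column chunks of very different row counts can end up with similar encoded sizes, so size-based splits give tasks uneven amounts of work:

```python
# Hypothetical illustration: run-length encoding (RLE) as a stand-in for
# columnar encoding. Repeated values collapse into few runs; distinct
# values produce one run per row.

def rle_encode(values):
    """Run-length encode a list into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

# Chunk A: 1000 rows of mostly repeated values -> encodes to only 2 runs.
chunk_a = [42] * 990 + [7] * 10
# Chunk B: 100 rows of all-distinct values -> encodes to 100 runs.
chunk_b = list(range(100))

runs_a = rle_encode(chunk_a)
runs_b = rle_encode(chunk_b)
print(len(chunk_a), len(runs_a))  # 1000 rows, 2 runs
print(len(chunk_b), len(runs_b))  # 100 rows, 100 runs
```

Chunk B encodes to far more runs than chunk A despite holding 10x fewer rows, so two KeyRanges of equal encoded size can contain wildly different row counts.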
was:
Since we implemented
[KUDU-2437|https://issues.apache.org/jira/browse/KUDU-2437] and
[KUDU-2670|https://issues.apache.org/jira/browse/KUDU-2670], Spark jobs can
read data inside a tablet in parallel. However, we found in actual use that
splitting the key range by data size can lead to long-tail Spark tasks: some
tasks read many more rows than others even though the on-disk size of each
KeyRange is roughly the same.
I think this is caused by columnar encoding and compression, so splitting the
primary key range by number of rows might be a better choice.
> Split a tablet into primary key ranges by number of rows
> --------------------------------------------------------
>
> Key: KUDU-2917
> URL: https://issues.apache.org/jira/browse/KUDU-2917
> Project: Kudu
> Issue Type: Improvement
> Reporter: Xu Yao
> Assignee: Xu Yao
> Priority: Major
>
> Since we implemented
> [KUDU-2437|https://issues.apache.org/jira/browse/KUDU-2437] and
> [KUDU-2670|https://issues.apache.org/jira/browse/KUDU-2670], Spark jobs can
> read data inside a tablet in parallel. However, we found in actual use that
> splitting the key range by data size can lead to long-tail Spark tasks: some
> tasks read many more rows than others even though the on-disk size of each
> KeyRange is roughly the same.
> I think this is caused by columnar encoding and compression. For example,
> suppose we store 1000 rows column-wise. If a column holds mostly repeated
> values it needs very little storage, whereas if the values are mostly
> distinct it needs much more. So the same number of bytes can cover very
> different numbers of rows, and splitting the primary key range by number of
> rows might be a better choice.
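The proposed alternative can be sketched as follows. This is a hypothetical illustration, not Kudu's implementation: given a sorted sequence of primary keys, emit split points every N rows so each scan range covers the same row count regardless of its encoded size (the function name and signature are invented for this sketch):

```python
# Hypothetical sketch: split a sorted key sequence into ranges that each
# cover the same number of rows, so each parallel task does comparable work.

def split_by_row_count(sorted_keys, rows_per_range):
    """Return (start_key, end_key_exclusive) ranges of rows_per_range rows.

    The final range uses None as its end key, meaning "unbounded".
    """
    ranges = []
    for i in range(0, len(sorted_keys), rows_per_range):
        start = sorted_keys[i]
        end_idx = i + rows_per_range
        end = sorted_keys[end_idx] if end_idx < len(sorted_keys) else None
        ranges.append((start, end))
    return ranges

keys = list(range(1000))
print(split_by_row_count(keys, 250))
# [(0, 250), (250, 500), (500, 750), (750, None)]
```

Each range covers exactly 250 rows, whereas size-based splitting over the same keys could yield ranges of 250 bytes but wildly different row counts.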
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)