[ https://issues.apache.org/jira/browse/KUDU-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900626#comment-16900626 ]

Adar Dembo commented on KUDU-2917:
----------------------------------

This was discussed briefly in the [original code 
review|https://gerrit.cloudera.org/c/10406/#message-ebc6a28c_55383bfd]:

{quote}
> I've got a high-level question about the kind of parallelism you're
> seeking.
> 
> If I'm reading the patch correctly, the tablet chunk size input is
> the encoded and compressed base data size of a particular query's
> projected columns. This is effectively chunking by the amount of IO
> the query is expected to need, ignoring deltas because it's hard to
> calculate their impact without reading them all from disk.
> 
> Why did you choose this particular input? How is it better for a
> client like Spark than, say, a chunking approach that produced
> equally sized result sets (which would need to consider
> decoded/decompressed data, plus perhaps input from 
> MemRowSet/DeltaMemStores, and maybe delta files too)?

Sorry for the late reply. I agree it would be better to use the 
decoded/decompressed data size, but computing it makes that chunking approach 
expensive. So I use the encoded/compressed data size for chunking instead.
{quote}
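To make the tradeoff in the review thread concrete, here is a minimal Python sketch (illustration only, not Kudu code; the rowset names, sizes, and the greedy grouping are hypothetical) of chunking by encoded/compressed on-disk size, and of why two chunks with equal on-disk size can hold very different row counts when compression ratios differ:

```python
# Illustration (not Kudu code): greedily group rowsets into chunks of
# roughly target_bytes of encoded/compressed base data each.

def chunk_by_size(rowsets, target_bytes):
    """Return lists of rowsets whose summed encoded size reaches target_bytes."""
    chunks, current, current_bytes = [], [], 0
    for rs in rowsets:
        current.append(rs)
        current_bytes += rs["encoded_bytes"]
        if current_bytes >= target_bytes:
            chunks.append(current)
            current, current_bytes = [], 0
    if current:
        chunks.append(current)
    return chunks

# Hypothetical rowsets: equal on-disk size, very different row counts,
# because one column's data compresses much better than the other's.
rowsets = [
    {"name": "rs1", "encoded_bytes": 64, "rows": 1000},  # compresses well
    {"name": "rs2", "encoded_bytes": 64, "rows": 100},   # compresses poorly
]
for chunk in chunk_by_size(rowsets, target_bytes=64):
    print([rs["name"] for rs in chunk], sum(rs["rows"] for rs in chunk))
# → ['rs1'] 1000
# → ['rs2'] 100
```

The two chunks look identical to a size-based splitter, yet the first produces 10x the rows, which is exactly the long-tail effect the issue below describes.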

> Split a tablet into primary key ranges by number of rows
> --------------------------------------------------------
>
>                 Key: KUDU-2917
>                 URL: https://issues.apache.org/jira/browse/KUDU-2917
>             Project: Kudu
>          Issue Type: Improvement
>            Reporter: Xu Yao
>            Assignee: Xu Yao
>            Priority: Major
>
> Since we implemented 
> [KUDU-2437|https://issues.apache.org/jira/browse/KUDU-2437] and 
> [KUDU-2670|https://issues.apache.org/jira/browse/KUDU-2670], the spark job 
> can read data inside the tablet in parallel. However, we found in actual use 
> that splitting the key range by size may cause long-tail reads in Spark 
> tasks: some tasks read far more rows even though the data size in each 
> KeyRange is roughly the same.
> I think this skew is caused by the column-wise encoding and compression, so 
> splitting the primary key range by the number of rows might be a better 
> choice.
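As a rough illustration of the proposal, a sketch of row-count-based splitting (a hypothetical helper, not the Kudu implementation): the range is cut into chunks of at most `rows_per_chunk` row ordinals, which a tablet server would then have to map back to primary-key bounds:

```python
# Illustration (not Kudu code): split a tablet's rows into chunks of at
# most rows_per_chunk rows each, by ordinal position. Mapping ordinals
# back to primary-key bounds is left to the (hypothetical) server side.

def split_by_rows(total_rows, rows_per_chunk):
    """Return half-open (start, end) ordinal ranges covering total_rows."""
    ranges = []
    start = 0
    while start < total_rows:
        end = min(start + rows_per_chunk, total_rows)
        ranges.append((start, end))
        start = end
    return ranges

print(split_by_rows(10_000, 3_000))
# → [(0, 3000), (3000, 6000), (6000, 9000), (9000, 10000)]
```

Each chunk carries the same number of rows regardless of how well the underlying columns compress, so tasks reading them should see similar amounts of work.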



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
