[
https://issues.apache.org/jira/browse/KUDU-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900626#comment-16900626
]
Adar Dembo commented on KUDU-2917:
----------------------------------
This was discussed briefly in the [original code
review|https://gerrit.cloudera.org/c/10406/#message-ebc6a28c_55383bfd]:
{quote}
> I've got a high-level question about the kind of parallelism you're
> seeking.
>
> If I'm reading the patch correctly, the tablet chunk size input is
> the encoded and compressed base data size of a particular query's
> projected columns. This is effectively chunking by the amount of IO
> the query is expected to need, ignoring deltas because it's hard to
> calculate their impact without reading them all from disk.
>
> Why did you choose this particular input? How is it better for a
> client like Spark than, say, a chunking approach that produced
> equally sized result sets (which would need to consider
> decoded/decompressed data, plus perhaps input from
> MemRowSet/DeltaMemStores, and maybe delta files too)?
Sorry for the late reply. I agree that the decoded/decompressed data size
would be a better input, but computing it is expensive. That is why I chunk
by the encoded/compressed data size instead.
{quote}
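To illustrate the trade-off discussed above, here is a small, hypothetical sketch (not Kudu code) of why chunking by encoded/compressed size can yield very uneven row counts: for a fixed on-disk byte budget, a highly compressible column covers far more rows than a high-entropy one. The fixed 8-byte row width and the two synthetic columns are assumptions for illustration only.

```python
import os
import zlib

ROW_BYTES = 8  # assumed fixed-width 8-byte value per row (illustrative)
rows = 50_000

# Two columns with very different compressibility.
compressible = b"\x00" * (rows * ROW_BYTES)    # e.g. a mostly-constant column
incompressible = os.urandom(rows * ROW_BYTES)  # e.g. a high-entropy column

size_a = len(zlib.compress(compressible))
size_b = len(zlib.compress(incompressible))

# For the same compressed chunk budget, the compressible column packs far
# more rows per byte, so equal-sized key ranges hold very different row
# counts -- the long tail described in this issue.
rows_per_byte_a = rows / size_a
rows_per_byte_b = rows / size_b
print(rows_per_byte_a / rows_per_byte_b)  # much greater than 1
```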
> Split a tablet into primary key ranges by number of row
> -------------------------------------------------------
>
> Key: KUDU-2917
> URL: https://issues.apache.org/jira/browse/KUDU-2917
> Project: Kudu
> Issue Type: Improvement
> Reporter: Xu Yao
> Assignee: Xu Yao
> Priority: Major
>
> Since we implemented
> [KUDU-2437|https://issues.apache.org/jira/browse/KUDU-2437] and
> [KUDU-2670|https://issues.apache.org/jira/browse/KUDU-2670], the spark job
> can read data inside a tablet in parallel. However, we found in practice
> that splitting the key range by size can leave some Spark tasks as long
> tails: even when the encoded data size of each KeyRange is roughly the
> same, some tasks end up reading far more rows than others.
> I think this issue is caused by column-wise encoding and compression, so
> splitting the primary key range by the number of rows might be a better
> choice.
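As a rough illustration of the proposal, the following hypothetical sketch (not the Kudu implementation) splits an ordered list of primary keys into ranges of approximately equal row count; the function name and the integer stand-in keys are assumptions for this example.

```python
def split_by_row_count(sorted_keys, rows_per_chunk):
    """Return inclusive (start_key, end_key) ranges over sorted_keys,
    each covering about rows_per_chunk rows."""
    ranges = []
    for i in range(0, len(sorted_keys), rows_per_chunk):
        chunk = sorted_keys[i:i + rows_per_chunk]
        ranges.append((chunk[0], chunk[-1]))
    return ranges

keys = list(range(100))  # stand-in for encoded primary keys
print(split_by_row_count(keys, 30))
# four ranges covering 30, 30, 30, and 10 rows
```

Each resulting range holds a near-equal number of rows regardless of how well the columns compress, which is the property that avoids the long-tail tasks described above.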
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)