[
https://issues.apache.org/jira/browse/KUDU-2437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xu Yao updated KUDU-2437:
-------------------------
Description:
When reading data from a Kudu table using Spark, a tablet that holds a large
amount of data takes a long time to read. The reason is that KuduRDD generates
one scanToken per tablet, so a single Spark task has to process all the data
in that tablet.
We think that the TabletServer should provide an RPC interface that splits a
tablet into multiple primary key ranges by size. The kudu-client can then
choose, case by case, whether to scan those ranges in parallel.
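To illustrate the idea of splitting by size, here is a minimal sketch in
Python. It is not Kudu code and not the proposed RPC: the function name
split_key_ranges, the chunks input (first primary key and byte size of each
contiguous piece of a tablet, sorted by key) and the target_bytes parameter
are all assumptions made up for this example.

```python
from typing import List, Optional, Tuple

# A half-open primary key range [start, end); None means unbounded.
KeyRange = Tuple[Optional[str], Optional[str]]


def split_key_ranges(chunks: List[Tuple[str, int]],
                     target_bytes: int) -> List[KeyRange]:
    """Group key-sorted chunks into ranges of roughly target_bytes each.

    chunks is a hypothetical input: (first_primary_key, size_in_bytes)
    for each contiguous piece of the tablet, sorted by key.
    """
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")

    ranges: List[KeyRange] = []
    start: Optional[str] = None  # the first range is unbounded below
    acc = 0                      # bytes accumulated in the current range

    for first_key, size in chunks:
        # Close the current range at this chunk's first key if adding
        # the chunk would push the range past the target size.
        if acc > 0 and acc + size > target_bytes:
            ranges.append((start, first_key))
            start = first_key
            acc = 0
        acc += size

    # The last range is unbounded above, so the ranges cover the tablet.
    ranges.append((start, None))
    return ranges
```

Under these assumptions, each returned range could back its own scan, so one
large tablet would be read by several Spark tasks instead of one.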
RPC interface:
> Split a tablet into primary key ranges by size
> ----------------------------------------------
>
> Key: KUDU-2437
> URL: https://issues.apache.org/jira/browse/KUDU-2437
> Project: Kudu
> Issue Type: Improvement
> Components: client, tablet
> Reporter: Xu Yao
> Assignee: Xu Yao
> Priority: Major
>
> When reading data from a Kudu table using Spark, a tablet that holds a large
> amount of data takes a long time to read. The reason is that KuduRDD
> generates one scanToken per tablet, so a single Spark task has to process
> all the data in that tablet.
> We think that the TabletServer should provide an RPC interface that splits a
> tablet into multiple primary key ranges by size. The kudu-client can then
> choose, case by case, whether to scan those ranges in parallel.
> RPC interface:
>