yangz created KUDU-2670:
---------------------------
Summary: Splitting more tasks for spark job, and add more
concurrent for scan operation
Key: KUDU-2670
URL: https://issues.apache.org/jira/browse/KUDU-2670
Project: Kudu
Issue Type: Improvement
Components: java, spark
Affects Versions: 1.8.0
Reporter: yangz
Fix For: 1.8.0
This builds on KUDU-2437 (Split a tablet into primary key ranges by size).
We need a Java client implementation that supports splitting a tablet scan
operation.
We suggest two new features for the Java client.
# A ConcurrentKuduScanner that runs several scanners at the same time. This
is useful in one case in particular: we scan for only one row, but the
predicate does not contain the primary key. In this case we send a lot of
scanner requests but only one row is returned. Sending so many scanner requests
one by one is slow, so we need a concurrent way. In our test of this case on a
10G tablet, it saved a lot of time on one machine.
# A way to split a scan into more Spark tasks. To do so, we get scanner tokens
in two steps: first we ask the tserver for the key ranges, then we use these
ranges to build more scanner tokens. In our usage a tablet is 10G, but each
task processes only 1G of data, so we get better performance.
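The two ideas above can be sketched together in plain Java. This is only an
illustration under stated assumptions, not the proposed patch: SplitScanSketch,
KeyRange, splitBySize, scanChunk and scanConcurrently are hypothetical names,
none of them is part of the Kudu client API, byte offsets stand in for primary
key ranges, and scanChunk fakes a scan whose predicate matches a single row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: none of these types belong to the Kudu client API.
class SplitScanSketch {

    /** A half-open byte range [start, end) standing in for a primary key range. */
    static final class KeyRange {
        final long start;
        final long end;

        KeyRange(long start, long end) {
            this.start = start;
            this.end = end;
        }

        long size() {
            return end - start;
        }
    }

    /** Step 1: cut one tablet into chunks of at most maxChunkBytes each. */
    static List<KeyRange> splitBySize(long tabletBytes, long maxChunkBytes) {
        if (tabletBytes < 0 || maxChunkBytes <= 0) {
            throw new IllegalArgumentException("invalid tablet or chunk size");
        }
        List<KeyRange> ranges = new ArrayList<>();
        for (long start = 0; start < tabletBytes; start += maxChunkBytes) {
            ranges.add(new KeyRange(start, Math.min(start + maxChunkBytes, tabletBytes)));
        }
        return ranges;
    }

    /** Stand-in for scanning one chunk: counts rows matching a non-key predicate. */
    static long scanChunk(KeyRange range, long matchingOffset) {
        return (matchingOffset >= range.start && matchingOffset < range.end) ? 1L : 0L;
    }

    /** Step 2: scan all chunks in parallel instead of one by one. */
    static long scanConcurrently(List<KeyRange> ranges, long matchingOffset, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> pending = new ArrayList<>();
            for (KeyRange range : ranges) {
                Callable<Long> task = () -> scanChunk(range, matchingOffset);
                pending.add(pool.submit(task));
            }
            long matches = 0;
            for (Future<Long> result : pending) {
                matches += result.get(); // waits for that chunk's scan to finish
            }
            return matches;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("scan interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("chunk scan failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        long gib = 1L << 30; // assumption: "1G" is treated as 1 GiB here
        List<KeyRange> chunks = splitBySize(10 * gib, gib);
        long matches = scanConcurrently(chunks, 7 * gib + 42, 4);
        System.out.println(chunks.size() + " chunks, " + matches + " matching row(s)");
    }
}
```

With a 10G tablet and a 1G chunk size this gives 10 chunks. Each chunk could
become its own Spark task, or be scanned by its own thread on one machine.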
All of these features have run well for us for half a year. We hope they
will be useful for the community.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)