[jira] [Created] (KUDU-2670) Splitting more tasks for spark job, and add more concurrent for scan operation

yangz (JIRA) Wed, 23 Jan 2019 19:40:44 -0800

yangz created KUDU-2670:
---------------------------

             Summary: Splitting more tasks for spark job, and add more 
concurrent for scan operation
                 Key: KUDU-2670
                 URL: https://issues.apache.org/jira/browse/KUDU-2670
             Project: Kudu
          Issue Type: Improvement
          Components: java, spark
    Affects Versions: 1.8.0
            Reporter: yangz
             Fix For: 1.8.0



This builds on KUDU-2437 (Split a tablet into primary key ranges by size).

We need a Java client implementation that supports splitting a tablet scan 
into smaller scan operations.

We propose two additions to the Java client:
 # A ConcurrentKuduScanner that runs several scanners at the same time. This 
is useful in one case in particular: the scan returns only one row, but the 
predicate does not include the primary key. In that case we send many scanner 
requests and get only one row back. Sending that many requests one by one is 
slow, so we need a concurrent approach. In our tests on a 10G tablet, this 
saved a lot of time on a single machine.
 # A way to split a scan into more Spark tasks. This requires getting scan 
tokens in two steps: first we ask the tserver for the key ranges, then we use 
those ranges to build more scan tokens. In our deployment each tablet is about 
10G, but each task processes only about 1G of data, which gives us better 
performance.
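
The two ideas above can be sketched roughly as follows. This is a minimal, 
self-contained illustration and not the proposed patch or the Kudu client API: 
KeyRange, RangeScanner, splitByBytes and scanConcurrently are hypothetical 
names, the primary key is assumed to be a plain number, and the split assumes 
data is spread evenly over the key space (in the proposal the tserver supplies 
the real ranges).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustration only: every name here is hypothetical, none is Kudu client API.
class SplitScanSketch {

    /** Half-open range [start, end) over a made-up numeric primary key. */
    static final class KeyRange {
        final long start;
        final long end;
        KeyRange(long start, long end) { this.start = start; this.end = end; }
    }

    /** Stand-in for whatever actually reads the rows of one key range. */
    interface RangeScanner {
        List<Long> scan(KeyRange range) throws Exception;
    }

    /**
     * Step 1 stand-in: cut [startKey, endKey) into pieces of roughly
     * splitSizeBytes each. Assumes data is spread evenly over the keys;
     * in the proposal the tserver supplies the real ranges.
     */
    static List<KeyRange> splitByBytes(long startKey, long endKey,
                                       long tabletBytes, long splitSizeBytes) {
        long chunks = Math.max(1, (tabletBytes + splitSizeBytes - 1) / splitSizeBytes);
        long keys = endKey - startKey;
        List<KeyRange> ranges = new ArrayList<>();
        for (long i = 0; i < chunks; i++) {
            long lo = startKey + keys * i / chunks;
            long hi = startKey + keys * (i + 1) / chunks;
            if (hi > lo) {
                ranges.add(new KeyRange(lo, hi));
            }
        }
        return ranges;
    }

    /** Step 2 stand-in: scan all ranges on a fixed pool and merge the rows. */
    static List<Long> scanConcurrently(List<KeyRange> ranges, RangeScanner scanner,
                                       int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<Long>>> futures = new ArrayList<>();
            for (KeyRange range : ranges) {
                Callable<List<Long>> task = () -> scanner.scan(range);
                futures.add(pool.submit(task));
            }
            List<Long> rows = new ArrayList<>();
            for (Future<List<Long>> f : futures) {
                rows.addAll(f.get()); // waits for each range, keeps range order
            }
            return rows;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        long gib = 1L << 30;
        List<KeyRange> ranges = splitByBytes(0, 1_000_000, 10 * gib, gib);
        // Predicate on a non-key value: every row must be visited to find a match.
        RangeScanner scanner = range -> {
            List<Long> hits = new ArrayList<>();
            for (long key = range.start; key < range.end; key++) {
                if (key * 7 % 1_000_003 == 42) {
                    hits.add(key);
                }
            }
            return hits;
        };
        List<Long> rows = scanConcurrently(ranges, scanner, 4);
        System.out.println(ranges.size() + " ranges, " + rows.size() + " matching row(s)");
    }
}
```

With a 10G tablet and a 1G split size, splitByBytes yields ten ranges. In a 
Spark job each range would presumably become its own task, while a single 
client could hand all of them to scanConcurrently instead of scanning them one 
after another.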

Both features have been running well for us for half a year. We hope they 
will be useful to the community.
 
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
