Todd Lipcon has posted comments on this change. ( )

Change subject: KUDU-2437 Generate ScanToken from small chunks in tablet

Patch Set 4:

Hi Xu. Thanks for working on this! It will be really great to decouple the 
available parallelism from the number of partitions.

I have a couple concerns about the design in this patch, though.

1) periodic reporting of the bounds to the master: for our eventual scalability 
goals of thousands of nodes, each with thousands of tablets, I think this may 
be too much information to try to keep in memory on the master and to report on 
a periodic basis. Especially considering that many tablets may be cold (bounds 
not changing), it's a lot of redundant processing. Additionally, with it kept 
only in memory (not persisted) this means that shortly following a restart or 
failover, the master will not have the chunk bounds populated and the query 
plans submitted by users may change dramatically.

2) including the bounds in every GetTabletLocations call is also somewhat 
expensive from a size/performance perspective. In the case of writes or scans, 
they aren't needed. It would be better to avoid the expense for these common 
cases and only worry about the bounds for the specific use case you are 
targeting (generating splits for tasks)

3) from a security perspective, I'm not sure it's safe to include the bounds 
information in the 'getTabletLocations' RPC response. When we start adding more 
fine-grained authorization it's likely that we will want to provide location 
information to administrators/tools (eg to facilitate balancing or other ops 
workflows) but those administrators may not have access to read the underlying 
data. The bounds themselves can likely expose user data (eg imagine that the 
primary key is an email address or government ID number that should not be 

I'd like to consider an alternate design in which we add a new RPC to the 
tablet server itself, something like:

message SplitKeyRangeRequest {
  bytes tablet_id
  bytes start_pk
  bytes end_pk

  // Number of bytes to try to return in each chunk. This is a hint.
  // The tablet server may return chunks larger or smaller than this value.
  int64 target_chunk_size_bytes

  // The columns to consider when chunking.
  // If specified, then the size estimate used for 'target_chunk_size_bytes'
  // should only include these columns. This can be used if a query will
  // only scan a certain subset of the columns.
  repeated int32 col_ids

(perhaps the 'col_ids' feature could be left out initially)

message SplitKeyRangeResponse {
  repeated bytes split_pks
  repeated int64 size_estimates

The advantages I see for this design are:
- no need to centralize the split metadata (so we can be more fine-grained 
without worrying about master load or memory)
- for the purposes of locality scheduling, we only need the partition 
locations, so the "planner" (spark driver or impala front-end) doesn't care 
about the sub-partition splits. The individual tasks can then contact their 
local tserver to further sub-divide the range
- on-demand computation of the splits means we will be able to do a better job 
of lazily opening cold tablets later.

What do you think?

To view, visit
To unsubscribe, visit

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9ec4395919f4b54102e458ef5154334c08412e8a
Gerrit-Change-Number: 10406
Gerrit-PatchSet: 4
Gerrit-Owner: Xu Yao <>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <>
Gerrit-Comment-Date: Thu, 17 May 2018 17:13:07 +0000
Gerrit-HasComments: No

Reply via email to