Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/10406 )
Change subject: KUDU-2437 Generate ScanToken from small chunks in tablet
......................................................................

Patch Set 4:

Hi Xu. Thanks for working on this! It will be really great to decouple the available parallelism from the number of partitions. I have a couple of concerns about the design in this patch, though.

1) Periodic reporting of the bounds to the master: given our eventual scalability goal of thousands of nodes, each with thousands of tablets, this may be too much information to keep in memory on the master and to report on a periodic basis. Especially considering that many tablets may be cold (bounds not changing), it's a lot of redundant processing. Additionally, because the bounds are kept only in memory (not persisted), shortly after a master restart or failover the chunk bounds will not be populated, and the query plans submitted by users may change dramatically.

2) Including the bounds in every GetTabletLocations call is also somewhat expensive from a size/performance perspective. In the case of writes or scans, the bounds aren't needed. It would be better to avoid the expense in these common cases and only compute the bounds for the specific use case you are targeting (generating splits for tasks).

3) From a security perspective, I'm not sure it's safe to include the bounds information in the GetTabletLocations RPC response. When we start adding more fine-grained authorization, it's likely that we will want to provide location information to administrators/tools (e.g. to facilitate balancing or other ops workflows), but those administrators may not have access to read the underlying data. The bounds themselves can likely expose user data (e.g. imagine that the primary key is an email address or a government ID number that should not be exposed).

I'd like to consider an alternate design in which we add a new RPC to the tablet server itself, something like:

  message SplitKeyRangeRequest {
    optional bytes tablet_id = 1;
    optional bytes start_pk = 2;
    optional bytes end_pk = 3;

    // Number of bytes to try to return in each chunk. This is a hint.
    // The tablet server may return chunks larger or smaller than this value.
    optional int64 target_chunk_size_bytes = 4;

    // The columns to consider when chunking.
    // If specified, then the size estimate used for 'target_chunk_size_bytes'
    // should only include these columns. This can be used if a query will
    // only scan a certain subset of the columns.
    repeated int32 col_ids = 5;
  }

(perhaps the 'col_ids' feature could be left out initially)

  message SplitKeyRangeResponse {
    repeated bytes split_pks = 1;
    repeated int64 size_estimates = 2;
  }

The advantages I see for this design are:

- no need to centralize the split metadata (so we can be more fine-grained without worrying about master load or memory)
- for the purposes of locality scheduling, we only need the partition locations, so the "planner" (Spark driver or Impala front-end) doesn't care about the sub-partition splits. The individual tasks can then contact their local tserver to further sub-divide the range (see the sketch below).
- on-demand computation of the splits means we will be able to do a better job of lazily opening cold tablets later.

What do you think?
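To make the task-side flow concrete, here is a rough sketch (in Java, since that's what the Spark integration would use) of how a task might turn a SplitKeyRangeResponse into per-chunk scan ranges. None of these types exist in the client today, and I'm assuming the tserver returns only the interior split points in sorted order, so N split keys yield N+1 chunks:

  import java.util.ArrayList;
  import java.util.List;

  // Illustrative only: hypothetical task-side helper for the proposed RPC.
  public class SplitKeyRangeExample {
    // Minimal stand-in for a primary-key range [startPk, endPk).
    static class KeyRange {
      final byte[] startPk;  // inclusive lower bound
      final byte[] endPk;    // exclusive upper bound
      KeyRange(byte[] startPk, byte[] endPk) {
        this.startPk = startPk;
        this.endPk = endPk;
      }
    }

    // Stitch the returned split points into contiguous ranges that
    // exactly cover the originally requested [startPk, endPk) range.
    static List<KeyRange> toRanges(byte[] startPk, byte[] endPk,
                                   List<byte[]> splitPks) {
      List<KeyRange> ranges = new ArrayList<>();
      byte[] prev = startPk;
      for (byte[] split : splitPks) {
        ranges.add(new KeyRange(prev, split));
        prev = split;
      }
      ranges.add(new KeyRange(prev, endPk));  // the tail chunk
      return ranges;
    }
  }

Each task would then open one scanner per KeyRange against its local replica, getting as much intra-tablet parallelism as it wants without involving the master at all.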
--
To view, visit http://gerrit.cloudera.org:8080/10406
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9ec4395919f4b54102e458ef5154334c08412e8a
Gerrit-Change-Number: 10406
Gerrit-PatchSet: 4
Gerrit-Owner: Xu Yao <ocla...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-Comment-Date: Thu, 17 May 2018 17:13:07 +0000
Gerrit-HasComments: No