[kudu-CR](gh-pages) Blogpost describing index skip scan optimization.

Andrew Wong (Code Review) Tue, 04 Sep 2018 12:42:27 -0700

Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11263 )


Change subject: Blogpost describing index skip scan optimization.
......................................................................


Patch Set 5:

(14 comments)

Hrm, I'm not sure why it's not rendering on GitHub for you. Maybe post a 
screenshot of the rendered Jekyll page? That'd be helpful too.

http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md
File _posts/2018-08-17-index-skip-scan-optimization-in-kudu.md:

http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@73
PS4, Line 73:
            : Based on our experiments, on up to 10 million rows per tablet (as 
shown below), we found that the skip scan performa
> Added explanation about how we came to using this simple heuristic. Yes, it
I think it's the number of rows in the CFileSet, which I think is also the 
number of rows in the B-tree, but it isn't equal to the number of rows in the 
table (since a table spans multiple tablets).


http://gerrit.cloudera.org:8080/#/c/11263/5/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md
File _posts/2018-08-17-index-skip-scan-optimization-in-kudu.md:

http://gerrit.cloudera.org:8080/#/c/11263/5/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@13
PS5, Line 13: Example
nit: probably don't need this


http://gerrit.cloudera.org:8080/#/c/11263/5/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@38
PS5, Line 38: first key column(s)
            : (`tstamp` and/or `clusterid`)? In this case, since the column 
value might be present anywhere in the index structure,
            : the current query execution plan does not use the index.
Let's stick with a single concrete example, say `tstamp`. Then we can point to 
the example above:
"In the above case, the `tstamp` values are sorted with respect to `host`, but 
are not globally sorted, and as such, it's non-trivial to use the index to 
filter rows."
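
To make the point about ordering concrete, here is a minimal sketch. It 
assumes a plain sorted Python list stands in for the primary key index on 
(`host`, `tstamp`); the row values are made up for illustration and nothing 
here is Kudu code.

```python
# Hypothetical rows keyed on (host, tstamp); the values are illustrative only.
rows = [
    ("westeros", 100), ("helium", 300), ("ubuntu", 200),
    ("helium", 100), ("ubuntu", 100), ("westeros", 300),
]

# A composite primary key orders rows by host first, then by tstamp.
index = sorted(rows)

# Reading the tstamp column in index order does not give a sorted sequence.
tstamps = [t for _, t in index]
globally_sorted = tstamps == sorted(tstamps)

# Within each host, however, the tstamp values are in order.
per_host = {}
for host, t in index:
    per_host.setdefault(host, []).append(t)
sorted_within_host = all(ts == sorted(ts) for ts in per_host.values())

print(globally_sorted, sorted_within_host)
```

A binary search on `tstamp` alone over `index` would therefore be incorrect, 
which is why the predicate can't be pushed into the index directly.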


http://gerrit.cloudera.org:8080/#/c/11263/5/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@40
PS5, Line 40:  by default
nit: probably don't need this


http://gerrit.cloudera.org:8080/#/c/11263/5/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@41
PS5, Line 41: To optimize this scan time, a possible solution is to build 
secondary index on the required key column (although, it might be
            : redundant to build secondary index on composite key column).
            : However, we do not consider this solution as Kudu does not 
support secondary indexes yet.
            :
nit: I think this would read better after L45. E.g.

Other databases may optimize such scans by building secondary indexes (though 
it might be redundant to build one on one of the primary key columns). However, 
this isn't an option for Kudu, given its lack of secondary index support. The 
question is, can Kudu do better than a full table scan here?


http://gerrit.cloudera.org:8080/#/c/11263/5/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@47
PS5, Line 47: column(s)
nit: since this is a concrete example, we know there is only one column before 
`tstamp`


http://gerrit.cloudera.org:8080/#/c/11263/5/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@50
PS5, Line 50: seek to the rows containing distinct prefix keys
            : and satisfying the query predicate on the `tstamp` column.
nit: reword as "to **skip** to the rows that have distinct prefix keys, and 
also satisfy the predicate on the `tstamp` column."


http://gerrit.cloudera.org:8080/#/c/11263/5/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@61
PS5, Line 61:  query server
nit: "Kudu tablet" or "tablet server" or "Kudu"


http://gerrit.cloudera.org:8080/#/c/11263/5/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@61
PS5, Line 61: **scan** all rows for which `host` = `helium` and `tstamp` = 100 
and consequently,
            : **skip** all the rows for which host = `helium` and `tstamp` != 
100
            : (holds true for all distinct keys of `host` such as `ubuntu`, 
`westeros`).
Maybe reverse the order of **skip** and **scan**, since the name is "skip 
scan"? Also, isn't the actual order to skip to a distinct prefix that may 
match a predicate, and then scan through rows until we know that the rows won't 
match the predicate within this prefix key?
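
The order I have in mind can be sketched roughly as follows. This is only an 
illustration, assuming a sorted in-memory list of (`host`, `tstamp`) tuples 
stands in for the index; `skip_scan` and the sample rows are hypothetical 
names and data, not the actual implementation in the patch.

```python
from bisect import bisect_left


def skip_scan(index, tstamp):
    """Return rows of a sorted (host, tstamp) list whose tstamp equals `tstamp`.

    Illustrative only: a sorted list stands in for the primary key index,
    and bisect_left stands in for an index seek.
    """
    matches = []
    pos = 0
    while pos < len(index):
        host = index[pos][0]  # current distinct prefix key
        # Skip: seek to the first row >= (host, tstamp) for this prefix.
        pos = bisect_left(index, (host, tstamp), pos)
        # Scan: collect rows until the predicate stops matching in this prefix.
        while pos < len(index) and index[pos] == (host, tstamp):
            matches.append(index[pos])
            pos += 1
        # Skip: seek past the rest of this prefix to the next distinct host.
        pos = bisect_left(index, (host + "\0",), pos)
    return matches


# Hypothetical sample data, already in primary key order.
index = [
    ("helium", 100), ("helium", 300),
    ("ubuntu", 100), ("ubuntu", 200),
    ("westeros", 100), ("westeros", 300),
]
print(skip_scan(index, 100))
```

Each loop iteration costs two seeks per distinct `host` plus the matching 
rows, which is why the approach pays off only when the prefix column has few 
distinct values.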


http://gerrit.cloudera.org:8080/#/c/11263/5/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@70
PS5, Line 70: Lower the prefix column cardinality, better the skip scan 
performance
nit: add "the" in front of "Lower" and "better"


http://gerrit.cloudera.org:8080/#/c/11263/5/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@71
PS5, Line 71: skip scan is not a viable approach.
I seem to recall a plot that showed the performance without the dynamic 
disabling functionality. Do you still have that around? I think that would be 
interesting to put up since it exemplifies this quite well. If not, that's fine.


http://gerrit.cloudera.org:8080/#/c/11263/5/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@77
PS5, Line 77: seeks
nit: skips? for consistency with the "skip" and "scan" terminology


http://gerrit.cloudera.org:8080/#/c/11263/5/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@77
PS5, Line 77: prefix column(s)
nit: I think it's clear enough that this may refer to multiple, so maybe just 
"prefix columns' cardinality"?


http://gerrit.cloudera.org:8080/#/c/11263/5/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@89
PS5, Line 89: 1(`host`)
nit: one (`host`)



--
To view, visit http://gerrit.cloudera.org:8080/11263
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: gh-pages
Gerrit-MessageType: comment
Gerrit-Change-Id: I2250652dcba3d1b0a06f1ffb7f23c11bf533d35e
Gerrit-Change-Number: 11263
Gerrit-PatchSet: 5
Gerrit-Owner: Anupama Gupta <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Anupama Gupta <[email protected]>
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Comment-Date: Tue, 04 Sep 2018 19:42:10 +0000
Gerrit-HasComments: Yes
