[kudu-CR](gh-pages) Blogpost describing index skip scan optimization.

Alexey Serbin (Code Review) Wed, 29 Aug 2018 15:39:38 -0700

Alexey Serbin has posted comments on this change.
( http://gerrit.cloudera.org:8080/11263 )


Change subject: Blogpost describing index skip scan optimization.
......................................................................


Patch Set 4:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md
File _posts/2018-08-17-index-skip-scan-optimization-in-kudu.md:

http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@31
PS4, Line 31: B-Tree
Maybe, add a reference (like https://en.wikipedia.org/wiki/B-tree) in-line or 
in a separate 'References' section?


http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@32
PS4, Line 32: The data is sorted lexicographically starting from the leftmost 
primary key column and stored in the B-Tree leaf nodes.
            : Therefore, when the user query contains the first key column 
("host"), Kudu uses the primary key range push down
            : operation to optimize the scan time.
> IMO this doesn't convey the idea that the data is sorted by the composite o
+1 all points mentioned by Andrew here.


http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@40
PS4, Line 40: (since the primary key index is sorted on the basis of the first 
key column)
I'm not sure this clearly explains why a full table scan is needed.  Could you 
update this to explain why the primary key index alone cannot be used to 
locate the desired rows directly?
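
To illustrate the kind of explanation I have in mind, here is a minimal Python 
sketch. It is not Kudu code: the rows, the two-column (host, tstamp) key, and 
the function names are hypothetical. With keys sorted by the composite key, a 
predicate on the first column maps to one contiguous range, while a predicate 
on the second column alone can match rows scattered across the whole key 
space, so every row has to be examined.

```python
from bisect import bisect_left

# Hypothetical rows keyed by (host, tstamp), kept in composite-key order.
rows = sorted([
    ("helium", 100), ("helium", 200),
    ("argon", 100), ("argon", 300),
    ("neon", 100), ("neon", 250),
])

def lookup_by_host(rows, host):
    """Predicate on the first key column: two binary searches bound the range."""
    lo = bisect_left(rows, (host,))            # first key with this host
    hi = bisect_left(rows, (host + "\x00",))   # first key with a larger host
    return rows[lo:hi]

def lookup_by_tstamp(rows, tstamp):
    """Predicate on the second key column only: matches are not contiguous,
    so without further optimization every row is examined."""
    return [row for row in rows if row[1] == tstamp]
```

Here lookup_by_host touches only the matching range, whereas lookup_by_tstamp 
walks all rows because the tstamp = 100 entries sit under different hosts.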


http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@45
PS4, Line 45: The answer is yes
In general, I think the index skip scan optimization is not the only answer.  
In other databases it's possible to build secondary indices, and that might 
work even better (of course, it depends on the read/write ratio for the use 
case and the availability of space to build an additional index).

I think it's worth mentioning that building a secondary index is not an option 
here since Kudu does not support secondary indices yet.


http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@53
PS4, Line 53: select clusterid from metrics where tstamp = 100
nit: maybe, to be in sync with the CREATE TABLE statement above, write SQL 
keywords in capital letters.


http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@62
PS4, Line 62: popularly known as index skip scan optimization can skip all the 
rows for which host = "helium" and tstamp != 100
> nit: it's great to get to the point that we call this a "skip scan". To dri
Maybe it's worth mentioning 'skip scan' earlier, where you give a short 
overview of the idea behind the skip scan optimization.  Also, to back up the 
'popularity' of the term, I think it might be useful to add a separate section 
with references to various databases that implement this optimization (e.g., 
one of those links might be 
https://oracle-base.com/articles/9i/index-skip-scanning).
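
For the early overview, something along the lines of the following sketch 
might help convey the idea. This is an illustration in Python under my own 
assumptions (hypothetical rows keyed by (host, tstamp), hypothetical function 
name), not the actual Kudu implementation: for each distinct host, seek 
straight to (host, tstamp), then seek past the remaining rows of that host.

```python
from bisect import bisect_left

def skip_scan(rows, tstamp):
    """rows must be sorted by the composite key (host, tstamp).

    Returns the matching rows and the number of seeks performed.
    """
    matches, seeks, pos = [], 0, 0
    while pos < len(rows):
        host = rows[pos][0]
        # Seek to the first key >= (host, tstamp), skipping smaller tstamps.
        pos = bisect_left(rows, (host, tstamp), pos)
        seeks += 1
        if pos < len(rows) and rows[pos] == (host, tstamp):
            matches.append(rows[pos])
            pos += 1
        # Seek to the next distinct host, skipping the rest of this one.
        pos = bisect_left(rows, (host + "\x00",), pos)
        seeks += 1
    return matches, seeks

rows = sorted([
    ("helium", 100), ("helium", 200),
    ("argon", 100), ("argon", 300),
    ("neon", 100), ("neon", 250),
])
```

The number of seeks grows with the number of distinct host values, which is 
why the prefix column cardinality matters for this optimization.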


http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@73
PS4, Line 73: Based on experiments on upto 10 million rows per tablet, we 
decided to disable skip scan when the number of seeks
            : for distinct prefix column values exceeds 
![](https://latex.codecogs.com/gif.latex?%5Csqrt%7B%5C%23total%20rows%7D).
> This could use some explanation as to why sqrt(total_num_rows) was chosen.
Yep, it would be nice to add some details about the data and reasoning behind 
the choice of this disable-skip-scan criterion.

1) As for those experiments, did they use the table schema and query pattern 
mentioned above?  Or did they involve some other table schemas and query 
patterns?
2) What was the rationale at the conceptual level for choosing that sqrt() 
metric?
3) If there were multiple candidate criteria to choose from, maybe it's worth 
mentioning those as well?
4) If 3 is true, was the sqrt() criterion a clear winner, or was there some 
fuzziness and sqrt() was chosen also because it looks simpler compared to the 
others?
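
For reference, this is how I read the criterion as stated in the post, 
sketched in Python. The function and parameter names are hypothetical and 
this is not the actual Kudu code; please correct me if the intended rule is 
different.

```python
import math

def skip_scan_enabled(num_prefix_seeks, total_rows):
    """Skip scan stays enabled while the number of seeks for distinct
    prefix column values does not exceed sqrt(total rows)."""
    return num_prefix_seeks <= math.sqrt(total_rows)
```

With 10 million rows per tablet, sqrt(total rows) is roughly 3162, so under 
this reading skip scan would be disabled once more than about 3162 seeks are 
needed.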


http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@78
PS4, Line 78: The performance graph of this approach is shown below
This is for the schema and query pattern mentioned earlier, right?  Maybe, it's 
worth mentioning that.



-- 
To view, visit http://gerrit.cloudera.org:8080/11263
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: gh-pages
Gerrit-MessageType: comment
Gerrit-Change-Id: I2250652dcba3d1b0a06f1ffb7f23c11bf533d35e
Gerrit-Change-Number: 11263
Gerrit-PatchSet: 4
Gerrit-Owner: Anupama Gupta <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Comment-Date: Wed, 29 Aug 2018 22:39:17 +0000
Gerrit-HasComments: Yes
