Anupama Gupta has posted comments on this change. ( http://gerrit.cloudera.org:8080/11263 )
Change subject: Blogpost describing index skip scan optimization. ...................................................................... Patch Set 5: (22 comments) Please take a look. http://gerrit.cloudera.org:8080/#/c/11263/4//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/11263/4//COMMIT_MSG@6 PS4, Line 6: : Blogpost describing index skip scan optimization. : > In reviewing blogposts, it's generally helpful to post a link to a rendered Thanks for this Andrew. I am still not sure why images are not getting rendered here -https://github.com/AnupamaGupta01/kudu/blob/blogpost-2/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md. Although I do see the rendered version locally, using jekyll. Please let me know if I am missing something. http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md File _posts/2018-08-17-index-skip-scan-optimization-in-kudu.md: http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@9 PS4, Line 9: [index skip scan][1]. > This already seems like it's going a bit too far into implementation detail Got it. Done. http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@11 PS4, Line 11: <!--more--> : > Probably don't need this. Done http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@14 PS4, Line 14: > What do these do? This is used to show the beginning excerpts of the post. I misunderstood earlier that it is used for newline. Moved this tag after the beginning two lines and removed this from elsewhere. http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@31 PS4, Line 31: > Maybe, add a reference (like https://en.wikipedia.org/wiki/B-tree) in-line Makes sense. Added an in-line reference. http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@31 PS4, Line 31: > nit: "a B-tree", and no need to capitalize "Tree" below Done http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@33 PS4, Line 33: `metric > nit: perhaps using ``s would be more reasonable here (ie. `host`). Then it' Done http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@32 PS4, Line 32: In this case, by default, Kudu internally builds a primary key index (implemented as a : [B-tree](https://en.wikipedia.org/wiki/B-tree)) for the table `metrics`. : As shown in the table above, the ind > IMO this doesn't convey the idea that the data is sorted by the composite o Thanks for this suggestion. I moved the example dataset from below and used it as a reference to elaborate on the points you mentioned. http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@32 PS4, Line 32: In this case, by default, Kudu internally builds a primary key index (implemented as a : [B-tree](https://en.wikipedia.org/wiki/B-tree)) for the table `metrics`. : As shown in the table above, the ind > +1 all points mentioned by Andrew here. Done http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@39 PS4, Line 39: ? > nit: here and elsewhere, no need for spaces before punctuation marks Done http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@40 PS4, Line 40: the current query execution plan does not use the index. Instead, a full tab > I'm not sure this gives a clear explanation as for the reason to perform a Rephrased this paragraph to clarify this point. http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@45 PS4, Line 45: The question is, > In general, I think the index skip scan optimization is not the only answer Done http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@46 PS4, Line 46: > The crux of this is the prefixes are also sorted, and all rows of a given p Done http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@53 PS4, Line 53: For example, consider the query: > nit: maybe, to be in sync with the CREATE TABLE statement above, write SQL Done http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@51 PS4, Line 51: and satisfying the query predicate on the `tstamp` column. : : For example, consider the query: : {% highlight SQL %} : SELECT clusterid FROM metrics WHERE tstamp = 100; : {% endhighlight %} : > Ah, so you _do_ have an example! I think it'd be helpful setting this up up Moved this example near the beginning. Yes, the data is actually stored like this. This was to emphasize on how Kudu would store the rows due to indexing. Removed "for simplicity" now. http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@61 PS4, Line 61: `host` = `heli > nit: backticks here too and everywhere else that has code snippets Done http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@68 PS4, Line 68: > It's not up to 13x, but it should be arbitrarily optimizable based on the c Makes sense. Removed "up to" here. http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@69 PS4, Line 69: This optimizatio > nit: similar to my comments in the patch, it might help defining these up f Done http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@73 PS4, Line 73: : Based on our experiments, on up to 10 million rows per tablet (as shown below), we found that the skip scan performa > Yep, it would be nice to add some details around the data and reasoning bac 1) Yes, these experiments were based on table schema and query pattern mentioned above. 2) Added an explanation for this. We used sqrt() based on our findings from the performance graphs. 3) Actually, I did not experiment with using other criteria to disable skip scan. I do agree that it will be interesting to explore other options too. http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@73 PS4, Line 73: : Based on our experiments, on up to 10 million rows per tablet (as shown below), we found that the skip scan performa > This could use some explanation as to why sqrt(total_num_rows) was chosen. Added explanation about how we came to using this simple heuristic. Yes, it is the total number of rows in the table. http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@78 PS4, Line 78: keys exceeds ![](https://latex.codecogs.com/gif.latex > This is for the schema and query pattern mentioned earlier, right? Maybe, Done http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@82 PS4, Line 82: ![png]({{ site.github.url }}/img/index-skip-scan/skip-scan-performance-graph.png){:height="1000px" width="600px" .img-responsive} : > I generally see the conclusion as a place to summarize, in addition to talk Done -- To view, visit http://gerrit.cloudera.org:8080/11263 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: gh-pages Gerrit-MessageType: comment Gerrit-Change-Id: I2250652dcba3d1b0a06f1ffb7f23c11bf533d35e Gerrit-Change-Number: 11263 Gerrit-PatchSet: 5 Gerrit-Owner: Anupama Gupta <ag3...@columbia.edu> Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com> Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com> Gerrit-Reviewer: Anupama Gupta <ag3...@columbia.edu> Gerrit-Reviewer: Mike Percy <mpe...@apache.org> Gerrit-Comment-Date: Tue, 04 Sep 2018 02:00:32 +0000 Gerrit-HasComments: Yes