Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/11263 )

Change subject: Blogpost describing index skip scan optimization.
......................................................................


Patch Set 4:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md
File _posts/2018-08-17-index-skip-scan-optimization-in-kudu.md:

http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@31
PS4, Line 31: B-Tree
Maybe add a reference (e.g., https://en.wikipedia.org/wiki/B-tree), either inline or in a separate 'References' section?


http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@32
PS4, Line 32: The data is sorted lexicographically starting from the leftmost primary key column and stored in the B-Tree leaf nodes.
            : Therefore, when the user query contains the first key column ("host"), Kudu uses the primary key range push down
            : operation to optimize the scan time.
> IMO this doesn't convey the idea that the data is sorted by the composite o
+1 to all the points Andrew mentioned here.


http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@40
PS4, Line 40: (since the primary key index is sorted on the basis of the first key column)
I'm not sure this clearly explains why a full table scan is needed. Could you update this to explain why the primary key index alone cannot be used to locate the desired rows directly?


http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@45
PS4, Line 45: The answer is yes
In general, I think the index skip scan optimization is not the only answer. In other databases it's possible to build secondary indices, and that might work even better (of course, that depends on the read/write ratio for the use case and on the space available for an additional index).

I think it's worth mentioning that building a secondary index is not an option here, since Kudu does not support secondary indices yet.


http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@53
PS4, Line 53: select clusterid from metrics where tstamp = 100
nit: to stay consistent with the CREATE TABLE statement above, maybe write the SQL keywords in capital letters.


http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@62
PS4, Line 62: popularly known as index skip scan optimization can skip all the rows for which host = "helium" and tstamp != 100
> nit: it's great to get to the point that we call this a "skip scan". To dri
Maybe it's worth mentioning 'skip scan' earlier, where you give a short overview of the idea behind the skip scan optimization. Also, to back up the 'popularity' of the term, I think it would be useful to add a separate section with references to various databases that implement this optimization (e.g., one of those links might be https://oracle-base.com/articles/9i/index-skip-scanning).


http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@73
PS4, Line 73: Based on experiments on upto 10 million rows per tablet, we decided to disable skip scan when the number of seeks
            : for distinct prefix column values exceeds .
> This could use some explanation as to why sqrt(total_num_rows) was chosen.
Yep, it would be nice to add some details about the data and reasoning behind this disable-skip-scan criterion.

1) Did those experiments use the table schema and query pattern mentioned above, or did they involve other table schemas and query patterns?
2) What was the conceptual rationale for choosing the sqrt() metric?
3) If there were multiple candidate criteria to choose from, maybe it's worth mentioning those as well?
4) If 3 is true, was the sqrt() criterion a clear winner, or was there some fuzziness, with sqrt() chosen partly because it looks simpler compared to the others?


http://gerrit.cloudera.org:8080/#/c/11263/4/_posts/2018-08-17-index-skip-scan-optimization-in-kudu.md@78
PS4, Line 78: The performance graph of this approach is shown below
This is for the schema and query pattern mentioned earlier, right? Maybe it's worth mentioning that.



-- 
To view, visit http://gerrit.cloudera.org:8080/11263
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: gh-pages
Gerrit-MessageType: comment
Gerrit-Change-Id: I2250652dcba3d1b0a06f1ffb7f23c11bf533d35e
Gerrit-Change-Number: 11263
Gerrit-PatchSet: 4
Gerrit-Owner: Anupama Gupta <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Comment-Date: Wed, 29 Aug 2018 22:39:17 +0000
Gerrit-HasComments: Yes
