Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/13721 )
Change subject: KUDU-2847: Optimize iteration over selection vector in SerializeRowBlock ...................................................................... Patch Set 1: I chatted with Zhang Yao last night on Slack, but reproducing the content here in case others want to follow along: Long ago, we used to do this conversion row-by-row, and moved to column-by-column in commit 3cd5496710351f76a72841f7a691fb1bfd50bb9a back in 2013. This had some big performance improvements as measured by the benchmark in wire_protocol-test. However, that benchmark is quite simplistic -- it only tests the case when all rows are selected, and only for a schema with three columns. In KUDU-2847 (and in other places over the last few years) we found that the columnar conversion is a bottleneck in other situations. In particular: - if only a small number of rows are selected, or if around half of the rows are selected (resulting in small runs) then the existing CopyColumn path spends a lot of work in branches trying to find runs (BitmapFindFirst, etc). We could probably improve this bitmap iteration code somewhat, but this issue is magnifed by: - we do the iteration over the selection vector once per column, so if there are many columns, we don't get to amortize that cost. It looks like the patch here doesn't go fully back to "row-by-row", but instead switches to one iteration over the selection vector, and then for each run, iterates column-wise. That might be a good compromise, since it avoids the per-column selvec iteration cost while also avoiding per-row iteration over the schema. That said, I suggested that Zhang Yao extend the benchmarks to test all combinations of the two important dimensions: - schema column count (narrow, medium, wide), perhaps testing both STRING typed columns and primitives - selection vector pattern (all set, mostly set, half set, few set) If we have a good set of benchmarks to work from, I think we'll be in a better place to evaluate different options here. -- To view, visit http://gerrit.cloudera.org:8080/13721 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I19917d1875c46fd4cf98ef8a471b0340a76161e7 Gerrit-Change-Number: 13721 Gerrit-PatchSet: 1 Gerrit-Owner: ZhangYao <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-Reviewer: Yao Xu <[email protected]> Gerrit-Comment-Date: Wed, 26 Jun 2019 16:09:42 +0000 Gerrit-HasComments: No
