Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13721 )

Change subject: KUDU-2847: Optimize iteration over selection vector in 
SerializeRowBlock
......................................................................


Patch Set 1:

I chatted with Zhang Yao last night on Slack, but reproducing the content here 
in case others want to follow along:

Long ago, we used to do this conversion row-by-row, and moved to 
column-by-column in commit 3cd5496710351f76a72841f7a691fb1bfd50bb9a back in 
2013. This had some big performance improvements as measured by the benchmark 
in wire_protocol-test.

However, that benchmark is quite simplistic -- it only tests the case when all 
rows are selected, and only for a schema with three columns. In KUDU-2847 (and 
in other places over the last few years) we found that the columnar conversion 
is a bottleneck in other situations. In particular:

- if only a small number of rows are selected, or if around half of the rows 
are selected (resulting in small runs) then the existing CopyColumn path spends 
a lot of work in branches trying to find runs (BitmapFindFirst, etc). We could 
probably improve this bitmap iteration code somewhat, but this issue is 
magnifed by:
- we do the iteration over the selection vector once per column, so if there 
are many columns, we don't get to amortize that cost.

It looks like the patch here doesn't go fully back to "row-by-row", but instead 
switches to one iteration over the selection vector, and then for each run, 
iterates column-wise. That might be a good compromise, since it avoids the 
per-column selvec iteration cost while also avoiding per-row iteration over the 
schema.

That said, I suggested that Zhang Yao extend the benchmarks to test all 
combinations of the two important dimensions:
- schema column count (narrow, medium, wide), perhaps testing both STRING typed 
columns and primitives
- selection vector pattern (all set, mostly set, half set, few set)

If we have a good set of benchmarks to work from, I think we'll be in a better 
place to evaluate different options here.


--
To view, visit http://gerrit.cloudera.org:8080/13721
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I19917d1875c46fd4cf98ef8a471b0340a76161e7
Gerrit-Change-Number: 13721
Gerrit-PatchSet: 1
Gerrit-Owner: ZhangYao <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Yao Xu <[email protected]>
Gerrit-Comment-Date: Wed, 26 Jun 2019 16:09:42 +0000
Gerrit-HasComments: No

Reply via email to