Attila Bukor has submitted this change and it was merged.
( http://gerrit.cloudera.org:8080/14353 )

Change subject: KUDU-1938 Make UTF-8 truncation faster pt 1
......................................................................

KUDU-1938 Make UTF-8 truncation faster pt 1

This commit adds a fast path for ASCII strings: if the most significant
bit (MSB) of every byte in a chunk of the string is 0, the character
counter and the iterator both advance by the chunk size. This way, if a
chunk contains only ASCII characters, there's no need to count each
character individually.
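
As a rough illustration of the fast path described above, here is a
minimal sketch. It is not the code from char_util.cc: the function name
Utf8PrefixBytes, the 8-byte chunk size, and the assumption that the input
is valid UTF-8 are all made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not Kudu's actual API): returns the number of bytes in
// the longest prefix of data[0, len) that holds at most max_chars UTF-8
// characters. Assumes the input is valid UTF-8.
size_t Utf8PrefixBytes(const char* data, size_t len, size_t max_chars) {
  assert(data != nullptr || len == 0);
  constexpr size_t kChunk = sizeof(uint64_t);           // assumed chunk size
  constexpr uint64_t kMsbMask = 0x8080808080808080ULL;  // MSB of each byte

  size_t pos = 0;    // byte offset into the string (the "iterator")
  size_t chars = 0;  // characters counted so far (the "counter")

  while (pos < len && chars < max_chars) {
    // Fast path: only attempted when a whole chunk fits in both the
    // remaining input and the remaining character budget.
    if (len - pos >= kChunk && max_chars - chars >= kChunk) {
      uint64_t chunk;
      std::memcpy(&chunk, data + pos, kChunk);  // safe unaligned load
      if ((chunk & kMsbMask) == 0) {
        // Every byte has a 0 MSB, so the chunk is kChunk ASCII characters.
        pos += kChunk;
        chars += kChunk;
        continue;
      }
    }
    // Slow path: consume one character, i.e. its lead byte plus any
    // continuation bytes (those of the form 10xxxxxx).
    ++pos;
    while (pos < len &&
           (static_cast<unsigned char>(data[pos]) & 0xC0) == 0x80) {
      ++pos;
    }
    ++chars;
  }
  return pos;
}
```

In this sketch, a chunk that contains any non-ASCII byte pays for the
mask check and then still goes through the per-character loop, so some
overhead on non-ASCII input would be expected from this structure.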

Thanks to Todd Lipcon for the initial idea and Zoltan Chovan and Istvan
Farmosi for the brainstorming and the help in figuring out how this
should be done.

Before:

[ RUN      ] CharUtilTest.StressTestUtf8
[       OK ] CharUtilTest.StressTestUtf8 (6698 ms)
[ RUN      ] CharUtilTest.StressTestAscii
[       OK ] CharUtilTest.StressTestAscii (6161 ms)

After:

[ RUN      ] CharUtilTest.StressTestUtf8
[       OK ] CharUtilTest.StressTestUtf8 (7746 ms)
[ RUN      ] CharUtilTest.StressTestAscii
[       OK ] CharUtilTest.StressTestAscii (1028 ms)

Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
Reviewed-on: http://gerrit.cloudera.org:8080/14353
Reviewed-by: Adar Dembo <[email protected]>
Tested-by: Kudu Jenkins
---
M src/kudu/util/CMakeLists.txt
A src/kudu/util/char_util-test.cc
M src/kudu/util/char_util.cc
A src/kudu/util/testdata/char_truncate_ascii.txt
A src/kudu/util/testdata/char_truncate_utf8.txt
5 files changed, 421 insertions(+), 11 deletions(-)

Approvals:
  Adar Dembo: Looks good to me, approved
  Kudu Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/14353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Iebb98e18a3619029d9b0bc224c7dead89a3d7374
Gerrit-Change-Number: 14353
Gerrit-PatchSet: 12
Gerrit-Owner: Attila Bukor <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Attila Bukor <[email protected]>
Gerrit-Reviewer: Grant Henke <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
