[
https://issues.apache.org/jira/browse/KUDU-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154923#comment-17154923
]
ASF subversion and git services commented on KUDU-636:
------------------------------------------------------
Commit 9287bc2095bfea2713cd743dc3f5bb4cd0f41476 in kudu's branch
refs/heads/master from Todd Lipcon
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=9287bc2 ]
KUDU-636. Use protobuf arenas for CommitMsgs
This commit optimizes the write path by allowing protobuf arenas to be
used to construct the operation results protobuf and the CommitMsg that
contains it. The operation result for a large batch of writes has one or
more embedded protobuf per inserted row, so using a protobuf arena for
allocation is much more efficient than calling into the system allocator
for each object.
In order to accomplish this, I had to simplify the Log interface a bit.
Previously, the Log code constructed a LogEntryPB and passed that
through to the log's appender thread, even though it had already
performed all of the serialization on the submitting thread. Doing that
required that the log entry retain references to all of the embedded
protobufs, which complicated lifetime quite a bit.
The new Log interface performs all of the serialization and analysis
(including extracting the OpIds of the replicate messages in the batch)
inline in the submission path instead of doing any such work on the
Append thread. With this, the interface now just takes a const protobuf
reference instead of a unique_ptr<CommitMsg>, which means that the
caller has a simpler model around its lifetime.
With the above accomplished, it was straightforward to add a protobuf
Arena to the OpState structure and allocate the CommitMsg and its
constituent sub-messages from that Arena.
The performance benefits are substantial. I benchmarked on a local
machine using:
$ kudu perf loadgen localhost -num_rows_per_thread=10000000 -num_threads=8
and ran the tserver under `perf stat` to collect counters:
Without patch:
INSERT report
rows total: 80000000
time total: 35860.9 ms
time per row: 0.000448261 ms
Performance counter stats for './build/latest/bin/kudu tserver run
-fs-wal-dir /tmp/ts':
378784.92 msec task-clock # 8.453 CPUs utilized
1429039 context-switches # 0.004 M/sec
132930 cpu-migrations # 0.351 K/sec
3128091 page-faults # 0.008 M/sec
1553122880821 cycles # 4.100 GHz
(83.24%)
313764365792 stalled-cycles-frontend # 20.20% frontend cycles
idle (83.33%)
166769392663 stalled-cycles-backend # 10.74% backend cycles
idle (83.39%)
943534760864 instructions # 0.61 insn per cycle
# 0.33 stalled cycles
per insn (83.34%)
170465210875 branches # 450.032 M/sec
(83.39%)
834101556 branch-misses # 0.49% of all branches
(83.32%)
357.520042000 seconds user
21.770448000 seconds sys
With patch:
INSERT report
rows total: 80000000
time total: 32701 ms
time per row: 0.000408763 ms
Performance counter stats for './build/latest/bin/kudu tserver run
-fs-wal-dir /tmp/ts':
272393.27 msec task-clock # 4.915 CPUs utilized
300768 context-switches # 0.001 M/sec
44879 cpu-migrations # 0.165 K/sec
2861143 page-faults # 0.011 M/sec
1126891932279 cycles # 4.137 GHz
(83.28%)
209167186469 stalled-cycles-frontend # 18.56% frontend cycles
idle (83.42%)
144156173079 stalled-cycles-backend # 12.79% backend cycles
idle (83.34%)
925439690437 instructions # 0.82 insn per cycle
# 0.23 stalled cycles
per insn (83.28%)
163672508297 branches # 600.868 M/sec
(83.33%)
655509045 branch-misses # 0.40% of all branches
(83.34%)
257.521990000 seconds user
15.112482000 seconds sys
Summary:
* 9.6% throughput increase
* 39% reduction in tserver cycles
Change-Id: I78698d4cb4944bddd8dabd6cbbf1e3a064056199
Reviewed-on: http://gerrit.cloudera.org:8080/16147
Tested-by: Kudu Jenkins
Reviewed-by: Alexey Serbin <[email protected]>
> optimization: we spend a lot of time in alloc/free
> --------------------------------------------------
>
> Key: KUDU-636
> URL: https://issues.apache.org/jira/browse/KUDU-636
> Project: Kudu
> Issue Type: Improvement
> Components: perf
> Affects Versions: Public beta
> Reporter: Todd Lipcon
> Priority: Major
>
> Looking at a workload in the cluster, several of the top 10 lines of perf
> report are tcmalloc-related. It seems like we don't do a good job of making
> use of the per-thread free-lists, and we end up in a lot of contention on the
> central free list. There are a few low-hanging fruit things we could do to
> improve this for a likely perf boost.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)