[ 
https://issues.apache.org/jira/browse/KUDU-2844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17185485#comment-17185485
 ] 

ASF subversion and git services commented on KUDU-2844:
-------------------------------------------------------

Commit 4b3e884c09f0b4e8871273a29590e8c56f3c5df7 in kudu's branch 
refs/heads/master from Todd Lipcon
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=4b3e884 ]

KUDU-2844 (3/3): avoid copying plain/dict strings to RowBlock Arena

This changes Dictionary and Plain binary blocks to no longer copy string
values into the destination RowBlock's Arena. Instead, the dictionary
block (or plain block) is attached to the RowBlockMemory's list of
reference-counted block handles, and the cell directly refers to the
underlying block handle.

I modified full_stack-insert-scan-test to have it do some scans with
fault tolerance and no caching. If I comment out the code path that
reference-counts the blocks during a scan, the newly added test fails.

I performance-tested this by loading a lineitem table using
tpch_real_world:

$ tpch_real_world -tpch-path-to-dbgen-dir /data/2/todd/dbgen/ \
  -tpch_num_inserters=8 -tpch_scaling_factor=100 \
  --tpch_mini_cluster_base_dir /data/1/todd/tpch-kudu-data \

For the numbers reported below, I accidentally cancelled the load after
222M rows were present, but I used the same dataset for both "before"
and "after" so the relative comparison is still valid.

I started a tserver and master with the same data directories, with
the tserver running inside perf stat (or perf record to look at
profile):

$ kudu-master \
  -fs-wal-dir /data/1/todd/tpch-kudu-data/master-0/wal/ \
  -fs-data-dirs /data/1/todd/tpch-kudu-data/master-0/data/

$ perf stat kudu-tserver \
  -fs-wal-dir /data/1/todd/tpch-kudu-data/ts-0/wal/ \
  -fs-data-dirs /data/1/todd/tpch-kudu-data/ts-0/data/

I waited until the data had been fully flushed from MRS and compacted
before running the read workloads. To test the reads I ran the following
10 times:

$ kudu perf table_scan localhost tpch_real_world  \
  --columns l_shipdate,l_shipmode,l_comment --num_threads=16

The results of the first test were a bit noisy due to NUMA placement
issues -- some runs were 30-40% faster than other runs, even on the same
build, which made it hard to compare results, even though it was clear
that the optimized version used fewer cycles on average. So, I ran both
the tserver and the client using 'numactl -m 0 -N 0' to force everything
to a single NUMA node. This made results much more consistent.

Before:

         255870.36 msec task-clock                #    3.058 CPUs utilized
            244847      context-switches          #    0.957 K/sec
              3322      cpu-migrations            #    0.013 K/sec
            245814      page-faults               #    0.961 K/sec
     1066864136000      cycles                    #    4.170 GHz                
      (83.46%)
       84410991344      stalled-cycles-frontend   #    7.91% frontend cycles 
idle     (83.37%)
      340913242391      stalled-cycles-backend    #   31.95% backend cycles 
idle      (83.25%)
     1131564485394      instructions              #    1.06  insn per cycle
                                                  #    0.30  stalled cycles per 
insn  (83.34%)
      187879069908      branches                  #  734.274 M/sec              
      (83.32%)
        8550168935      branch-misses             #    4.55% of all branches    
      (83.26%)

     191.262870000 seconds user
      64.765755000 seconds sys

After:

         214131.49 msec task-clock                #    2.750 CPUs utilized
            245357      context-switches          #    0.001 M/sec
              2734      cpu-migrations            #    0.013 K/sec
            248108      page-faults               #    0.001 M/sec
      893270854012      cycles                    #    4.172 GHz                
      (83.45%)
       83805641687      stalled-cycles-frontend   #    9.38% frontend cycles 
idle     (83.25%)
      345166097238      stalled-cycles-backend    #   38.64% backend cycles 
idle      (83.29%)
      913435059189      instructions              #    1.02  insn per cycle
                                                  #    0.38  stalled cycles per 
insn  (83.36%)
      142198832288      branches                  #  664.072 M/sec              
      (83.36%)
        4819907752      branch-misses             #    3.39% of all branches    
      (83.29%)

      77.876854360 seconds time elapsed

     146.195821000 seconds user
      68.113598000 seconds sys

To summarize, the change gives about 1.30x reduction in CPU cycles on the
tserver. The wall clock reported by the perf scan tool showed a 15-20%
reduction in wall-clock. Testing just scanning a dictionary-encoded column
shows even better results.

Looking at 'perf diff -c ratio -o 0 --sort=sym' between two profiles collected
by perf-record, we can see that the BinaryDictBlockDecoder code path is much
cheaper, most of the memcpy calls are removed, and slightly more CPU is spent
serializing the result (probably due to reduced locality of reference copying
the string data to the wire).

    Orig % CPU     Ratio      Symbol
-------------------------------------------------------------------------------------
    26.16%        0.437272  [.] 
kudu::cfile::BinaryDictBlockDecoder::CopyNextDecodeStrings(unsigned long*, 
kudu::ColumnDataView*)
    19.63%        1.135304  [.] kudu::SerializeRowBlock(kudu::RowBlock const&, 
kudu::Schema const*, kudu::faststring*, kudu::faststring*, bool)
    11.49%        1.002845  [k] copy_user_enhanced_fast_string
    10.29%        0.068955  [.] __memcpy_ssse3_back

Change-Id: I93fa1f9fd401814a42dc5a1f3fd2ffb1286ac441
Reviewed-on: http://gerrit.cloudera.org:8080/15802
Tested-by: Kudu Jenkins
Reviewed-by: Andrew Wong <[email protected]>


> Avoid copying strings from dictionary or plain-encoded blocks
> -------------------------------------------------------------
>
>                 Key: KUDU-2844
>                 URL: https://issues.apache.org/jira/browse/KUDU-2844
>             Project: Kudu
>          Issue Type: Improvement
>          Components: cfile, perf
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Major
>         Attachments: fg.svg
>
>
> When scanning a plain or dictionary-encoded binary column, we currently loop 
> over each entry and copy the string into the destination RowBlock's arena. In 
> TPCH Q1, the scanner threads use a significant percentage of CPU doing this 
> copying, and it also increases CPU cache footprint which likely decreases 
> performance in downstream operations like predicate evaluation, merging, 
> result serialization, etc.
> Instead of doing this, we could "attach" the dictionary block (with 
> ref-counting) to the RowBlock and refer directly to the dictionary entry from 
> the RowBlock. When the RowBlock eventually is reset, we can drop the 
> reference. This should be safe because we never mutate indirect data in-place.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to