[ https://issues.apache.org/jira/browse/KUDU-1930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905984#comment-15905984 ]

Todd Lipcon commented on KUDU-1930:
-----------------------------------

Tested with commands like:
{code}
./build/latest/bin/tpch_real_world -tpch_path_to_ts_flags_file ./tsflags \
  -tpch_scaling_factor 100 -tpch_num_inserters 8 -notpch_run_queries \
  -tpch_path_to_dbgen_dir /data/2/mpercy/tpch_2_17_0/dbgen \
  -tpch_partition_strategy hash
{code}
(the new 'hash' partition strategy is from a simple local patch)

Results:
{code}
with 1 MM thread:
I0310 13:53:38.925390  1568 tpch_real_world.cc:278] Time spent by thread 2 to load generated data into the database: real 1140.187s     user 398.411s   sys 5.315s

with 4 MM threads:
I0310 14:06:35.509299  8524 tpch_real_world.cc:278] Time spent by thread 4 to load generated data into the database: real 618.348s      user 413.431s   sys 5.437s

with 4 MM threads, hash partition:
I0310 15:32:52.386118 27623 tpch_real_world.cc:289] Time spent by thread 2 to load generated data into the database: real 1233.386s     user 462.671s   sys 6.084s

with 4 MM threads, using dense_hash_map instead of std::unordered_map for dictionary builder:
I0310 17:26:00.682138 32076 tpch_real_world.cc:289] Time spent by thread 0 to load generated data into the database: real 1147.478s     user 464.147s   sys 6.200s
{code}

The "user" times here are measured on the client side, so they are not very relevant; "real" is the total wall time taken. Comparing the two hash-partitioned runs, dense_hash_map looks like an easy 7% speedup over the STL std::unordered_map (1147s vs. 1233s of wall time). As we've long known, inserting in sorted order (range partitioning, 618s) is about 2x faster than inserting in non-sorted order (hash partitioning, 1233s), and the longer the benchmark runs, the larger the difference becomes.

> Improve performance of dictionary builder
> -----------------------------------------
>
>                 Key: KUDU-1930
>                 URL: https://issues.apache.org/jira/browse/KUDU-1930
>             Project: Kudu
>          Issue Type: Bug
>          Components: cfile, perf
>    Affects Versions: 1.3.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>
> I locally tweaked tpch_real_world to use hash partitioning instead of range 
> partitioning, so that the different threads overlapped on the same tablets, 
> simulating a more realistic parallel load scenario. I noticed that the MM 
> threads were CPU bound, with a high percentage of CPU in AddCodeWords(). 
> Initial prototypes indicate that optimizing the hashmap used here would be an 
> easy win.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)