Prasanth Jayachandran created HIVE-17220:
--------------------------------------------

             Summary: Bloomfilter probing in semijoin reduction is thrashing L1 dcache
                 Key: HIVE-17220
                 URL: https://issues.apache.org/jira/browse/HIVE-17220
             Project: Hive
          Issue Type: Bug
    Affects Versions: 3.0.0
            Reporter: Prasanth Jayachandran
            Assignee: Prasanth Jayachandran


[~gopalv] observed perf profiles showing bloom filter probes as a bottleneck for 
some of the TPC-DS queries, with the probes thrashing the L1 data cache. 

This is because the bloom filter's bitset is huge and does not fit in any level 
of cache. In addition, the hash bits for a single key map to different, widely 
separated segments of the bitset. In the worst case, every probed key can 
therefore incur K-1 extra memory accesses (K being the number of hash functions), 
because each bit lookup misses in L1 cache. 
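To make the access pattern concrete, here is a minimal Java sketch of a conventional bloom filter probe. It assumes double hashing (the i-th position is h1 + (i+1)*h2, reduced over the whole bitset) and uses SplitMix64 as a stand-in hash; the class ClassicBloomSketch and its methods are hypothetical, not Hive's code. cacheLinesTouched counts how many distinct 64-byte lines one probe reads, assuming the long[] starts on a cache-line boundary.

{code}
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of a conventional bloom filter; not Hive's implementation.
final class ClassicBloomSketch {
    private final long[] bits;   // bitset packed into 64-bit words
    private final int numBits;
    private final int k;         // number of hash functions

    ClassicBloomSketch(int numBits, int k) {
        this.numBits = numBits;
        this.k = k;
        this.bits = new long[(numBits + 63) / 64];
    }

    // Stand-in 64-bit mixer (SplitMix64 finalizer); an assumption, not Hive's hash.
    static long mix(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    // Double hashing: position i is h1 + (i + 1) * h2, reduced over the WHOLE bitset,
    // so the K positions for one key are unrelated in memory.
    private int position(long h, int i) {
        int h1 = (int) h;
        int h2 = (int) (h >>> 32);
        int combined = h1 + (i + 1) * h2;
        if (combined < 0) combined = ~combined;  // keep the index non-negative
        return combined % numBits;
    }

    void add(long key) {
        long h = mix(key);
        for (int i = 0; i < k; i++) {
            int p = position(h, i);
            bits[p >>> 6] |= 1L << p;            // long shift uses the low 6 bits of p
        }
    }

    // Members read all K words; most non-members stop at the first unset bit.
    boolean mightContain(long key) {
        long h = mix(key);
        for (int i = 0; i < k; i++) {
            int p = position(h, i);
            if ((bits[p >>> 6] & (1L << p)) == 0) return false;
        }
        return true;
    }

    // Distinct 64-byte cache lines (512 bits each) a full probe touches,
    // assuming the long[] starts on a cache-line boundary.
    int cacheLinesTouched(long key) {
        long h = mix(key);
        Set<Integer> lines = new HashSet<>();
        for (int i = 0; i < k; i++) {
            lines.add(position(h, i) >>> 9);
        }
        return lines.size();
    }
}
{code}

With a multi-megabyte bitset (e.g. 2^24 bits and K = 7), almost every member key touches close to K distinct cache lines, which is the locality problem described above.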

Ran a JMH microbenchmark to verify this. The following is the perf profile of 
bloom filter probing from that benchmark:
{code}
Perf stats:
--------------------------------------------------

       5101.935637      task-clock (msec)         #    0.461 CPUs utilized
               346      context-switches          #    0.068 K/sec
               336      cpu-migrations            #    0.066 K/sec
             6,207      page-faults               #    0.001 M/sec
    10,016,486,301      cycles                    #    1.963 GHz                      (26.90%)
     5,751,692,176      stalled-cycles-frontend   #   57.42% frontend cycles idle     (27.05%)
   <not supported>      stalled-cycles-backend
    14,359,914,397      instructions              #    1.43  insns per cycle
                                                  #    0.40  stalled cycles per insn  (33.78%)
     2,200,632,861      branches                  #  431.333 M/sec                    (33.84%)
         1,162,860      branch-misses             #    0.05% of all branches          (33.97%)
     1,025,992,254      L1-dcache-loads           #  201.099 M/sec                    (26.56%)
       432,663,098      L1-dcache-load-misses     #   42.17% of all L1-dcache hits    (14.49%)
       331,383,297      LLC-loads                 #   64.952 M/sec                    (14.47%)
           203,524      LLC-load-misses           #    0.06% of all LL-cache hits     (21.67%)
   <not supported>      L1-icache-loads
         1,633,821      L1-icache-load-misses     #    0.320 M/sec                    (28.85%)
       950,368,796      dTLB-loads                #  186.276 M/sec                    (28.61%)
       246,813,393      dTLB-load-misses          #   25.97% of all dTLB cache hits   (14.53%)
            25,451      iTLB-loads                #    0.005 M/sec                    (14.48%)
            35,415      iTLB-load-misses          #  139.15% of all iTLB cache hits   (21.73%)
   <not supported>      L1-dcache-prefetches
           175,958      L1-dcache-prefetch-misses #    0.034 M/sec                    (28.94%)

      11.064783140 seconds time elapsed
{code}

This shows that 42.17% of L1 data cache loads miss. 
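perf labels the L1-dcache-load-misses percentage as "% of all L1-dcache hits", but the figure is really misses divided by loads: 432,663,098 / 1,025,992,254 ≈ 0.4217. A tiny helper (the name PerfRatios is hypothetical) makes that arithmetic explicit and gives the dTLB miss ratio from the same profile.

{code}
import java.util.Locale;

// Recomputes a miss ratio from raw perf counters as misses / loads, formatted like perf.
final class PerfRatios {
    static String percent(long misses, long loads) {
        return String.format(Locale.ROOT, "%.2f%%", 100.0 * misses / loads);
    }
}
{code}

PerfRatios.percent(432_663_098L, 1_025_992_254L) returns "42.17%" (L1 dcache), and PerfRatios.percent(246_813_393L, 950_368_796L) returns "25.97%" (dTLB).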

This JIRA proposes switching semijoin probing to a cache-efficient bloom filter.
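The issue does not say which cache-efficient design will be used. One common option is a blocked (cache-line) bloom filter: each key selects a single 64-byte block and all K bits are set and tested inside it, so a probe costs at most one cache-line miss instead of up to K. Below is a minimal Java sketch under those assumptions; BlockedBloomSketch, the SplitMix64 stand-in hash and the 512-bit block size are illustrative choices, not Hive's API.

{code}
// Hypothetical cache-blocked bloom filter sketch; not Hive's implementation.
// Each key maps to one 512-bit block (eight longs = 64 bytes, assuming the
// long[] is cache-line aligned); all K bits for the key live in that block.
final class BlockedBloomSketch {
    private static final int WORDS_PER_BLOCK = 8;   // 8 x 64 bits = 512 bits = 64 bytes
    private final long[] bits;
    private final int numBlocks;
    private final int k;                            // bits set per key

    BlockedBloomSketch(int numBits, int k) {
        this.numBlocks = Math.max(1, (numBits + 511) / 512);
        this.k = k;
        this.bits = new long[numBlocks * WORDS_PER_BLOCK];
    }

    // Stand-in 64-bit mixer (SplitMix64 finalizer); an assumption, not Hive's hash.
    static long mix(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    // Index of the first long of the block chosen by this key's hash.
    private int blockStart(long h) {
        int h1 = (int) (h >>> 32);
        if (h1 < 0) h1 = ~h1;
        return (h1 % numBlocks) * WORDS_PER_BLOCK;
    }

    // i-th bit offset (0..511) inside the block, by double hashing on a second mix.
    private static int offset(long g, int i) {
        return ((int) g + (i + 1) * (int) (g >>> 32)) & 511;
    }

    void add(long key) {
        long h = mix(key);
        long g = mix(h);
        int start = blockStart(h);
        for (int i = 0; i < k; i++) {
            int bit = offset(g, i);
            bits[start + (bit >>> 6)] |= 1L << bit;   // word 0..7 within the block
        }
    }

    boolean mightContain(long key) {
        long h = mix(key);
        long g = mix(h);
        int start = blockStart(h);
        for (int i = 0; i < k; i++) {
            int bit = offset(g, i);
            if ((bits[start + (bit >>> 6)] & (1L << bit)) == 0) return false;
        }
        return true;
    }

    // Word indices a probe for this key reads; all lie in [start, start + 8).
    int[] wordsTouched(long key) {
        long h = mix(key);
        long g = mix(h);
        int start = blockStart(h);
        int[] words = new int[k];
        for (int i = 0; i < k; i++) {
            words[i] = start + (offset(g, i) >>> 6);
        }
        return words;
    }
}
{code}

The usual cost of this layout is a somewhat higher false-positive rate than a classic filter of the same size, because keys do not spread perfectly evenly across blocks; the bits per key or K may need retuning to compensate.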



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)