Hello,

Based on this thread, Alexandra and I decided to investigate whether we could borrow some passes from -O1 and add them to the default optimization of -O0 with mem2reg. To determine which passes would make the most sense, we ran ICW with jit_above_cost set to 0, dumped all the backends, and then analyzed them with 'opt'. The dumped stats indicated that the instcombine and sroa passes had the most scope for optimization. We have attached the stats we dumped.
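
For anyone wanting to repeat the pass ranking, here is a minimal sketch of how per-pass counters could be tallied. It assumes the text layout that 'opt -stats' prints, "<count> <pass> - <description>"; the helper name tally_stats and the SAMPLE input are hypothetical and are not taken from our dumps. Summing every counter of a pass into one number is only a crude proxy for "scope for optimization".

```python
import re
from collections import Counter

# Matches statistic lines such as "  7 instcombine - Number of insts combined".
# Assumption: this is the layout of the statistics text being parsed.
STAT_LINE = re.compile(r"^\s*(\d+)\s+(\S+)\s+-\s+(.+?)\s*$")


def tally_stats(text):
    """Sum all counters per pass name (the second column) in a stats dump."""
    totals = Counter()
    for line in text.splitlines():
        match = STAT_LINE.match(line)
        if match:
            totals[match.group(2)] += int(match.group(1))
    return totals


# Made-up sample input for illustration only, NOT real measurements.
SAMPLE = """\
===-------------------------------------===
          ... Statistics Collected ...
===-------------------------------------===

  7 instcombine - Number of insts combined
  2 instcombine - Number of dead inst eliminated
  3 sroa        - Number of allocas promoted to SSA values
  1 mem2reg     - Number of alloca's promoted
"""

totals = tally_stats(SAMPLE)
for pass_name, count in totals.most_common():
    print(pass_name, count)
```

Concatenating the statistics text from every dumped module and passing it to tally_stats would give one ranking across the whole run.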

Then, we investigated whether mixing in sroa and instcombine gave us a better run time. We used TPCH Q1 (TPCH repo we used: https://github.com/dimitri/tpch-citus) at scales of 1, 5 and 50. We found no significant difference in query runtime compared to the default of -O0 with mem2reg.

We also performed the same experiment with -O1 as the default optimization level, as Andres had suggested on this thread. The results were much more promising (see the results for scale = 5 and 50 below). At the lower scale of 1, we had to force optimization to meet the query cost. There was no adverse impact from increased query optimization time due to the ramp up to -O1 at this lower scale.

Results summary (eyeball-averaged over 5 runs, excluding the first run after restart; for each configuration we flushed the OS cache and restarted the database):

settings: max_parallel_workers_per_gather = 0

scale = 50:
-O3                                  : 77s
-O0 + mem2reg                        : 107s
-O0 + mem2reg + instcombine          : 107s
-O0 + mem2reg + sroa                 : 107s
-O0 + mem2reg + sroa + instcombine   : 107s
-O1                                  : 84s

scale = 5:
-O3                                  : 8s
-O0 + mem2reg                        : 10s
-O0 + mem2reg + instcombine          : 10s
-O0 + mem2reg + sroa                 : 10s
-O0 + mem2reg + sroa + instcombine   : 10s
-O1                                  : 8s

scale = 1:
-O3                                  : 1.7s
-O0 + mem2reg                        : 1.7s
-O0 + mem2reg + instcombine          : 1.7s
-O0 + mem2reg + sroa                 : 1.7s
-O0 + mem2reg + sroa + instcombine   : 1.7s
-O1                                  : 1.7s

Based on the evidence above, maybe it is worth considering ramping up the default optimization level to -O1.

Regards,
Soumyadeep and Alexandra
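
As a reading aid for the results summary above, the following sketch works out the runtime reduction of -O1 and -O3 relative to the -O0 + mem2reg baseline. The runtimes are copied from the summary; the names RUNTIMES, BASELINE and reduction_vs_baseline are hypothetical. Since the inputs are eyeball averages, the percentages are only approximate.

```python
# Runtimes in seconds, copied from the results summary (eyeball averages).
RUNTIMES = {
    50: {"-O3": 77.0, "-O0 + mem2reg": 107.0, "-O1": 84.0},
    5: {"-O3": 8.0, "-O0 + mem2reg": 10.0, "-O1": 8.0},
    1: {"-O3": 1.7, "-O0 + mem2reg": 1.7, "-O1": 1.7},
}
BASELINE = "-O0 + mem2reg"


def reduction_vs_baseline(times, baseline=BASELINE):
    """Return the percentage runtime reduction of each setting vs. the baseline."""
    base = times[baseline]
    return {
        name: round(100.0 * (base - seconds) / base, 1)
        for name, seconds in times.items()
        if name != baseline
    }


for scale, times in RUNTIMES.items():
    print("scale =", scale, reduction_vs_baseline(times))
```

By this arithmetic, -O1 recovers most of the gain of -O3 at scale 50 and all of it at scale 5, while nothing changes at scale 1.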

Attachment: opt_dump.pdf (Adobe PDF document)