codegen: fix regression in inlining after LLVM 4.0 upgrade
With the upgrade to LLVM 4.0, the performance of our generated code
regressed significantly. After looking at the generated assembly from
codegen-test, I could see that there were many 'call' instructions that
didn't appear in the earlier output, which led me to suspect the
inliner.
After adding a 'module->dump()' call and looking at the output, I could
see that the calls were to utility functions like BitmapTest() which
should be inlined due to their very small size. However, the function
was marked 'noinline' in precompiled.ll. After a bit of Googling I came
across a thread[1] in which someone else noticed that all of their
functions had unexpected 'noinline' attributes after upgrading to 4.0.
After following the breadcrumbs, I found some advice to change the way
in which we emit precompiled.ll to disable the built-in LLVM passes
which were responsible for adding this attribute.
This had one unintended side-effect which initially broke codegen in
TSAN builds (which use libc++ instead of libstdcxx). Fixing this required
changing the debug builds to still run the most basic ('-O0')
optimization passes at JIT time instead of skipping the optimization
pass manager entirely. Additionally I needed to explicitly run Global
Dead Code Elimination (GlobalDCE). See the new comments in
module_builder.cc for details.
That also caused the size of generated code in UBSAN builds to increase,
so I had to increase the limit on the number of dumped instructions per
function to avoid breaking one of the tests.
During my investigation, I also found that we should have been passing
the optimization and size levels into the inlining pass, so I threw
that change in for good measure.
This fixed the perf regression. I benchmarked using:
memrowset-test --roundtrip_num_rows=10000000 \
--gtest_filter=\*InsertCount\* \
--gtest_repeat=10
LLVM 3.9 (before the regression) (asm at [2]):
I0607 12:00:17.540612 23532 memrowset-test.cc:446] Time spent Scanning rows
where all are committed: real 0.295s user 0.294s sys 0.000s
I0607 12:00:23.842226 23532 memrowset-test.cc:446] Time spent Scanning rows
where all are committed: real 0.297s user 0.297s sys 0.000s
I0607 12:00:29.710149 23532 memrowset-test.cc:446] Time spent Scanning rows
where all are committed: real 0.296s user 0.296s sys 0.000s
I0607 12:00:35.587321 23532 memrowset-test.cc:446] Time spent Scanning rows
where all are committed: real 0.296s user 0.296s sys 0.000s
I0607 12:00:41.451495 23532 memrowset-test.cc:446] Time spent Scanning rows
where all are committed: real 0.296s user 0.296s sys 0.000s
I0607 12:00:47.309710 23532 memrowset-test.cc:446] Time spent Scanning rows
where all are committed: real 0.297s user 0.297s sys 0.000s
I0607 12:00:53.183537 23532 memrowset-test.cc:446] Time spent Scanning rows
where all are committed: real 0.297s user 0.297s sys 0.000s
I0607 12:00:59.051522 23532 memrowset-test.cc:446] Time spent Scanning rows
where all are committed: real 0.297s user 0.297s sys 0.000s
I0607 12:01:04.935763 23532 memrowset-test.cc:446] Time spent Scanning rows
where all are committed: real 0.297s user 0.298s sys 0.000s
I0607 12:01:10.821828 23532 memrowset-test.cc:446] Time spent Scanning rows
where all are committed: real 0.297s user 0.299s sys 0.000s
LLVM 4.0 (before this fix) (asm at [3]):
I0607 12:14:07.771476 12117 memrowset-test.cc:446] Time spent Scanning rows
where all are committed: real 0.524s user 0.525s sys 0.000s
I0607 12:14:14.283260 12117 memrowset-test.cc:446] Time spent Scanning rows
where all are committed: real 0.525s user 0.525s sys 0.000s
I0607 12:14:20.376360 12117 memrowset-test.cc:446] Time spent Scanning rows
where all are committed: real 0.543s user 0.543s sys 0.000s
I0607 12:14:26.457681 12117 memrowset-test.cc:446] Time spent Scanning rows
where all are committed: real 0.530s user 0.530s sys 0.000s
I0607 12:14:32.535305 12117 memrowset-test.cc:446] Time spent Scanning rows
where all are committed: real 0.532s user 0.532s sys 0.000s
I0607 12:14:38.615105 12117 memrowset-test.cc:446] Time spent Scanning rows
where all are committed: real 0.531s user 0.532s sys 0.000s
I0607 12:14:44.697655 12117 memrowset-test.cc:446] Time spent Scanning rows
where all are committed: real 0.531s user 0.531s sys 0.000s
I0607 12:14:50.779160 12117 memrowset-test.cc:446] Time spent Scanning rows
where all are committed: real 0.532s user 0.532s sys 0.000s
I0607 12:14:56.852665 12117 memrowset-test.cc:446] Time spent Scanning rows
where all are committed: real 0.531s user 0.530s sys 0.000s
I0607 12:15:02.954947 12117 memrowset-test.cc:446] Time spent Scanning rows
where all are committed: real 0.538s user 0.540s sys 0.000s
LLVM 4.0 (after this fix) (asm at [4]):
I0607 12:17:37.953876 17916 memrowset-test.cc:446] Time spent Scanning rows
where all are committed: real 0.292s user 0.292s sys 0.000s
I0607 12:17:44.214912 17916 memrowset-test.cc:446] Time spent Scanning rows
where all are committed: real 0.295s user 0.295s sys 0.000s
I0607 12:17:50.057200 17916 memrowset-test.cc:446] Time spent Scanning rows
where all are committed: real 0.299s user 0.299s sys 0.000s
I0607 12:17:55.882935 17916 memrowset-test.cc:446] Time spent Scanning rows
where all are committed: real 0.292s user 0.292s sys 0.000s
I0607 12:18:01.725445 17916 memrowset-test.cc:446] Time spent Scanning rows
where all are committed: real 0.293s user 0.293s sys 0.000s
I0607 12:18:07.562417 17916 memrowset-test.cc:446] Time spent Scanning rows
where all are committed: real 0.293s user 0.294s sys 0.000s
I0607 12:18:13.421221 17916 memrowset-test.cc:446] Time spent Scanning rows
where all are committed: real 0.293s user 0.293s sys 0.000s
I0607 12:18:19.268613 17916 memrowset-test.cc:446] Time spent Scanning rows
where all are committed: real 0.293s user 0.293s sys 0.000s
I0607 12:18:25.113490 17916 memrowset-test.cc:446] Time spent Scanning rows
where all are committed: real 0.293s user 0.294s sys 0.000s
I0607 12:18:30.950954 17916 memrowset-test.cc:446] Time spent Scanning rows
where all are committed: real 0.293s user 0.294s sys 0.000s
Looking at the assembly for the three tests above, it's clear that the
inlining is broken in 4.0 without this fix, and with it, the generated
assembly is a little better (shorter) than 3.9.
[1] http://lists.llvm.org/pipermail/llvm-dev/2017-May/113175.html
[2] https://gist.github.com/b4b0365ea1570f33d55c6c3dffd4a8af
[3] https://gist.github.com/128ecb292fddd069d5397a76dc53fc11
[4] https://gist.github.com/0cb01f454737d6c2dc457f0cdb6bb52d
Change-Id: I7e449df80e5cd02b9ce2dbf4daa5cf8151007dcf
Reviewed-on: http://gerrit.cloudera.org:8080/7099
Reviewed-by: Todd Lipcon <[email protected]>
Tested-by: Todd Lipcon <[email protected]>
Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/2d30f3a9
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/2d30f3a9
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/2d30f3a9
Branch: refs/heads/master
Commit: 2d30f3a96dd416e7991f552c9bf5189d178d8945
Parents: cf68624
Author: Todd Lipcon <[email protected]>
Authored: Tue Jun 6 16:54:54 2017 -0700
Committer: Todd Lipcon <[email protected]>
Committed: Thu Jun 8 20:28:08 2017 +0000
----------------------------------------------------------------------
src/kudu/codegen/CMakeLists.txt | 9 +++++++
src/kudu/codegen/code_generator.cc | 2 +-
src/kudu/codegen/codegen-test.cc | 8 ++++++
src/kudu/codegen/module_builder.cc | 46 +++++++++++++++++++++++++--------
4 files changed, 53 insertions(+), 12 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/kudu/blob/2d30f3a9/src/kudu/codegen/CMakeLists.txt
----------------------------------------------------------------------
diff --git a/src/kudu/codegen/CMakeLists.txt b/src/kudu/codegen/CMakeLists.txt
index 0357e38..275de47 100644
--- a/src/kudu/codegen/CMakeLists.txt
+++ b/src/kudu/codegen/CMakeLists.txt
@@ -117,6 +117,15 @@ list(REMOVE_ITEM IR_FLAGS "-fsanitize=thread"
"-DTHREAD_SANITIZER")
# again at runtime.
list(REMOVE_ITEM IR_FLAGS "-O3" "-O2" "-O1" "-Os")
+# Disable built-in LLVM passes which would add 'noinline' attributes to all
+# standalone functions.
+#
+# This is per the advice in https://reviews.llvm.org/D28053#629914 which says:
+#
+# "This will generate a frontend-optimized but backend pristine bitcode file
that
+# can be processed more or less depending on the desire of the user..."
+list(APPEND IR_FLAGS "-O" "-mllvm" "-disable-llvm-optzns")
+
# We need a library which depends on the IR source, because CMake+Ninja
# doesn't support IMPLICIT_DEPENDS in ADD_CUSTOM_COMMAND.
#
http://git-wip-us.apache.org/repos/asf/kudu/blob/2d30f3a9/src/kudu/codegen/code_generator.cc
----------------------------------------------------------------------
diff --git a/src/kudu/codegen/code_generator.cc
b/src/kudu/codegen/code_generator.cc
index 337321b..8be6388 100644
--- a/src/kudu/codegen/code_generator.cc
+++ b/src/kudu/codegen/code_generator.cc
@@ -218,7 +218,7 @@ Status CodeGenerator::CompileRowProjector(const Schema&
base, const Schema& proj
RETURN_NOT_OK(RowProjectorFunctions::Create(base, proj, out, &tm));
if (FLAGS_codegen_dump_mc) {
- static const int kInstrMax = 500;
+ static const int kInstrMax = 1500;
std::ostringstream sstr;
sstr << "Printing read projection function:\n";
int instrs = DumpAsm((*out)->read(), *tm, &sstr, kInstrMax);
http://git-wip-us.apache.org/repos/asf/kudu/blob/2d30f3a9/src/kudu/codegen/codegen-test.cc
----------------------------------------------------------------------
diff --git a/src/kudu/codegen/codegen-test.cc b/src/kudu/codegen/codegen-test.cc
index dccf155..a63b064 100644
--- a/src/kudu/codegen/codegen-test.cc
+++ b/src/kudu/codegen/codegen-test.cc
@@ -258,6 +258,14 @@ TEST_F(CodegenTest, ObservablesTest) {
ASSERT_EQ(iwith->is_identity(), iwithout.is_identity());
ASSERT_TRUE(iwith->is_identity());
}
+
+// Test empty projection
+TEST_F(CodegenTest, TestEmpty) {
+ Schema empty;
+ TestProjection<true>(&empty);
+ TestProjection<false>(&empty);
+}
+
// Test key projection
TEST_F(CodegenTest, TestKey) {
Schema key = base_.CreateKeyProjection();
http://git-wip-us.apache.org/repos/asf/kudu/blob/2d30f3a9/src/kudu/codegen/module_builder.cc
----------------------------------------------------------------------
diff --git a/src/kudu/codegen/module_builder.cc
b/src/kudu/codegen/module_builder.cc
index 9f825e6..61d9e13 100644
--- a/src/kudu/codegen/module_builder.cc
+++ b/src/kudu/codegen/module_builder.cc
@@ -41,6 +41,7 @@
#include <llvm/Support/raw_os_ostream.h>
#include <llvm/Support/SourceMgr.h>
#include <llvm/Transforms/IPO.h>
+#include <llvm/Transforms/IPO/AlwaysInliner.h>
#include <llvm/Transforms/IPO/PassManagerBuilder.h>
#include "kudu/codegen/precompiled.ll.h"
@@ -210,16 +211,33 @@ void ModuleBuilder::AddJITPromise(llvm::Function* llvm_f,
namespace {
-#if CODEGEN_MODULE_BUILDER_DO_OPTIMIZATIONS
-
-void DoOptimizations(ExecutionEngine* engine,
- Module* module,
+void DoOptimizations(Module* module,
const unordered_set<string>& external_functions) {
PassManagerBuilder pass_builder;
- pass_builder.OptLevel = 2;
// Don't optimize for code size (this corresponds to -O2/-O3)
pass_builder.SizeLevel = 0;
- pass_builder.Inliner = llvm::createFunctionInliningPass();
+#if CODEGEN_MODULE_BUILDER_DO_OPTIMIZATIONS
+ pass_builder.OptLevel = 2;
+ pass_builder.Inliner =
llvm::createFunctionInliningPass(pass_builder.OptLevel,
+
pass_builder.SizeLevel);
+#else
+ // Even if we don't want to do optimizations, we have to run the
"AlwaysInliner" pass.
+ // This pass ensures that any functions marked 'always_inline' are inlined,
but nothing
+ // else.
+ //
+ // If we don't, the following happens:
+ // - symbols in libc++ (eg
_ZNKSt3__19basic_iosIcNS_11char_traitsIcEEE5rdbufEv) are
+ // marked as __attribute__((always_inline)) in the header.
+ // - those symbols end up included with 'local' visibility in libc++.so,
since the compiler
+ // knows that all call sites should inline them.
+ // - if we don't run any inliner at all, then our generated code generates
LLVM
+ // 'invoke' instructions to try to call these external functions, despite
them
+ // being marked 'always_inline'.
+ // - these 'invoke' instructions fail to link at runtime since they can't
find the
+ // dynamic symbol (due to its local visibility)
+ pass_builder.OptLevel = 0;
+ pass_builder.Inliner = llvm::createAlwaysInlinerLegacyPass();
+#endif
FunctionPassManager fpm(module);
pass_builder.populateFunctionPassManager(fpm);
@@ -239,6 +257,16 @@ void DoOptimizations(ExecutionEngine* engine,
module_passes.add(llvm::createInternalizePass([&](const GlobalValue& v) {
return ContainsKey(external_functions, v.getGlobalIdentifier());
}));
+
+ // Run Global Dead Code Elimination.
+ //
+ // This is responsible for removing any unreferenced functions. This is
+ // important to do even in -O0 to workaround an issue we see when our
generated
+ // functions are actually empty. In that case, for whatever reason (perhaps
a bug in LLVM?)
+ // the compiled module would try to include versions of functions with calls
to
+ // other functions marked "alwaysinline". The latter functions would not get
linked
+ // in our compiled module, and then the module would fail to load.
+ module_passes.add(llvm::createGlobalDCEPass());
pass_builder.populateModulePassManager(module_passes);
// Same as above, the result here just indicates whether optimization made
any changes.
@@ -246,8 +274,6 @@ void DoOptimizations(ExecutionEngine* engine,
ignore_result(module_passes.run(*module));
}
-#endif
-
vector<string> GetHostCPUAttrs() {
// LLVM's ExecutionEngine expects features to be enabled or disabled with a
list
// of strings like ["+feature1", "-feature2"].
@@ -288,9 +314,7 @@ Status ModuleBuilder::Compile(unique_ptr<ExecutionEngine>*
out) {
}
module->setDataLayout(target_->createDataLayout());
-#if CODEGEN_MODULE_BUILDER_DO_OPTIMIZATIONS
- DoOptimizations(local_engine.get(), module, GetFunctionNames());
-#endif
+ DoOptimizations(module, GetFunctionNames());
// Compile the module
local_engine->finalizeObject();