Hi,

Thanks for taking a look!

On 12/1/22 22:12, Dmitry Dolgov wrote:

> First, to summarize things a bit: from what I understand there are two
> suggestions on the table, one is about caching modules when doing
> inlining, the second is about actual lazy jitting. Are those too tightly
> coupled together, or could they be split into two separate patches? It
> would make it a bit easier to review and test.

Yes.

> I was playing with the caching part of the patch (still have to think
> about the lazy jitting), which generally seems to be a good idea. From
> what I see, the main concern here is the chance that IRMover will rename
> types, degrading the performance of the generated code in the long run.
> I have to admit I'm not fully sure the mentioned LLVM 13 fix [1]
> completely eliminates those concerns; its description is formulated in a
> not very committal way ("don't know if this patch fixes ..., but it does
> fix a few soundness issues that have crept in."). But I haven't found
> any crashes or false asserts coming from the LLVM side (using v13) when
> running several rounds of regression tests with forced jitting, so a
> point to the fix.
>
> I would be curious to learn how Andres was troubleshooting type-renaming
> issues. Using LLVM 13 from packages, jitting the same query twice and
> dumping the bitcode showed some differences in types a la
> "%struct.ExprState.142" vs. "%struct.ExprState.430" (from what I
> understood, the whole thing, including the number, is the identifier)
> with the patch, but the very same changes happen on the main branch as
> well. Of course, I was inspecting the bitcode only for certain queries;
> that doesn't rule out that some other examples actually feature type
> renaming.
>
> In general, it would be interesting to know how to do some sort of
> "smoke test" for the generated code, e.g. in case LLVM has fixed this
> particular issue but it reappears in the future.

+1, especially with my findings described below.
> I did a few performance tests and got numbers similar to those posted in
> the thread, with inlining time being impressively reduced (~10 times) as
> well as (suspiciously) optimization time (~1.5 times). The approach was
> a bit different though: I ran the sequence of example queries from the
> thread using pgbench and checked the jit counters from
> pg_stat_statements.
>
> A few small notes:
>
> If caching of modules is safe from LLVM >= 13 on, I guess it should be
> wrapped into a corresponding condition on LLVM_VERSION_MAJOR, right?
>
> Why was the assert about hasExternalLinkage removed from
> llvmjit_inline.cpp?
>
> For so much discussion about such a small change, there is definitely
> not enough commentary in the code about the dangers of cloning modules.
>
> Also, maybe a stupid question, but how big should this cache get? From
> what I understand it will be cleared on llvm_shutdown; does that mean it
> can grow unbounded if e.g. a single session keeps firing different
> queries at the db?
The cache doesn't contain the modules generated for a query but the bitcode files with the to-be-inlined functions stored on disk. So the cache would stop growing as soon as a connection process has loaded all of them. Nevertheless, if a workload uses many connections and truly all or at least most of the bitcode files that could be quite some memory.

Beyond that, I made an unfortunate discovery: the inlining time goes down by so much because no inlining is happening anymore :-/. This is because I cloned the module only when passing it to the IRMover, not right after taking it from the cache. However, the module is still modified after being taken from the cache: llvm_execute_inline_plan() executes

if (valueToImport->hasExternalLinkage())
{
    valueToImport->setLinkage(LinkageTypes::AvailableExternallyLinkage);
}

But when checking whether a function is inlinable in function_inlinable(), we check

if (F.hasAvailableExternallyLinkage())
    return false;

which makes the code bail out for every function that is supposed to be inlined.

It's very curious why we didn't spot that when dumping the bitcode. It seems the bitcode is always different enough to miss it. So +1 on your point about having a smoke test in place to verify things still work.

If I change the code to clone the module right after taking it from the cache and before making any changes to it, the JIT time remains high and even seems higher than when re-reading the file from disk. Profiling showed that llvm::CloneModule() is just super slow, especially for big bitcode files like numeric.bc. I haven't investigated why that is and whether we can do something about it. I also don't plan to do so for the time being.

For reference, I attached the patch which only contains the caching code. It's the variant that clones early.

--
David Geier
(ServiceNow)
From 6a13803de83aaa46d34880e1b7c44bf93eb87f35 Mon Sep 17 00:00:00 2001
From: David Geier <geidav...@gmail.com>
Date: Mon, 27 Jun 2022 12:28:29 +0200
Subject: [PATCH] Cache modules in JIT inlining

---
 src/backend/jit/llvm/llvmjit_inline.cpp | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/src/backend/jit/llvm/llvmjit_inline.cpp b/src/backend/jit/llvm/llvmjit_inline.cpp
index dc35e002f5..b6bc599e60 100644
--- a/src/backend/jit/llvm/llvmjit_inline.cpp
+++ b/src/backend/jit/llvm/llvmjit_inline.cpp
@@ -63,6 +63,9 @@ extern "C"
 #include <llvm/Linker/IRMover.h>
 #include <llvm/Support/ManagedStatic.h>
 #include <llvm/Support/MemoryBuffer.h>
+#if LLVM_VERSION_MAJOR >= 13
+#include <llvm/Transforms/Utils/Cloning.h>
+#endif
 
 
 /*
@@ -378,13 +381,19 @@ llvm_execute_inline_plan(llvm::Module *mod, ImportMapTy *globalsToInline)
 		const llvm::StringSet<>& modGlobalsToInline = toInline.second;
 		llvm::SetVector<llvm::GlobalValue *> GlobalsToImport;
 
-		Assert(module_cache->count(modPath));
-		std::unique_ptr<llvm::Module> importMod(std::move((*module_cache)[modPath]));
-		module_cache->erase(modPath);
-
 		if (modGlobalsToInline.empty())
 			continue;
 
+		auto iter = module_cache->find(modPath);
+		Assert(iter != module_cache->end());
+
+#if LLVM_VERSION_MAJOR >= 13
+		auto importMod = llvm::CloneModule(*iter->second);
+#else
+		auto importMod = std::move(iter->second);
+		module_cache->erase(modPath);
+#endif
+
 		for (auto &glob: modGlobalsToInline)
 		{
 			llvm::StringRef SymbolName = glob.first();
@@ -491,7 +500,11 @@ load_module(llvm::StringRef Identifier)
 	if (LLVMCreateMemoryBufferWithContentsOfFile(path, &buf, &msg))
 		elog(FATAL, "failed to open bitcode file \"%s\": %s",
 			 path, msg);
+#if LLVM_VERSION_MAJOR >= 13
+	if (LLVMParseBitcodeInContext2(LLVMGetGlobalContext(), buf, &mod))
+#else
 	if (LLVMGetBitcodeModuleInContext2(LLVMGetGlobalContext(), buf, &mod))
+#endif
 		elog(FATAL, "failed to parse bitcode in file \"%s\"", path);
 
 	/*
-- 
2.34.1
