Hi!

On Tue, Aug 4, 2020 at 11:22 AM Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote:
> I want to share the results of my latest research on implementing an LSM
> index in Postgres.
> Most modern databases (RocksDB, MongoDB, Tarantool, ...) are using an LSM
> tree instead of a classical B-tree.
I wouldn't say it that way. I would say they provide LSM in addition to the
B-tree. For instance, WiredTiger (which is the main engine for MongoDB)
provides both a B-tree and an LSM. As far as I know, Tarantool provides at
least an in-memory B-tree. If RocksDB is used as an engine for MySQL, then
it's also an addition to the B-tree provided by InnoDB. Also, the B-tree
implementations in the mentioned DBMSes are very different; I would say none
of them is purely classical.

> LSM approach tries to address this problem.

LSM has great use cases for sure.

> I have significantly rewritten their FDW implementation: the original
> RocksDB server implementation was single-threaded.
> I have made it multithreaded, making it possible to run multiple RocksDB
> requests in parallel.
> My implementation can be found here:
> https://github.com/postgrespro/lsm

Great, thank you for your work.

> Some results of benchmarking.
> The benchmark is just insertion of 250 million records with random keys into
> an inclusive index containing one bigint key and 8 bigint fields.
> The size of the index is about 20GB and the target system has 16GB of RAM:

What storage do you use?

> As you can see, there is about a 10 times difference.
> Maybe somebody will find this idea of using an IOT (index-organized table)
> based on RocksDB in Postgres useful.
> But this approach violates all ACID properties of Postgres:
> there is no atomicity and consistency (in principle RocksDB supports 2PC,
> but it is not used here),
> isolation corresponds to something like "read uncommitted",
> and concerning durability - it is all up to RocksDB, and I have serious
> doubts that it will survive a failure, especially with sync write mode
> disabled.
> So I considered this project mostly as a prototype for estimating the
> efficiency of the LSM approach.

Yes, integration of WAL and snapshots between Postgres and RocksDB is
problematic. I also doubt that RocksDB can use the full power of the Postgres
extensible type system.

> Then I thought about implementing the ideas of LSM using the standard
> Postgres nbtree.
>
> We need two indexes: a small one for fast inserts and another, big (main)
> index. The top index is small enough to fit in memory,
> so inserts into this index are very fast.
> Periodically we will merge data from the top index into the base index and
> truncate the top index. To prevent blocking of inserts into the table
> while we are merging indexes, we can add ... one more index, which will be
> used during the merge.
>
> So the final architecture of Lsm3 is the following:
> two top indexes used in a cyclic way and one main index. When a top index
> reaches some threshold value,
> we initiate a merge with the main index, done by a bgworker, and switch to
> the other top index.
> As merging of the indexes is done in the background, it doesn't affect
> insert speed.
> Unfortunately the Postgres index AM has no bulk insert operation, so we have
> to perform normal inserts.
> But the inserted data is already sorted by key, which should improve access
> locality and partly solve the random reads problem for the base index.
>
> Certainly, to perform a search in Lsm3 we have to make lookups in all three
> indexes and merge the search results
> (in the case of a unique index we can avoid extra searches if the searched
> value is found in a top index).
> It can happen that during the merge of the top and base indexes the same TID
> is found in both of them.
> But such duplicates can be easily eliminated during the merge of the search
> results.

You use a fixed number of trees. Is this a limitation of the prototype or an
intention of the design? I guess if the base index is orders of magnitude
bigger than RAM, this scheme can degrade greatly.
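Just to check my understanding of the read path, here is a rough,
self-contained sketch of a three-way merge of search results with duplicate
TID elimination. It is a toy model that merges streams in TID order rather
than key order, and all names here are hypothetical, not taken from the Lsm3
code:

/* Toy sketch (not the actual Lsm3 code): merge results from the two top
 * indexes and the base index, dropping duplicate TIDs that can appear
 * while a top->base merge is in progress.  Streams are sorted by TID. */
#include <stdio.h>
#include <stdint.h>

typedef uint64_t ItemPointer;           /* stand-in for a heap TID */
#define INVALID_TID UINT64_MAX

typedef struct
{
    const ItemPointer *tids;            /* sorted stream of matching TIDs */
    int                ntids;
    int                pos;
} IndexStream;

static ItemPointer
stream_peek(const IndexStream *s)
{
    return (s->pos < s->ntids) ? s->tids[s->pos] : INVALID_TID;
}

/* Return the next distinct TID across all streams, or INVALID_TID when done. */
static ItemPointer
merged_getnext(IndexStream *streams, int nstreams)
{
    ItemPointer min = INVALID_TID;
    int         i;

    for (i = 0; i < nstreams; i++)
    {
        ItemPointer tid = stream_peek(&streams[i]);

        if (tid < min)
            min = tid;
    }
    if (min == INVALID_TID)
        return INVALID_TID;

    /* Advance every stream positioned at this TID: this is where a
     * duplicate produced by an in-progress top->base merge is dropped. */
    for (i = 0; i < nstreams; i++)
        if (stream_peek(&streams[i]) == min)
            streams[i].pos++;

    return min;
}

int
main(void)
{
    /* TID 30 is present both in a top index and in the base index,
     * as could happen during a background merge. */
    ItemPointer top1[] = {10, 30};
    ItemPointer top2[] = {20};
    ItemPointer base[] = {30, 40};
    IndexStream streams[3] = {{top1, 2, 0}, {top2, 1, 0}, {base, 2, 0}};
    ItemPointer tid;

    while ((tid = merged_getnext(streams, 3)) != INVALID_TID)
        printf("heap tid %llu\n", (unsigned long long) tid);
    return 0;
}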
> As far as we are using standard nbtree indexes, there is no need to worry
> about logging information in the WAL.
> There is no need to use inefficient "generic WAL records" or to patch the
> kernel by adding our own WAL records.

As I understand it, the merge operations are logged in the same way as
ordinary inserts. This seems acceptable, but a single insert operation would
eventually cause several times more WAL than it does in a plain B-tree
(especially if we'll have an implementation with a flexible number of trees).
In principle, that could be avoided if recovery could replay the logic of the
tree merge on its side. But that approach can hardly fit Postgres in many
ways.

> I have performed the same benchmark with random inserts (described above)
> for Lsm3:
>
> Index              Clients  TPS
> Inclusive B-Tree   1          9387
> Inclusive B-Tree   10        18761
> RocksDB FDW        1        138350
> RocksDB FDW        10       122369
> RocksDB            1        166333
> RocksDB            10       141482
> Lsm3               1        151699
> Lsm3               10        65997
>
> The size of the nbtree is about 29GB; single-client performance is even
> higher than that of the RocksDB FDW, but the parallel results are
> significantly worse.

Did you investigate the source of the degradation? Such degradation doesn't
look inevitable to me yet. Probably, things could be improved.

> I will be glad to receive any feedback, maybe some change requests or
> proposals.

As I understand it, you benchmarked just inserts. But what about vacuum? I
think the way Postgres vacuum works now isn't optimal for an LSM. Postgres
vacuum requires a full scan of the index, because it provides a bitmap of
TIDs to be deleted without any information about the index keys. For an LSM,
it would be better if vacuum could push delete requests to the top level of
the LSM (as specially marked tuples or something). Thanks to that, index
deletes could be as efficient as inserts. This is especially important for an
LSM with many levels and/or aggressive vacuum.

------
Regards,
Alexander Korotkov
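P.S. To illustrate what I mean by pushing delete requests to the top level:
vacuum could insert key/TID "tombstone" entries into the small top tree, and
the background merge would then drop the matching base-index entries instead
of requiring a full index scan. Below is a toy, self-contained sketch of that
merge step; all names are made up for illustration, and this is not actual
Lsm3 or nbtree code.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

typedef struct
{
    int64_t  key;
    uint64_t tid;
    bool     is_tombstone;   /* true: delete request pushed by vacuum */
} TopEntry;

/* Merge the key-sorted top tree into the key-sorted base index.  A tombstone
 * cancels the matching base entry instead of being copied to the output. */
static int
merge_top_into_base(const TopEntry *top, int ntop,
                    const TopEntry *base, int nbase,
                    TopEntry *out)
{
    int i = 0, j = 0, n = 0;

    while (i < ntop || j < nbase)
    {
        if (j >= nbase || (i < ntop && top[i].key <= base[j].key))
        {
            if (top[i].is_tombstone)
            {
                /* Drop the matching base entry, if present, and the tombstone. */
                if (j < nbase && base[j].key == top[i].key &&
                    base[j].tid == top[i].tid)
                    j++;
            }
            else
                out[n++] = top[i];
            i++;
        }
        else
            out[n++] = base[j++];
    }
    return n;
}

int
main(void)
{
    TopEntry top[]  = {{5, 501, false}, {7, 700, true}};  /* insert 5, delete 7 */
    TopEntry base[] = {{7, 700, false}, {9, 900, false}};
    TopEntry out[4];
    int      n = merge_top_into_base(top, 2, base, 2, out);

    for (int k = 0; k < n; k++)
        printf("key=%lld tid=%llu\n",
               (long long) out[k].key, (unsigned long long) out[k].tid);
    return 0;
}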