On Sun, Jun 20, 2021 at 6:56 AM David Rowley <dgrowle...@gmail.com> wrote:
> On Wed, 14 Aug 2019 at 19:25, David Rowley <david.row...@2ndquadrant.com> wrote:
> > For now, I'm out of ideas. If anyone else feels like suggesting
> > something or picking this up, feel free.
>
> This is a pretty old thread, so we might need a recap:
>
> # Recap
>
> Basically LockReleaseAll() becomes slow after a large number of locks
> have all been held at once by a backend. The problem is that the
> LockMethodLocalHash dynahash table must grow to store all the locks,
> and when later transactions take only a few locks, LockReleaseAll() is
> slow due to hash_seq_search() having to skip the sparsely filled
> buckets in the bloated hash table.
>
> The following things were tried on this thread. Each one failed:
>
> 1) Use a dlist in LOCALLOCK to track the next and prev LOCALLOCK.
> Simply loop over the dlist in LockReleaseAll().
> 2) Try dropping and recreating the dynahash table if it becomes
> bloated, using some heuristics to decide what "bloated" means and
> whether recreating is worthwhile.
>
> #1 failed due to concerns with increasing the size of LOCALLOCK to
> store the dlist pointers. Performance regressions were seen too,
> possibly due to the size increase or the additional overhead of
> pushing onto the dlist.
> #2 failed because it was difficult to agree on what the heuristics
> would be, and we had no efficient way to determine the maximum number
> of locks that a given transaction held at any one time. We only know
> how many were still held at LockReleaseAll().
>
> There were also some suggestions to fix dynahash's hash_seq_search()
> slowness, and also a suggestion to try using simplehash.h instead of
> dynahash.c. Unfortunately, simplehash.h would suffer from the same
> issue, as it too would have to skip over empty buckets in a sparsely
> populated hash table.
>
> I'd like to revive this effort as I have a new idea on how to solve
> the problem.
>
> # Background
>
> Over in [1] I'm trying to improve the performance of smgropen()
> during recovery.
> The slowness there comes from the dynahash table lookups to find the
> correct SMgrRelation. Over there I proposed to use simplehash instead
> of dynahash because it's a good bit faster and far lessens the hash
> lookup overhead during recovery. One problem on that thread is that
> relcache keeps a pointer into the SMgrRelation (RelationData.rd_smgr),
> and because simplehash moves things around during inserts and deletes,
> we can't have anything point to simplehash entries; they're unstable.
> I fixed that over on the other thread by having the simplehash entry
> point to a palloced SMgrRelationData... My problem is, I don't really
> like that idea as it means we need to palloc()/pfree() lots of little
> chunks of memory.
>
> To fix the above, I really think we need a version of simplehash that
> has stable pointers. Provided that implementation is faster than
> dynahash, it will help solve the smgropen() slowness during recovery.
>
> # A new hashtable implementation
>
> I ended up thinking of this thread again because the implementation of
> the stable pointer hash that I ended up writing for [1] happens to be
> lightning fast for hash table sequential scans, even if the table has
> become bloated. The reason the seq scans are so fast is that the
> implementation loops over the data arrays, which are tightly packed
> and store the actual data rather than pointers to the data. The code
> does not need to loop over the bucket array for this at all, so how
> large that has become is irrelevant to hash table seq scan
> performance.
>
> The patch stores elements in "segments", each of which holds some
> power-of-2 number of items. When we run out of space to store new
> items in a segment, we just allocate another segment. When we remove
> items from the table, new items reuse the first unused item in the
> first segment with free space. This helps to keep the used elements
> tightly packed.
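To check my understanding of the segment design described here, a minimal sketch (the names, sizes, and element type are mine, not the patch's): each segment carries a used-items bitmap, and flipping that bitmap gives you the free list directly.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ITEMS_PER_SEGMENT 64			/* must be a power of 2 */
typedef uint64_t bitmap_word;

/* Hypothetical stand-in for a generichash segment. */
typedef struct HashSegment
{
	bitmap_word	used_items;				/* one bit per used item */
	int			nitems;					/* number of bits set */
	int			items[ITEMS_PER_SEGMENT];	/* stand-in for GH_ELEMENT_TYPE */
} HashSegment;

/*
 * Claim the first free item in the segment, or return -1 if full.
 * Flipping the bits of used_items turns it into a free list, so a
 * single count-trailing-zeros finds the lowest free slot, which keeps
 * the used elements packed toward the front of the array.
 */
static int
segment_alloc_item(HashSegment *seg)
{
	bitmap_word	free_bits = ~seg->used_items;
	int			idx;

	if (free_bits == 0)
		return -1;				/* segment is full */
	idx = __builtin_ctzll(free_bits);
	seg->used_items |= ((bitmap_word) 1) << idx;
	seg->nitems++;
	return idx;
}

/* Release an item so a later insert can reuse the slot. */
static void
segment_free_item(HashSegment *seg, int idx)
{
	seg->used_items &= ~(((bitmap_word) 1) << idx);
	seg->nitems--;
}
```

If that's roughly the shape of it, I can see why seq scans stay fast regardless of the bucket array size.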
> A segment keeps a bitmap of used items, which means scanning all used
> items is very fast. If you flip the bits in the used_items bitmap,
> you get a free list for that segment, so it's also very fast to find
> a free element when inserting a new item into the table.
>
> I've called the hash table implementation "generichash". It uses the
> same preprocessor tricks as simplehash.h does and borrows the same
> linear probing code that's used in simplehash. The bucket array just
> stores the hash value and a uint32 index into the segment item that
> stores the data. Since segments store a power of 2 items, we can
> easily address both the segment number and the item within the
> segment from the single uint32 index value. The 'index' field just
> stores a special value when the bucket is empty; no need to add
> another field for that. This means the bucket type is just 8 bytes
> wide.
>
> One thing I will mention about the new hash table implementation is
> that GH_ITEMS_PER_SEGMENT is, by default, set to 256. This means
> that's the minimum size for the table. I could drop this down to 64,
> but that's still quite a bit more than the default size of the
> dynahash table of 16. I think 16 is a bit on the low side and that it
> might be worth making this 64 anyway. I'd just need to lower
> GH_ITEMS_PER_SEGMENT. The current code does not allow it to go lower,
> as I've done nothing to allow partial bitmap words; they're 64 bits
> wide on a 64-bit machine.
>
> I've not done too much benchmarking between it and simplehash.h, but
> I think in some cases it will be faster. Since the bucket type is
> just 8 bytes, moving stuff around during inserts/deletes will be
> cheaper than with simplehash. Lookups are likely to be a bit slower
> due to having to look up the item within the segment, which is a few
> pointer dereferences away.
>
> A quick test shows an improvement when compared to dynahash.
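To make the single-uint32 addressing concrete, a small sketch. GH_ITEMS_PER_SEGMENT and GH_INDEX_SEGMENT are from the patch; GH_INDEX_ITEM and GHBucket are my guesses at the shape, so take them as illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/*
 * With a power-of-2 segment size, one uint32 index addresses both the
 * segment number and the item within it; with 256 items per segment,
 * the high bits select the segment and the low 8 bits the item.
 */
#define GH_ITEMS_PER_SEGMENT 256
#define GH_INDEX_SEGMENT(i)	((i) / GH_ITEMS_PER_SEGMENT)
#define GH_INDEX_ITEM(i)	((i) % GH_ITEMS_PER_SEGMENT)	/* name assumed */

/*
 * The bucket stores just the hash value and the index, 8 bytes in
 * total; a special index value marks an empty bucket.
 */
typedef struct GHBucket
{
	uint32_t	hashvalue;
	uint32_t	index;
} GHBucket;
```

An 8-byte bucket should indeed make the linear-probing moves cheap.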
>
> # select count(pg_try_advisory_lock(99999,99999)) from
> generate_series(1,1000000);
>
> Master:
> Time: 190.257 ms
> Time: 189.440 ms
> Time: 188.768 ms
>
> Patched:
> Time: 182.779 ms
> Time: 182.267 ms
> Time: 186.902 ms
>
> This is just hitting the local lock table. The advisory lock key is
> the same each time, so it remains a simple lock check. Also, it's a
> pretty small gain, but I'm mostly trying to show that the new hash
> table is not any slower than dynahash when probing for inserts.
>
> The real wins come from what we're trying to solve in this thread --
> the performance of LockReleaseAll().
>
> Benchmarking results measuring the TPS of a simple select from an
> empty table after another transaction has bloated the locallock
> table:
>
> Master:
> 127544 tps
> 113971 tps
> 123255 tps
> 121615 tps
>
> Patched:
> 170803 tps
> 167305 tps
> 165002 tps
> 171976 tps
>
> About 38% faster.
>
> The benchmark I used was:
>
> t1.sql:
> \set p 1
> select a from t1 where a = :p
>
> hp.sql:
> select count(*) from hp
>
> "hp" is a hash partitioned table with 10k partitions.
>
> pgbench -j 20 -c 20 -T 60 -M prepared -n -f hp.sql@1 -f t1.sql@100000 postgres
>
> I'm using the query to the hp table to bloat the locallock table.
> It's only executed every 1 in 100,000 queries. The tps numbers above
> are the ones for t1.sql.
>
> I've not quite looked into why yet, but hp.sql performance improved
> by 58%. It went from an average of 1.061377 in master to an average
> of 1.683616 in the patched version. I can't quite see where this gain
> is coming from. It's pretty hard to get good stable performance
> results out of this AMD machine, so it might be related to that.
> That's why I ran 20 threads; it seems slightly better. The machine
> seems to have trouble waking up properly for a single thread.
>
> It would be good if someone could repeat the tests to see if the
> gains appear on other hardware.
>
> Also, it would be good to hear what people think about solving the
> problem this way.
>
> Patch attached.
>
> David
>
> [1] https://www.postgresql.org/message-id/flat/caaphdvpkwoglh_byg7jproxn8b2g2t9dwdcqsmksxg5+wwz...@mail.gmail.com

Hi,

+ * GH_ELEMENT_TYPE defines the data type that the hashtable stores. Each
+ * instance of GH_ELEMENT_TYPE which is stored in the hash table is done so
+ * inside a GH_SEGMENT.

I think the second sentence can be written as follows (since "done"
here just means "stored", it is redundant):

Each instance of GH_ELEMENT_TYPE is stored in the hash table inside a
GH_SEGMENT.

+ * Macros for translating a bucket's index into the segment and another to
+ * determine the item number within the segment.
+ */
+#define GH_INDEX_SEGMENT(i) (i) / GH_ITEMS_PER_SEGMENT

into the segment -> into the segment number (in the code I see segidx,
but I wonder if "segment index" may cause slight confusion).

+ GH_BITMAP_WORD used_items[GH_BITMAP_WORDS]; /* A 1-bit for each used item
+ * in the items array */

'A 1-bit' -> One bit ("A" and "1" mean the same thing here)

+ uint32 first_free_segment;

Since the segment may not be totally free, maybe name the field
first_segment_with_free_slot.

+ * This is similar to GH_NEXT_ONEBIT but flips the bits before operating on
+ * each GH_BITMAP_WORD.

It seems the only difference from GH_NEXT_ONEBIT is in this line:

+ GH_BITMAP_WORD word = ~(words[wordnum] & mask); /* flip bits */

If a 4th parameter is added to signify whether the flipping should be
done, these two functions can be unified.

+ * next insert will store in this segment. If it's already pointing to an
+ * earlier segment, then leave it be.

The last sentence is confusing: first_free_segment cannot point to this
segment and an earlier segment at the same time. Maybe drop the last
sentence.

Cheers
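P.S. For illustration, the unification of GH_NEXT_ONEBIT and
GH_NEXT_ZEROBIT might look something like this. The signature and names
are hypothetical; the real functions in the patch presumably take
different arguments and state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t GH_BITMAP_WORD;
#define GH_BITS_PER_WORD 64

/*
 * Return the position of the next set bit at or after 'start', or,
 * with invert = true, the next clear bit; -1 if none exists within
 * nwords words.  A single 'invert' flag replaces the two near-twin
 * functions.
 */
static int
gh_next_bit(const GH_BITMAP_WORD *words, int nwords, int start, bool invert)
{
	int			startword = start / GH_BITS_PER_WORD;

	for (int wordnum = startword; wordnum < nwords; wordnum++)
	{
		GH_BITMAP_WORD mask = ~(GH_BITMAP_WORD) 0;
		GH_BITMAP_WORD word;

		/* mask off bits below 'start' in the first word we look at */
		if (wordnum == startword)
			mask <<= start % GH_BITS_PER_WORD;

		/* flip first, then mask, so masked-off bits stay zero */
		word = (invert ? ~words[wordnum] : words[wordnum]) & mask;
		if (word != 0)
			return wordnum * GH_BITS_PER_WORD + __builtin_ctzll(word);
	}
	return -1;
}
```

One caveat: the flip has to happen before the mask is applied (unlike
the quoted line, which flips the already-masked word), otherwise the
masked-off low bits would read as "clear" bits.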