On Fri, Jan 26, 2024 at 2:07 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
>
> On Wed, Dec 20, 2023 at 12:11 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
> >
> > On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada <sawada.m...@gmail.com> 
> > wrote:
> > >
> > > On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila <amit.kapil...@gmail.com> 
> > > wrote:
> > > >
> > > > On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada <sawada.m...@gmail.com> 
> > > > wrote:
> > > > >
> > > > > On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila 
> > > > > <amit.kapil...@gmail.com> wrote:
> > > > > >
> > > > > >
> > > > > > The individual transactions shouldn't cross
> > > > > > 'logical_decoding_work_mem'. I got a bit confused by your proposal 
> > > > > > to
> > > > > > maintain the lists: "...splitting it into two lists: transactions
> > > > > > consuming 5% < and 5% >=  of the memory limit, and checking the 5% 
> > > > > > >=
> > > > > > list preferably.". In the previous sentence, what did you mean by
> > > > > > transactions consuming 5% >= of the memory limit? I got the 
> > > > > > impression
> > > > > > that you are saying to maintain them in a separate transaction list
> > > > > > which doesn't seem to be the case.
> > > > >
> > > > > I meant that there are three lists in total: the first one
> > > > > maintains the transactions consuming more than 10% of
> > > > > logical_decoding_work_mem,
> > > > >
> > > >
> > > > How can we have multiple transactions in the list consuming more than
> > > > 10% of logical_decoding_work_mem? Shouldn't we perform serialization
> > > > before any xact reaches logical_decoding_work_mem?
> > >
> > > Well, suppose logical_decoding_work_mem is set to 64MB; then
> > > transactions consuming more than 6.4MB are added to the list. So,
> > > for example, it's possible that the list has three transactions,
> > > each consuming 10MB, while the total memory usage in the
> > > reorderbuffer is still 30MB (less than logical_decoding_work_mem).
> > >
> >
> > Thanks for the clarification. I had misunderstood the list to have
> > transactions greater than 70.4 MB (64 + 6.4) in your example. But
> > one thing to note is that maintaining these lists by default can
> > also add some overhead, unless we maintain them only once the
> > number of open transactions crosses a certain threshold.
> >
>
> On further analysis, I realized that the approach discussed here might
> not be the way to go. The point of dividing transactions into several
> subgroups is to split a large number of entries into smaller sets so
> that we can reduce the complexity of searching for a particular entry.
> Since we assume that there are no big differences in entries' sizes
> within a subgroup, we can pick an entry to evict in O(1). However,
> what we really need to avoid here is increasing the number of
> evictions, because serializing an entry to disk is generally more
> costly than searching for an entry in memory.
>
> I think that's no problem for a large-entries subgroup, but when it
> comes to the smallest-entries subgroup, say entries consuming less
> than 5% of the limit, it could end up evicting many entries. For
> example, there would be a huge difference between serializing 1 entry
> consuming 5% of the memory limit and serializing 5000 entries each
> consuming 0.001% of the memory limit. Even if we can select the 5000
> entries quickly, I think the latter would be slower in total. Also,
> the more subgroups we create, the more complex the algorithm gets and
> the more overhead it could incur. So I think we need to search for
> the largest entry in order to minimize the number of evictions anyway.
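>
> To put rough numbers on that comparison: assume each serialization
> pays a fixed per-write overhead C plus a per-byte cost b, and the
> memory limit is L (symbols of my own, just for illustration):
>
>     1 eviction:     C + b * (0.05 * L)
>     5000 evictions: 5000 * C + 5000 * b * (0.00001 * L)
>                   = 5000 * C + b * (0.05 * L)
>
> The byte volume written is the same in both cases; the second case
> just pays the fixed overhead 5000 times instead of once.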
>
> Looking at data structures and algorithms, I think binaryheap with
> some improvements could be promising. I mentioned before why we
> cannot use the current binaryheap[1]. The missing pieces are
> efficient ways to remove an arbitrary entry and to update an
> arbitrary entry's key. The current binaryheap provides
> binaryheap_remove_node(), which is O(log n), but it requires the
> entry's position in the binaryheap. We know the entry's position just
> after binaryheap_add_unordered(), but it might change after heapify,
> and searching for a node's position is O(n). So the improvement idea
> is to add a hash table to the binaryheap that tracks the position of
> each entry, so that we can remove an arbitrary entry in O(log n) and
> also update an arbitrary entry's key in O(log n). This is known as an
> indexed priority queue. I've attached the patches for that (0001 and
> 0002).
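>
> To make this concrete, here is a minimal, self-contained sketch of
> such an indexed heap (a max-heap plus a position map), not the actual
> patches: all names are made up, and a plain array keyed by a small
> integer id stands in for the hash table:
>
> #define MAX_ENTRIES 1024
>
> typedef struct { int id; long key; } HeapNode;
>
> typedef struct
> {
>     HeapNode nodes[MAX_ENTRIES];
>     int      pos[MAX_ENTRIES]; /* pos[id] = index in nodes[], or -1 */
>     int      size;
> } IndexedHeap;
>
> /* Swap two nodes and keep the position map in sync. */
> static void
> swap_nodes(IndexedHeap *h, int i, int j)
> {
>     HeapNode tmp = h->nodes[i];
>
>     h->nodes[i] = h->nodes[j];
>     h->nodes[j] = tmp;
>     h->pos[h->nodes[i].id] = i;
>     h->pos[h->nodes[j].id] = j;
> }
>
> static void
> sift_up(IndexedHeap *h, int i)
> {
>     while (i > 0 && h->nodes[i].key > h->nodes[(i - 1) / 2].key)
>     {
>         swap_nodes(h, i, (i - 1) / 2);
>         i = (i - 1) / 2;
>     }
> }
>
> static void
> sift_down(IndexedHeap *h, int i)
> {
>     for (;;)
>     {
>         int largest = i;
>         int l = 2 * i + 1;
>         int r = 2 * i + 2;
>
>         if (l < h->size && h->nodes[l].key > h->nodes[largest].key)
>             largest = l;
>         if (r < h->size && h->nodes[r].key > h->nodes[largest].key)
>             largest = r;
>         if (largest == i)
>             break;
>         swap_nodes(h, i, largest);
>         i = largest;
>     }
> }
>
> static void
> heap_insert(IndexedHeap *h, int id, long key)
> {
>     int i = h->size++;
>
>     h->nodes[i].id = id;
>     h->nodes[i].key = key;
>     h->pos[id] = i;
>     sift_up(h, i);
> }
>
> /* O(log n): the position map avoids an O(n) scan for the node. */
> static void
> heap_update_key(IndexedHeap *h, int id, long newkey)
> {
>     int  i = h->pos[id];
>     long oldkey = h->nodes[i].key;
>
>     h->nodes[i].key = newkey;
>     if (newkey > oldkey)
>         sift_up(h, i);
>     else
>         sift_down(h, i);
> }
>
> /* O(log n) removal of an arbitrary entry, again via the map. */
> static void
> heap_remove(IndexedHeap *h, int id)
> {
>     int i = h->pos[id];
>
>     swap_nodes(h, i, h->size - 1);
>     h->size--;
>     h->pos[id] = -1;
>     if (i < h->size)
>     {
>         sift_up(h, i);
>         sift_down(h, i);
>     }
> }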
>
> That way, in terms of the reorderbuffer, we can update and remove a
> transaction's memory usage in O(log n) (worst case; O(1) on average)
> and then pick the largest transaction in O(1). Since we might need to
> call ReorderBufferSerializeTXN() even in the non-streaming case, we
> need to maintain the binaryheap anyway. I've attached the patch for
> that (0003).
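>
> For illustration only, continuing the sketch above: each decoded
> change would bump the owning transaction's key via heap_update_key(),
> and the eviction path could look roughly like this
> (serialize_txn_to_disk() and the other names are hypothetical; the
> real patches work against binaryheap and reorderbuffer):
>
> /* Hypothetical: spills the transaction, returns the bytes freed. */
> extern long serialize_txn_to_disk(int txn_id);
>
> static void
> evict_until_under_limit(IndexedHeap *heap, long *total_size,
>                         long limit)
> {
>     while (*total_size >= limit && heap->size > 0)
>     {
>         int victim = heap->nodes[0].id; /* largest txn, O(1) */
>
>         *total_size -= serialize_txn_to_disk(victim);
>         heap_remove(heap, victim);      /* O(log n) */
>     }
> }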
>
> Here is a test script for the many-subtransactions case (each
> begin ... exception block in the loop opens a subtransaction):
>
> create table test (c int);
> create or replace function testfn (cnt int) returns void as $$
> begin
>   for i in 1..cnt loop
>     begin
>       insert into test values (i);
>     exception when division_by_zero then
>       raise notice 'caught error';
>       return;
>     end;
>   end loop;
> end;
> $$
> language plpgsql;
> select pg_create_logical_replication_slot('s', 'test_decoding');
> select testfn(50000);
> set logical_decoding_work_mem to '4MB';
> select count(*) from pg_logical_slot_peek_changes('s', null, null);
>
> and here are results:
>
> * HEAD: 16877.281 ms
> * HEAD w/ patches (0001 and 0002): 655.154 ms
>
> There is a huge improvement in the many-subtransactions case.

I have run the same test and found around a 12.53x improvement (the
median of five executions):
HEAD      | HEAD + v2-0001 + v2-0002 + v2-0003 patches
29197 ms  | 2329 ms

I have also run the regression test that you shared at [1]; there was
a very slight dip in this case: it takes around 0.31% more time:
HEAD      | HEAD + v2-0001 + v2-0002 + v2-0003 patches
4459 ms   | 4473 ms

The machine used has 755.536 GB of total memory, 120 CPUs, and runs
the RHEL 7 operating system. Please also find the detailed info of the
performance machine attached below.

[1] - 
https://www.postgresql.org/message-id/CAD21AoB-7mPpKnLmBNfzfavG8AiTwEgAdVMuv%3DjzmAp9ex7eyQ%40mail.gmail.com

Thanks and Regards,
Shubham Khanna.
MemTotal: 755.536 GB
MemFree: 748.281 GB
MemAvailable: 748.581 GB
Buffers: 0.002 GB
Cached: 1.366 GB
SwapCached: 0.000 GB
Active: 0.834 GB
Inactive: 0.745 GB
Active(anon): 0.221 GB
Inactive(anon): 0.028 GB
Active(file): 0.614 GB
Inactive(file): 0.717 GB
Unevictable: 0.000 GB
Mlocked: 0.000 GB
SwapTotal: 4.000 GB
SwapFree: 4.000 GB
Dirty: 0.000 GB
Writeback: 0.000 GB
AnonPages: 0.212 GB
Mapped: 0.142 GB
Shmem: 0.038 GB
Slab: 0.468 GB
SReclaimable: 0.139 GB
SUnreclaim: 0.329 GB
KernelStack: 0.020 GB
PageTables: 0.023 GB
NFS_Unstable: 0.000 GB
Bounce: 0.000 GB
WritebackTmp: 0.000 GB
CommitLimit: 381.768 GB
Committed_AS: 0.681 GB
VmallocTotal: 32768.000 GB
VmallocUsed: 1.852 GB
VmallocChunk: 32189.748 GB
Percpu: 0.035 GB
HardwareCorrupted: 0.000 GB
AnonHugePages: 0.025 GB
CmaTotal: 0.000 GB
CmaFree: 0.000 GB
HugePages_Total: 0.000 GB
HugePages_Free: 0.000 GB
HugePages_Rsvd: 0.000 GB
HugePages_Surp: 0.000 GB
Hugepagesize: 0.002 GB
DirectMap4k: 0.314 GB
DirectMap2M: 6.523 GB
DirectMap1G: 763.000 GB
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 62
model name      : Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz
stepping        : 7
microcode       : 0x715
cpu MHz         : 1762.646
cache size      : 38400 KB
physical id     : 0
siblings        : 30
core id         : 0
cpu cores       : 15
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb 
rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology 
nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est 
tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt 
tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm epb intel_ppin ssbd ibrs 
ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt 
dtherm ida arat pln pts md_clear spec_ctrl intel_stibp flush_l1d
bogomips        : 5586.07
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
NAME="Red Hat Enterprise Linux Server"
VERSION="7.8 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.8"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.8 (Maipo)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.8:GA:server"
HOME_URL="https://www.redhat.com/";
BUG_REPORT_URL="https://bugzilla.redhat.com/";

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.8
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.8"
