Re: [GENERAL] Logical decoding CPU-bound w/ large number of tables
Hi Tom, Andres,

I've taken your patches, Tom, and applied them to a test deployment of my actual application. The most accessible way I have to reproduce this issue is to run a maintenance task that we typically run during a software deployment, which removes some tenant schemas from our database system. I ran that task with and without the patches, and monitored how long logical replication stalled. This test was on a database with 22 schemas containing 400 tables each (~8,800 tables).

As a baseline, PostgreSQL 9.5.6 stalled completely for 21m 13s. PostgreSQL 9.5.6 + all three patches stalled completely for 4m 11s. Definitely a fantastic improvement. :-)

This test scenario (performing the maintenance task) is unfortunately very similar to the synthetic test... but it's the most consistent manner I've found to reproduce this stall. I've observed the replication stall in production at somewhat arbitrary times, so I'm not sure if there may be different performance profiles in more production-like usage.

Here's a CPU performance sampling from the 4m period w/ the patched PostgreSQL. This is from the `perf` tool on Linux, rather than the earlier samples from my development machine on OSX.

`perf report`: https://gist.github.com/mfenniak/9a3484c426a497f3903051d9fdf8b221
Raw perf data (49kb bzip2): https://www.dropbox.com/s/3jfxtg9kjzjztmp/linux-pg-walsender-patched.perf.data.bz2?dl=0

Mathieu

On Wed, May 10, 2017 at 3:10 PM, Andres Freund <and...@anarazel.de> wrote:
> Hi Tom, Mathieu,
>
> On 2017-05-10 17:02:11 -0400, Tom Lane wrote:
> > Mathieu Fenniak <mathieu.fenn...@replicon.com> writes:
> > > Andres, it seems like the problem is independent of having large data
> > > manipulations mixed with schema changes. The below test case
> > > demonstrates it with just schema changes.
> >
> > > Tom, I've tested your patch, and it seems to have a positive impact
> > > for sure.
> > > I've documented a test case to reproduce this issue (below), and
> > > your avoid-search-in-catcache-invalidate.patch reduces the test case
> > > time from 63 seconds per run to 27 seconds per run.
> >
> > Thanks for the test case.
>
> Indeed!
>
> > +  59.28%  59.05%  150349  postmaster  postgres  [.] hash_seq_search
> > +   8.40%   8.36%   21250  postmaster  postgres  [.] CallSyscacheCallbacks
> > +   6.37%   6.34%   16115  postmaster  postgres  [.] LocalExecuteInvalidationMessage
> > +   5.69%   5.67%   14418  postmaster  postgres  [.] CatCacheInvalidate
> > +   3.14%   3.12%    7942  postmaster  postgres  [.] SysCacheInvalidate
> > +   1.72%   1.71%    4354  postmaster  postgres  [.] ReorderBufferCommit
> > +   1.37%   1.33%    3512  postmaster  postgres  [.] hash_search_with_hash_value
> > +   1.15%   1.15%    2913  postmaster  postgres  [.] InvalidateCatalogSnapshot
> >
> > I looked at the hash_seq_search time a bit more, and realized that
> > actually there's a pretty easy fix for that, which is to reduce
> > the initial size of RelfilenodeMapHash from 1024 entries to 64.
>
> Heh, that's not a bad idea. Given the table resizes automatically, there
> seems little reason not to go there.
>
> > You could quibble about where to set that exactly, but 1024 is
> > just way too many --- in your example, there are never more than
> > 5 entries in the hash, despite the presence of 100,000 tables in
> > the database.
>
> That's likely because there's no DML.
>
> > We're at a point of diminishing returns here; I think any further
> > improvement would require reducing the number of invalidation calls,
> > as Andres was speculating about upthread. Still, this shows that
> > it doesn't take very much work to get a 10X improvement in the
> > overhead associated with inval activities. We've never seen this
> > overhead stick out quite this much before, and maybe logical replication
> > will always be an atypical workload, but I think this may be worth
> > committing even if Andres does manage to cut the number of calls.
>
> I've seen these callsites prominently in profiles not using logical
> decoding. Never quite as big as here, but still. Temp table heavy
> workloads IIRC are one of the easy ways to trigger it.
>
> > It would be interesting to see how much these patches help for your real
> > use-case, as opposed to this toy example. Assuming that the res
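Tom's observation above works because a sequential scan of a hash table (as in hash_seq_search) must visit every bucket whether occupied or not, so a table sized for 1024 entries but holding ~5 does roughly 16x the scan work of one sized at 64. A toy Python sketch of that bucket-scan cost (this is only an illustration, not PostgreSQL's actual dynahash implementation):

```python
# Toy bucket-array hash table, illustrating why a sequential scan
# (like PostgreSQL's hash_seq_search) over a sparsely filled table
# costs time proportional to the number of buckets, not entries.
# Illustration only -- not PostgreSQL's dynahash.

class BucketHash:
    def __init__(self, nbuckets):
        self.buckets = [[] for _ in range(nbuckets)]

    def insert(self, key, value):
        self.buckets[hash(key) % len(self.buckets)].append((key, value))

    def seq_search(self):
        """Return (buckets visited, all entries); every bucket must be checked."""
        visited = 0
        entries = []
        for bucket in self.buckets:
            visited += 1
            entries.extend(bucket)
        return visited, entries

# RelfilenodeMapHash scenario from the test: only ~5 live entries.
big = BucketHash(1024)   # old initial size
small = BucketHash(64)   # proposed initial size
for k in range(5):
    big.insert(k, "rel%d" % k)
    small.insert(k, "rel%d" % k)

big_visited, big_entries = big.seq_search()
small_visited, small_entries = small.seq_search()
print(big_visited, len(big_entries))     # 1024 buckets scanned for 5 entries
print(small_visited, len(small_entries)) # 64 buckets scanned for the same 5
```

Since invalidation processing performs such a scan per message, shrinking the initial bucket count directly cuts the per-invalidation cost; the table still grows automatically if it ever fills.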
Re: [GENERAL] Logical decoding CPU-bound w/ large number of tables
Hi Andres, Tom,

Andres, it seems like the problem is independent of having large data manipulations mixed with schema changes. The below test case demonstrates it with just schema changes.

Tom, I've tested your patch, and it seems to have a positive impact for sure. I've documented a test case to reproduce this issue (below), and your avoid-search-in-catcache-invalidate.patch reduces the test case time from 63 seconds per run to 27 seconds per run. Here's an updated sample capture with the patch applied: https://gist.github.com/mfenniak/cef36699deb426efd8e433027553a37e

Here's the test case I've been using for the above measurements:

a) create 100,000 tables in 200 schemas (500 tables per schema):

    DO $$
    DECLARE
        i INTEGER;
        j INTEGER;
        schema TEXT;
    BEGIN
        FOR i IN 1..200 LOOP
            schema := 'test_' || i;
            EXECUTE format('CREATE SCHEMA IF NOT EXISTS %I', schema);
            FOR j IN 1..500 LOOP
                EXECUTE format('CREATE TABLE IF NOT EXISTS %I.%I (f1 int, f2 int)',
                               schema, 'test_' || j);
            END LOOP;
        END LOOP;
    END;
    $$;

b) start logical replication w/ pg_recvlogical, printing out timestamps of messages and timestamps of transactions:

    pg_recvlogical --create-slot --start -f - --if-not-exists \
        --plugin=test_decoding --slot=cputest \
        --dbname=mathieu.fenniak --username=postgres \
        --option=include-timestamp \
        | gawk '{ print strftime("%Y-%m-%d %H:%M:%S"), $0; fflush(); }'

c) drop a few schemas in a transaction:

    begin;
    drop schema test_1 cascade;
    drop schema test_2 cascade;
    drop schema test_3 cascade;
    drop schema test_4 cascade;
    drop schema test_5 cascade;
    drop schema test_6 cascade;
    commit;

Six schema drops had a 63 second delay between the actual commit and pg_recvlogical outputting it w/ stock 9.5.4:

    2017-05-10 08:35:12 BEGIN 956
    2017-05-10 08:36:15 COMMIT 956 (at 2017-05-10 08:35:12.106773-06)

Six schema drops w/ Tom's patch (applied to 9.5.4) had a 27 second delay:

    2017-05-10 08:48:08 BEGIN 942
    2017-05-10 08:48:35 COMMIT 942 (at 2017-05-10 08:48:08.182015-06)

(other test details: wal_level=logical; max_wal_senders=5;
wal_sender_timeout = 3600s; max_locks_per_transaction = 65536; max_replication_slots = 5)

On Fri, May 5, 2017 at 7:38 PM, Andres Freund <and...@anarazel.de> wrote:
> On 2017-05-05 21:32:27 -0400, Tom Lane wrote:
> > Andres Freund <and...@anarazel.de> writes:
> > > On 2017-05-05 14:24:07 -0600, Mathieu Fenniak wrote:
> > >> It appears that most of the time is spent in the
> > >> RelfilenodeMapInvalidateCallback and CatalogCacheIdInvalidate cache
> > >> invalidation callbacks, both of which appear to be invalidating caches
> > >> based upon the cache value.
> >
> > > I think optimizing those has some value (and I see Tom is looking at
> > > that aspect), but the bigger thing would probably be to do fewer lookups.
> >
> > I'm confused --- the lookup side of things is down in the noise in
> > Mathieu's trace.
>
> Err, sorry. Completely mangled that sentence. Executing fewer
> invalidations. We currently are likely re-executing the same set of
> invalidations constantly in Mathieu's case.
>
> Background: When decoding a transaction during logical decoding we're
> currently re-executing *all* of a transaction's own cache invalidations,
> if it has any, at every new command-id observed in the WAL stream. That's
> because currently invalidations are only sent at commit, so we don't
> know from "when" they are. But I think there's some very low-hanging
> fruit in reducing the frequency at which those are executed.
>
> In many cases where there's just a few schema changes in a transaction,
> this doesn't hurt badly. But if you have a transaction that does a
> boatload of schema changes *and* has a lot of other changes, it gets
> expensive.
>
> Mathieu: The above also indicates a possible workaround: you can try
> separating larger amounts of data manipulation from schema changes,
> into separate transactions.
>
> Greetings,
>
> Andres Freund
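The BEGIN/COMMIT delays quoted above can be computed automatically from the gawk-timestamped pg_recvlogical output in step (b). A small hypothetical Python helper (the `stall_seconds` function is my own; the line format it parses matches the sample output in this message):

```python
# Measure the decoding stall: seconds between when pg_recvlogical's BEGIN
# line arrived and when the matching COMMIT line arrived, using the
# wall-clock timestamps prepended by the gawk stage in step (b).
# Hypothetical helper; line format matches the samples in this email.
from datetime import datetime

def stall_seconds(lines):
    """Map xid -> seconds between the observed BEGIN and COMMIT lines."""
    fmt = "%Y-%m-%d %H:%M:%S"
    begins, stalls = {}, {}
    for line in lines:
        parts = line.split()
        ts = datetime.strptime(parts[0] + " " + parts[1], fmt)
        if parts[2] == "BEGIN":
            begins[parts[3]] = ts
        elif parts[2] == "COMMIT" and parts[3] in begins:
            stalls[parts[3]] = (ts - begins[parts[3]]).total_seconds()
    return stalls

# The stock-9.5.4 sample from this message: a 63 second stall on xid 956.
sample = [
    "2017-05-10 08:35:12 BEGIN 956",
    "2017-05-10 08:36:15 COMMIT 956 (at 2017-05-10 08:35:12.106773-06)",
]
print(stall_seconds(sample))  # {'956': 63.0}
```

Feeding the live gawk output through this (e.g. via stdin) gives a per-transaction stall figure without eyeballing timestamps.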
[GENERAL] Logical decoding CPU-bound w/ large number of tables
Heyo,

I'm attempting to use logical decoding with the streaming replication protocol to perform change-data-capture on PostgreSQL 9.5.4. I'm seeing the replication stream "stall" for long periods of time, where the walsender process will be pinned at 100% CPU utilization but no data is being sent to my client. The stalls occur unpredictably on my production system, but generally seem to be correlated with schema operations. My source database has about 100,000 tables; it's a one-schema-per-tenant multi-tenant SaaS system.

I've reproduced the same symptoms with two different approaches on my local machine. With both, I have a replication client connected via the streaming protocol.

In reproduction approach 1, I've created a thread that inserts small sets of data, and a thread that creates a schema w/ 500 tables and then drops it. This approach has pinned CPU usage, but data does come out of it -- just excruciatingly slowly when compared to the same test without the schema create & drop.

In reproduction approach 2, I've created a database w/ 100,000 tables on it and performed a "vacuum full". The walsender goes to 100% CPU and no data comes out of the replication stream for hours.

I've performed a CPU sampling with the OSX `sample` tool based upon reproduction approach #1: https://gist.github.com/mfenniak/366d7ed19b2d804f41180572dc1600d8

It appears that most of the time is spent in the RelfilenodeMapInvalidateCallback and CatalogCacheIdInvalidate cache invalidation callbacks, both of which appear to be invalidating caches based upon the cache value.

Has anyone else run into this kind of performance problem? Any thoughts on how it might be resolved? I don't mind putting in the work, if someone could describe what is happening here and have a discussion with me about what kind of changes might be necessary to improve the performance.
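For anyone wanting to replay reproduction approach 1, the schema-churn half can be scripted. A hedged sketch that just generates the DDL text (the schema name, table names, and column layout are my assumptions, chosen to mirror the description above; run the output through psql or any driver while a second session does small inserts):

```python
# Generate the create-then-drop DDL for reproduction approach 1:
# one schema with 500 two-column tables, created and then dropped.
# This only builds the SQL text; names are illustrative.

def churn_statements(schema="churn_test", ntables=500):
    stmts = ["CREATE SCHEMA %s;" % schema]
    for i in range(1, ntables + 1):
        stmts.append("CREATE TABLE %s.t_%d (f1 int, f2 int);" % (schema, i))
    stmts.append("DROP SCHEMA %s CASCADE;" % schema)
    return stmts

stmts = churn_statements()
print(len(stmts))   # 502: 1 CREATE SCHEMA + 500 CREATE TABLE + 1 DROP
print(stmts[0])
print(stmts[-1])
```

Looping this script's output against the database while a replication client is attached should reproduce the pinned-CPU, slow-output behaviour described above.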
Thanks all,

*Mathieu Fenniak* | Senior Software Architect | Phone 1-587-315-1185
*Replicon* | The leader in cloud time tracking applications - 7,800+ Customers - 70+ Countries - 1.5 Million Users
www.replicon.com | facebook <http://www.facebook.com/Replicon.inc> | linkedin <http://www.linkedin.com/company/27961?trk=tyah> | twitter <http://twitter.com/Replicon> | blog <http://www.replicon.com/blog/> | contact us <http://www.replicon.com/about_replicon/contact_us.aspx>
*We are hiring!* | search jobs <http://www.replicon.com/careers>
[GENERAL] fast-archiver tool, useful for pgsql DB backups
Hi pgsql-general,

Has anyone else ever noticed how slow it can be to rsync or tar a pgdata directory with hundreds of thousands or millions of files? I thought this could be done faster with a bit of concurrency, so I wrote a little tool called fast-archiver to do so. My employer (Replicon) has allowed me to release this tool under an open source license, so I wanted to share it with everyone.

fast-archiver is written in Go, and makes use of Go's awesome concurrency capabilities to read and write files in parallel. When you've got lots of small files, this makes a big throughput improvement. For a 90GB PostgreSQL database with over 2,000,000 data files, fast-archiver can create an archive in 27 minutes, as compared to tar in 1hr 23min. Piped over an ssh connection, fast-archiver can transfer and write the same dataset on a gigabit network in 1hr 20min, as compared to rsync taking 3hrs for the same transfer.

fast-archiver is available at GitHub: https://github.com/replicon/fast-archiver

I hope this is useful to others. :-)

Mathieu

    $ time fast-archiver -c -o /dev/null /db/data
    skipping symbolic link /db/data/pg_xlog
    1008.92user 663.00system 27:38.27elapsed 100%CPU (0avgtext+0avgdata 24352maxresident)k
    0inputs+0outputs (0major+1732minor)pagefaults 0swaps

    $ time tar -cf - /db/data | cat > /dev/null
    tar: Removing leading `/' from member names
    tar: /db/data/base/16408/12445.2: file changed as we read it
    tar: /db/data/base/16408/12464: file changed as we read it
    32.68user 375.19system 1:23:23elapsed 8%CPU (0avgtext+0avgdata 81744maxresident)k
    0inputs+0outputs (0major+5163minor)pagefaults 0swaps
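The concurrency win described above boils down to a parallel-read, serial-write pattern: many workers read files concurrently while a single consumer serializes them into one output stream. A simplified illustration in Python (fast-archiver itself is Go and uses a real archive format; the length-prefixed format here is purely illustrative):

```python
# Parallel-read, serial-write: the core pattern behind fast-archiver.
# Worker threads read file contents concurrently (useful when per-file
# latency dominates, e.g. millions of small files); the main thread
# serializes the results, in order, into one archive stream.
# Simplified illustration only; the archive format is made up.
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def read_file(path):
    with open(path, "rb") as f:
        return f.read()

def archive(paths, out, workers=8):
    """Write a trivial length-prefixed archive of `paths` to `out`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Reads run in parallel; pool.map yields results in input order,
        # so the archive layout stays deterministic.
        for path, data in zip(paths, pool.map(read_file, paths)):
            name = os.path.basename(path).encode()
            out.write(len(name).to_bytes(4, "big") + name)
            out.write(len(data).to_bytes(4, "big") + data)

# Demo on a throwaway directory: ten small files of 0..9 bytes.
tmp = tempfile.mkdtemp()
paths = []
for i in range(10):
    p = os.path.join(tmp, "f%d" % i)
    with open(p, "wb") as f:
        f.write(b"x" * i)
    paths.append(p)

archive_path = os.path.join(tmp, "out.archive")
with open(archive_path, "wb") as out:
    archive(paths, out)

# 10 files x (4 + 2-byte name + 4) header bytes + 45 data bytes = 145.
size = os.path.getsize(archive_path)
print(size)  # 145
```

The same structure scales to a streaming pipeline (reader pool feeding a writer over a pipe or ssh), which is essentially what the numbers above measure.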