On Thu, Jul 14, 2022 at 21:26, Aleš Zelený <zeleny.a...@gmail.com> wrote:

>
> On Wed, Jul 13, 2022 at 2:20, Michael Paquier <mich...@paquier.xyz> wrote:
>
>> On Mon, Jul 11, 2022 at 10:50:23AM +0200, Aleš Zelený wrote:
>> > So far, it has happened three times (during a single week) from the
>> 14.3 ->
>> > 14.4 upgrade, before 14.4 we haven't suffered from such an issue.
>> >
>> > Questions:
>> > 1)  Can we safely downgrade from 14.4 to 14.3 by shutting down the
>> instance
>> > and reinstalling 14.3 PG packages (to prove, that the issue disappear)?
>> > 2) What is the best way to diagnose what is the root cause?
>>
>> Hmm.  14.4 has nothing in its release notes that would point to a
>> change in the vacuum or autovacuum's code paths:
>> https://www.postgresql.org/docs/14/release-14-4.html#id-1.11.6.5.4
>>
>> There is nothing specific after a look at the changes, and I am not
>> seeing anything that would imply a change in memory context handling
>> either:
>> `git log --stat REL_14_3..REL_14_4`
>> `git diff REL_14_3..REL_14_4 -- *.c`
>>
>> That said, you should be able to downgrade safely, as there are no
>> changes in the WAL format or the like that would break things.  Note,
>> though, that the corruption issue caused by CONCURRENTLY is something
>> you'd still have to face.
>>
>>
> Thanks, good to know. We can use that for a test case, since we already
> hit the CONCURRENTLY bug on 14.3.
>
>> > 2022-07-02 14:48:07 CEST [3930]: [3-1] user=,db=,host=,app= ERROR:  out of memory
>> > 2022-07-02 14:48:07 CEST [3930]: [4-1] user=,db=,host=,app= DETAIL:  Failed on request of size 152094068 in memory context "TopTransactionContext".
>> > 2022-07-02 14:48:07 CEST [3930]: [5-1] user=,db=,host=,app= CONTEXT:  automatic vacuum of table "prematch.replication.tab_queue_tmp"
>>
>> This is the interesting part.  Do you happen to use logical
>> replication in a custom C++ plugin?
>>
>
> We are using logical replication to other instances (pgoutput) and
> decoderbufs (https://github.com/debezium/postgres-decoderbufs) for other
> applications.
>
>
>> > 2022-07-02 14:48:47 CEST [4476]: [43-1] user=,db=,host=,app= LOG:  could not fork worker process: Cannot allocate memory
>> > terminate called after throwing an instance of 'std::bad_alloc'
>> >   what():  std::bad_alloc
>> >
>> > DETAIL: parameters: $1 = '1', $2 = '1748010445', $3 = '0', $4 = '1000'
>> > terminate called after throwing an instance of 'std::bad_alloc' terminate called after throwing an instance of 'std::bad_alloc' what(): what():  std::bad_allocstd::bad_alloc
>> > 2022-07-08 14:54:23 CEST [4476]: [49-1] user=,db=,host=,app= LOG:  background worker "parallel worker" (PID 25251) was terminated by signal 6: Aborted
>> > 2022-07-08 14:54:23 CEST [4476]: [51-1] user=,db=,host=,app= LOG:  terminating any other active server processes
>>
>> Looks like something is going wrong in the memory handling of one of
>> your C++ extensions here.  If you can isolate an issue using a query
>> without any custom code, that would be a Postgres problem, but I think
>> that you are missing a trick in it.
>>
>
> Here are extensions installed in the database served by the cluster:
> prematch=# \dx
>                                             List of installed extensions
>         Name        | Version |   Schema   |                               Description
> --------------------+---------+------------+------------------------------------------------------------------------
>  amcheck            | 1.3     | public     | functions for verifying relation integrity
>  dblink             | 1.2     | public     | connect to other PostgreSQL databases from within a database
>  file_fdw           | 1.0     | public     | foreign-data wrapper for flat file access
>  hstore             | 1.8     | public     | data type for storing sets of (key, value) pairs
>  hypopg             | 1.3.1   | public     | Hypothetical indexes for PostgreSQL
>  pageinspect        | 1.9     | public     | inspect the contents of database pages at a low level
>  pg_buffercache     | 1.3     | public     | examine the shared buffer cache
>  pg_stat_kcache     | 2.2.0   | public     | Kernel statistics gathering
>  pg_stat_statements | 1.9     | public     | track planning and execution statistics of all SQL statements executed
>  pgcrypto           | 1.3     | public     | cryptographic functions
>  pgstattuple        | 1.5     | public     | show tuple-level statistics
>  plpgsql            | 1.0     | pg_catalog | PL/pgSQL procedural language
>  plpgsql_check      | 2.1     | public     | extended check for plpgsql functions
>  postgres_fdw       | 1.1     | public     | foreign-data wrapper for remote PostgreSQL servers
>  tablefunc          | 1.0     | public     | functions that manipulate whole tables, including crosstab
> (15 rows)
>
> When we started experiencing these issues, based on Prometheus
> node-exporter statistics, I began collecting /proc/<pid>/status for all
> PG instance processes in an infinite loop.
> From this I can extract some data. It looks like RssAnon is the memory I
> can see growing, from several hundred kB to 1-2 GB over some period
> (depending on workload, but usually 24 hours).
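>
> For reference, a minimal sketch of the collection loop I mean (Python; the
> process-name match and the one-minute interval here are illustrative
> assumptions, not exactly what runs on the server):
>
> import glob
> import time
>
> def rss_anon_kb(pid):
>     """Return RssAnon in kB for a PID, or None if the process is gone."""
>     try:
>         with open(f"/proc/{pid}/status") as f:
>             for line in f:
>                 if line.startswith("RssAnon:"):
>                     return int(line.split()[1])  # /proc reports the value in kB
>     except OSError:
>         return None
>     return None
>
> def postgres_pids():
>     """PIDs whose command name is 'postgres' (assumes default process naming)."""
>     pids = []
>     for path in glob.glob("/proc/[0-9]*/comm"):
>         try:
>             with open(path) as f:
>                 comm = f.read().strip()
>         except OSError:
>             continue  # process exited between listing and reading
>         if comm == "postgres":
>             pids.append(path.split("/")[2])
>     return pids
>
> while True:
>     ts = time.strftime("%Y-%m-%d %H:%M:%S")
>     for pid in postgres_pids():
>         kb = rss_anon_kb(pid)
>         if kb is not None:
>             print(f"{ts},{pid},{kb}")  # timestamp, pid, RssAnon in kB
>     time.sleep(60)
>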
> The sessions whose RssAnon memory grows belong to PHP client applications;
> with multiple connections they might consume 10 GB of memory. It is the
> same application as before (I've asked about deployed changes, but none
> was released close to the upgrade time).
> Comparing with Prometheus server data, RssAnon seems to directly follow
> the node_memory_Active_anon_bytes node-exporter metric. If the "suspected"
> sessions disconnect, node_memory_Active_anon_bytes drops.
> I've checked the node_memory_Active_anon_bytes metric before the upgrade,
> and (with the same application workload) it was pretty stable on PG 13.8
> and later on 14.3, but immediately after the upgrade
> node_memory_Active_anon_bytes started growing, and once the commit limit
> was reached, the fork()/out-of-memory issues appeared.
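>
> A quick way to watch how close the host is to that limit is to read
> CommitLimit and Committed_AS from /proc/meminfo (again just a sketch; the
> node exporter exposes the same fields as node_memory_CommitLimit_bytes
> and node_memory_Committed_AS_bytes):
>
> def commit_usage():
>     """Return (CommitLimit, Committed_AS) in kB from /proc/meminfo."""
>     values = {}
>     with open("/proc/meminfo") as f:
>         for line in f:
>             key, rest = line.split(":", 1)
>             if key in ("CommitLimit", "Committed_AS"):
>                 values[key] = int(rest.split()[0])  # reported in kB
>     return values["CommitLimit"], values["Committed_AS"]
>
> limit_kb, committed_kb = commit_usage()
> print(f"Committed_AS is {100.0 * committed_kb / limit_kb:.1f}% of CommitLimit")
>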
>
> So for the application user running the PHP code, I've set
> log_min_duration_statement = 0, but no unexpected statements or arguments
> (some functions take an int array argument) were found in the logfile.
>
> Checking RssAnon from /proc/<pid>/status, I've found some points where
> RssAnon memory usage grew very steeply for a minute, but no "suspicious"
> queries/arguments were found in the instance logfile.
>
> Any hint on how to find the root cause would be appreciated, since so far
> I've failed to isolate the issue into a reproducible test case.
> At least I hope that watching the RssAnon process memory is an
> appropriate metric; if not, let me know and I'll try to update the
> monitoring accordingly.
>
> I can imagine a workaround where the client application reconnects
> regularly..., but up to 14.3 it worked, so I'd like to fix the issue
> either on our application side or on the PG side if it is a PG problem.
>

Theoretically, there can be a memory leak in libpq; there were some changes there.

Regards

Pavel


> Kind regards, Ales Zeleny
>
> --
>> Michael
>>
>
