On Tue, Jan 23, 2018 at 5:59 AM, Rambabu V <ram.wis...@gmail.com> wrote:
> > cat PostgreSQL-2018-01-23_060000.csv|grep FATAL
What about ERROR, not just FATAL? And grep for "out of memory" as well.
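Something like this, against whichever log file covers the failure window,
would catch all of those at once:

  grep -E 'ERROR|FATAL|out of memory' PostgreSQL-2018-01-23_060000.csv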
>> $ free -mh
>>              total   used   free  shared  buffers  cached
>> Mem:           58G    58G   358M     16G     3.6M     41G
>> -/+ buffers/cache:     16G    42G
>> Swap:         9.5G   687M   8.9G
This does not seem like it should be a problem: the -/+ buffers/cache line
shows about 42G effectively free. Was this data collected near the time of
the failure?
> work_mem = 256MB # min 64kB
>> max_connections = 600
These look pretty high, especially in combination. Why do you need that
number of connections? Could you use a connection pooler instead? Or do
you just have an application bug (leaked connection handles) that needs to
be fixed? Why do you need that amount of work_mem?
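To put numbers on it: work_mem is a per-sort/per-hash limit, not a
per-server one, so the theoretical worst case here is

  600 connections * 256MB = ~150GB

of work_mem alone, against 58G of RAM, and a single complex query can use
several multiples of work_mem at once (one per sort or hash node). Even a
modest fraction of connections running big sorts simultaneously could push
you into trouble.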
> ps -ef|grep postgres|grep idle|wc -l
>> ps -ef|grep postgres|wc -l
How close to the time of the problem was this recorded? How many of the
idle connections are 'idle in transaction'?
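If you can, query pg_stat_activity rather than grepping ps; it
distinguishes the states directly:

  SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY 2 DESC;

(assuming 9.2 or later, where the state column exists).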
>> PID    USER     PRI NI VIRT  RES   SHR   S CPU% MEM% TIME+    Command
>> 109063 postgres 20  0  16.7G 16.4G 16.3G S 0.0  27.8 39:55.61 postgres: test sss 10.20.2.228(55174) idle
>> 24910  postgres 20  0  16.7G 16.4G 16.3G S 0.0  27.8 27:45.35 postgres: testl sss 10.20.2.228(55236) idle
>> 115539 postgres 20  0  16.7G 16.4G 16.3G S 0.0  27.8 28:22.89 postgres: test sss 10.20.2.228(55184) idle
>> 9816   postgres 20  0  16.7G 16.4G 16.3G S 0.0  27.8 40:19.57 postgres: test sss 10.20.2.228(55216) idle
How close to the time of the problem was this recorded? Nothing here seems
to be a problem: RES is 16.4G but SHR is 16.3G, so almost all the memory
these backends have resident is shared buffers, which are counted once for
the whole server, not once per process.
It looks like all your clients decide to run a memory-hungry query
simultaneously, consume a lot of work_mem, and cause a problem. Then by
the time you notice the problem and start collecting information, they are
done and things are back to normal.
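If that's what's happening, you pretty much have to record continuously and
look back after the next failure. A crude sketch (file name and interval
are arbitrary, adjust to taste):

  while true; do
    date
    free -m
    ps -u postgres -o pid,state,rss,cmd --sort=-rss | head -20
    sleep 5
  done >> /tmp/pg_mem_watch.log 2>&1 &

Combined with a low log_min_duration_statement, that should show whether a
burst of big queries lines up with the memory spike.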