On Tue, Jan 23, 2018 at 5:59 AM, Rambabu V <ram.wis...@gmail.com> wrote:
> cat PostgreSQL-2018-01-23_060000.csv | grep FATAL

What about ERROR, not just FATAL?  Or grep for "out of memory" as well (a
sample command is at the end of this message).

> $ free -mh
>              total       used       free     shared    buffers     cached
> Mem:           58G        58G       358M        16G       3.6M        41G
> -/+ buffers/cache:        16G        42G
> Swap:         9.5G       687M       8.9G

This does not seem like it should be a problem.  Was this data collected
near the time of the failure?

> work_mem = 256MB                # min 64kB
> max_connections = 600

These look pretty high, especially in combination: 600 connections each
allowed 256MB of work_mem (and a single query can use several work_mem
allocations, one per sort or hash node) comes to roughly 150GB in the
worst case, far more than the 58G of RAM on the box.  Why do you need
that number of connections?  Could you use a connection pooler instead?
Or do you just have an application bug (leaked connection handles) that
needs to be fixed?  Why do you need that amount of work_mem?

> ps -ef | grep postgres | grep idle | wc -l
> 171
>
> ps -ef | grep postgres | wc -l
> 206

How close to the time of the problem was this recorded?  How many of the
idle connections are 'idle in transaction'?  (A pg_stat_activity query
for that is sketched at the end of this message.)

>    PID USER     PRI NI  VIRT   RES   SHR S CPU% MEM%    TIME+ Command
> 109063 postgres  20  0 16.7G 16.4G 16.3G S  0.0 27.8 39:55.61 postgres: test sss 10.20.2.228(55174) idle
>  24910 postgres  20  0 16.7G 16.4G 16.3G S  0.0 27.8 27:45.35 postgres: testl sss 10.20.2.228(55236) idle
> 115539 postgres  20  0 16.7G 16.4G 16.3G S  0.0 27.8 28:22.89 postgres: test sss 10.20.2.228(55184) idle
>   9816 postgres  20  0 16.7G 16.4G 16.3G S  0.0 27.8 40:19.57 postgres: test sss 10.20.2.228(55216) idle

How close to the time of the problem was this recorded?  Nothing here
seems to be a problem, because almost all of the memory these backends
have resident is shared memory.

It looks like all your clients decide to run a memory-hungry query
simultaneously, consume a lot of work_mem, and cause a problem.  Then by
the time you notice the problem and start collecting information, they
are done and things are back to normal.

Cheers,

Jeff
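
P.S. For the log scan, something along these lines should catch the
interesting lines (the file name here is just the one from your example;
use whichever file covers the time of the failure):

    grep -E 'FATAL|ERROR|out of memory' PostgreSQL-2018-01-23_060000.csv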
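
P.P.S. For the idle-in-transaction count, ps output is easy to misread; a
query against pg_stat_activity, run while the problem is actually
happening, is more direct.  Something like this (the "state" column exists
in 9.2 and later):

    psql -X -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY 2 DESC"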