On Wed, Jun 7, 2017 at 3:30 PM, Andres Freund <and...@anarazel.de> wrote:
>
> On June 7, 2017 11:29:28 AM PDT, "Fabrízio de Royes Mello" <fabriziome...@gmail.com> wrote:
> >On Fri, Jun 2, 2017 at 6:37 PM, Fabrízio de Royes Mello <fabriziome...@gmail.com> wrote:
> >>
> >> On Fri, Jun 2, 2017 at 6:32 PM, Fabrízio de Royes Mello <fabriziome...@gmail.com> wrote:
> >> >
> >> > Hi all,
> >> >
> >> > This week I faced an out-of-disk-space problem in an 8TB production cluster. During the investigation we noticed that pg_replslot was the culprit, growing by more than 1TB in less than one hour.
> >> >
> >> > We're using PostgreSQL 9.5.6 with pglogical 1.2.2, replicating to a new 9.6 instance, and are planning the upgrade soon.
> >> >
> >> > What did I do? I freed some disk space, just enough to start PostgreSQL and begin the investigation. During startup recovery the files inside pg_replslot were removed entirely, so our out-of-disk-space problem disappeared. The server then came up and the physical slaves attached to the master normally, but the logical slaves did not; they stayed stalled in the 'catchup' state.
> >> >
> >> > At that point the "pg_replslot" directory started growing fast again, which forced us to drop the logical replication slot, and we lost the logical slave.
> >> >
> >> > After googling for a while I found this thread [1] about a similar issue, reported by Dmitriy Sarafannikov and replied to by Andres and Álvaro.
> >> >
> >> > I ran the test case provided by Dmitriy [1] against these branches:
> >> > - REL9_4_STABLE
> >> > - REL9_5_STABLE
> >> > - REL9_6_STABLE
> >> > - master
> >> >
> >> > In all tests the issue remains, including with the new logical replication feature (CREATE PUBLICATION/CREATE SUBSCRIPTION). Only after a restart was "pg_replslot" properly cleaned. The typo in ReorderBufferIterTXNInit that Dmitriy reported has been fixed, but the issue remains.
> >> >
> >> > It seems no one complained about this issue again and the thread was lost.
> >> >
> >> > Attached is a reworked version of Dmitriy's patch that seems to solve the issue. I confess I don't know enough about the replication slot code to know whether it's the best solution.
> >> >
> >> > Regards,
> >> >
> >> > [1] https://www.postgresql.org/message-id/1457621358.355011041%40f382.i.mail.ru
> >> >
> >>
> >> Just adding Dmitriy to the conversation... the previous email I provided was wrong.
> >>
> >
> >Does anyone have any thoughts about this critical issue?
> >
>
> I plan to look into it over the next few days.
>
Thanks...

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog: http://fabriziomello.github.io
>> Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello
>> Github: http://github.com/fabriziomello