> On 22 May 2018 at 20:59, Andres Freund <[email protected]> wrote:
> On 2018-05-22 20:54:46 +0200, Dmitry Dolgov wrote:
>> > On 22 May 2018 at 18:47, Andres Freund <[email protected]> wrote:
>> > On 2018-05-22 08:57:18 -0700, Andres Freund wrote:
>> >> Hi,
>> >>
>> >>
>> >> On 2018-05-22 17:37:28 +0200, Dmitry Dolgov wrote:
>> >> > Thanks for the patch. Out of curiosity I tried to play with it a bit.
>> >>
>> >> Thanks.
>> >>
>> >>
>> >> > `pgbench -i -s 100` actually hung on my machine, because the COPY
>> >> > process ended up waiting after `pg_uds_send_with_fd` had
>> >>
>> >> Hm, that had worked at some point...
>> >>
>> >>
>> >> > errno == EWOULDBLOCK || errno == EAGAIN
>> >> >
>> >> > as well as the checkpointer process.
>> >>
>> >> What do you mean by that last sentence?
>>
>> To investigate what was happening, I attached gdb to two processes: the
>> COPY process from pgbench and the checkpointer (since I assumed it
>> might be involved). Both were waiting in WaitLatchOrSocket right after
>> SendFsyncRequest.
>
> Huh? Checkpointer was in SendFsyncRequest()? Could you share the
> backtrace?
Well, that's what I've got from gdb:
#0  0x00007fae03fae9f3 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:84
#1  0x000000000077a979 in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7ffe37529ec0, cur_timeout=-1, set=0x23cddf8) at latch.c:1048
#2  WaitEventSetWait (set=set@entry=0x23cddf8, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7ffe37529ec0, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=0) at latch.c:1000
#3  0x000000000077ad08 in WaitLatchOrSocket (latch=latch@entry=0x0, wakeEvents=wakeEvents@entry=4, sock=8, timeout=timeout@entry=-1, wait_event_info=wait_event_info@entry=0) at latch.c:385
#4  0x00000000007152cb in SendFsyncRequest (request=request@entry=0x7ffe37529f40, fd=fd@entry=-1) at checkpointer.c:1345
#5  0x0000000000716223 in AbsorbAllFsyncRequests () at checkpointer.c:1207
#6  0x000000000079a5f0 in mdsync () at md.c:1339
#7  0x000000000079c672 in smgrsync () at smgr.c:766
#8  0x000000000076dd53 in CheckPointBuffers (flags=flags@entry=64) at bufmgr.c:2581
#9  0x000000000051c681 in CheckPointGuts (checkPointRedo=722254352, flags=flags@entry=64) at xlog.c:9079
#10 0x0000000000523c4a in CreateCheckPoint (flags=flags@entry=64) at xlog.c:8863
#11 0x0000000000715f41 in CheckpointerMain () at checkpointer.c:494
#12 0x00000000005329f4 in AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7ffe3752a220) at bootstrap.c:451
#13 0x0000000000720c28 in StartChildProcess (type=type@entry=CheckpointerProcess) at postmaster.c:5340
#14 0x0000000000721c23 in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:2875
#15 <signal handler called>
#16 0x00007fae03fa45b3 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:84
#17 0x0000000000722968 in ServerLoop () at postmaster.c:1679
#18 0x0000000000723cde in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x23a00e0) at postmaster.c:1388
#19 0x000000000068979f in main (argc=3, argv=0x23a00e0) at main.c:228
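
For what it's worth, frame #3 shows WaitLatchOrSocket() called with
latch=0x0 and wakeEvents=4 (WL_SOCKET_WRITEABLE), i.e. the checkpointer
is waiting for the fsync-request socket to become writeable, even though
it is the process that is supposed to be draining it. To make the
failure mode concrete, here is a minimal sketch of the send/retry
pattern I assume SendFsyncRequest() follows -- the function name,
signature and error handling are my guesses from the backtrace; only
sendmsg()/SCM_RIGHTS and WaitLatchOrSocket() are real APIs:

/*
 * Sketch only: non-blocking send of a request plus an optional file
 * descriptor over a unix domain socket, retrying when the socket
 * buffer is full.
 */
#include "postgres.h"

#include <string.h>
#include <sys/socket.h>

#include "storage/latch.h"

static void
send_with_fd_blocking(pgsocket sock, const void *req, size_t len, int fd)
{
    for (;;)
    {
        struct msghdr msg = {0};
        struct iovec iov;
        char        cmsgbuf[CMSG_SPACE(sizeof(int))];

        iov.iov_base = (void *) req;
        iov.iov_len = len;
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;

        if (fd >= 0)
        {
            /* attach the file descriptor as SCM_RIGHTS ancillary data */
            struct cmsghdr *cmsg;

            msg.msg_control = cmsgbuf;
            msg.msg_controllen = sizeof(cmsgbuf);
            cmsg = CMSG_FIRSTHDR(&msg);
            cmsg->cmsg_level = SOL_SOCKET;
            cmsg->cmsg_type = SCM_RIGHTS;
            cmsg->cmsg_len = CMSG_LEN(sizeof(int));
            memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
        }

        if (sendmsg(sock, &msg, 0) >= 0)
            return;             /* sent, done */

        if (errno != EWOULDBLOCK && errno != EAGAIN)
            ereport(ERROR,
                    (errcode_for_socket_access(),
                     errmsg("could not send fsync request: %m")));

        /*
         * Socket buffer full: wait for it to become writeable, i.e. for
         * the peer to read something.  This is the wait in frame #3.
         * If the peer is itself stuck in this loop rather than reading
         * -- which is what both backtraces showed -- nothing ever
         * drains the buffer and neither process wakes up again.
         */
        (void) WaitLatchOrSocket(NULL, WL_SOCKET_WRITEABLE, sock, -1, 0);
    }
}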
>> >> > Looks like with the default configuration and `max_wal_size=1GB`
>> >> > it writes to the socket faster than it reads from it, so the
>> >> > buffer eventually fills up.
>> >>
>> >> That's intended to then wake up the checkpointer immediately, so it can
>> >> absorb the requests. So something isn't right yet.
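
If I read mainline correctly, that is analogous to what
ForwardFsyncRequest() already does for the shared-memory queue: when
the queue is getting full, it pokes the checkpointer's latch so the
requests get absorbed right away. From memory it is roughly the
following, with `too_full` being the queue-more-than-half-full flag:

    /* wake up the checkpointer so it absorbs requests immediately */
    if (too_full && ProcGlobal->checkpointerLatch)
        SetLatch(ProcGlobal->checkpointerLatch);

Presumably the socket version needs the same kind of nudge on
EWOULDBLOCK/EAGAIN before the sender goes to sleep.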
>> >
>> > Doesn't hang here, but it's way too slow.
>>
>> Yep, in my case it was also getting slower, but eventually hung.
>>
>> > The reason is that I'd wrongly resolved a merge conflict. Attached is
>> > a fixup patch - does that address the issue for you?
>>
>> Hm... is that the right patch? It looks identical to what was already
>> committed in 8c3debbbf61892dabd8b6f3f8d55e600a7901f2b, so I can't
>> really apply it.
>
> Yea, sorry for that. Too many files in my patch directory... Right one
> attached.
Yes, this patch solves the problem, thanks.