The first thing I noticed was that my nameserver had gone.
I searched for the reason and found:
>Jul 15 04:04:52 <kern.crit> edge kernel: swap_pager_getswapspace(3): failed
< ... hundreds more of these ... >
>Jul 15 04:05:07 <kern.err> edge kernel: pid 47113 (named), uid 53, was
killed: out of swap space
That didn't make sense: the machine has plenty of swap space.
But since this repeated every other night, I started logging
ps output once a minute.
And so I found a postgres database backup going weird:
03:23 70 78433 78432 0 96 0 8220 4196 - R ?? 0:22.84
pg_dump -b
< ... >
03:49 70 78433 78432 0 96 0 8220 4024 - R ?? 17:06.61
pg_dump -b
03:50 70 78433 78432 0 96 0 8220 4024 - R ?? 17:46.15
pg_dump -b
03:51 70 78433 78432 0 96 0 8220 4024 - R ?? 18:26.69
pg_dump -b
03:52 70 78433 78432 0 47 0 139292 57888 select S ?? 18:37.65
pg_dump -b
03:53 70 78433 78432 0 48 0 139292 57828 select S ?? 18:40.36
pg_dump -b
03:54 70 78433 78432 0 -20 0 401436 69092 swread DL ?? 18:42.49
pg_dump -b
03:55 70 78433 78432 0 -20 0 401436 63232 swread DL ?? 18:43.99
pg_dump -b
That process starts with 8 MB of memory and stays that way for half
an hour; then, between 03:51 and 03:52, its memory usage suddenly explodes.
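
For anyone who wants to spot such a jump automatically, here is a minimal
sketch in Python. It assumes the column order seen in the ps log above
(timestamp added by my logger, then UID, PID, PPID, CPU, PRI, NI, VSZ, RSS, ...)
and that VSZ is in kilobytes; the function names and the growth factor of 2
are made up for illustration. The sample lines are copied from the log, with
the wrapped command rejoined onto one line.

```python
# Sample lines from the ps log above; the wrapped "pg_dump -b" is rejoined.
SAMPLES = """\
03:51 70 78433 78432 0 96 0 8220 4024 - R ?? 18:26.69 pg_dump -b
03:52 70 78433 78432 0 47 0 139292 57888 select S ?? 18:37.65 pg_dump -b
03:53 70 78433 78432 0 48 0 139292 57828 select S ?? 18:40.36 pg_dump -b
03:54 70 78433 78432 0 -20 0 401436 69092 swread DL ?? 18:42.49 pg_dump -b
"""


def parse(line):
    """Return (timestamp, pid, vsz_kb). Column positions are an assumption."""
    fields = line.split()
    return fields[0], int(fields[2]), int(fields[7])


def find_jumps(text, factor=2.0):
    """List (prev_time, time, prev_vsz, vsz) where VSZ grew by more than factor."""
    last = {}   # pid -> (timestamp, vsz) of the previous sample
    jumps = []
    for line in text.splitlines():
        if not line.strip():
            continue
        ts, pid, vsz = parse(line)
        if pid in last and vsz > factor * last[pid][1]:
            jumps.append((last[pid][0], ts, last[pid][1], vsz))
        last[pid] = (ts, vsz)
    return jumps


JUMPS = find_jumps(SAMPLES)
for prev_ts, ts, prev_vsz, vsz in JUMPS:
    print(f"{prev_ts} -> {ts}: VSZ {prev_vsz} KB -> {vsz} KB")
```

On the four sample lines this flags the steps 03:51 -> 03:52 and
03:53 -> 03:54, which are the two points where the process size grows.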
That night it did not run out of swap space; instead it gave an
error message:
>pg_dump: Error message from server: lost synchronization with server:
> got message type "0", length 154143043
>pg_dump: The command was: COPY public.file (fileid, fileindex, jobid,
> pathid, filenameid, markid, lstat, md5) TO stdout;
But at that point the backup is in the middle of dumping a table
that contains lots of small records. There is no reason why a
154 MB "message" should be transferred between server and client
while copying records of ~60 bytes each.
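
A quick check of that number, as a sketch in Python. It assumes the
connection uses the PostgreSQL version 3 wire framing, where every message
is one type byte followed by a four-byte big-endian length; the post does
not state the protocol version, so treat that as an assumption. The
variable names are made up for illustration.

```python
import struct

REPORTED_TYPE = "0"           # message type from the pg_dump error
REPORTED_LENGTH = 154143043   # message length from the pg_dump error
RECORD_SIZE = 60              # approximate record size in bytes

# Re-encode the length the way it would have appeared on the wire
# (assumed: 4 bytes, big-endian, directly after the type byte).
length_bytes = struct.pack(">I", REPORTED_LENGTH)   # bytes 0x09 0x30 0x09 0x43
header = REPORTED_TYPE.encode("ascii") + length_bytes

# How many ~60-byte records would fit into one such "message"?
RECORDS = REPORTED_LENGTH // RECORD_SIZE

print("length bytes:", length_bytes.hex())
print("header as text:", repr(header))
print("records of ~60 bytes in one message:", RECORDS)
```

The four length bytes are TAB, '0', TAB, 'C', so the five bytes that
pg_dump took for a message header read as "0<TAB>0<TAB>C". That looks like
a piece of tab-separated COPY text rather than a real header, which
suggests (but does not prove) that the client got out of step with the
byte stream and then tried to allocate room for a 154 MB message.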
One other thing happened between 03:51 and 03:52: the DSL
internet connection disconnected and reconnected, and obtained a new
IP address. Afterwards, a script flushes and reloads an ipfw table()
with the new local addresses, and during this process one(!) packet
of the database session was dropped.
I could verify that correlation: every night when there were memory
problems, a few packets from the database backup were lost during the
firewall reconfiguration; on nights when no packets were lost, there
were no memory problems.
I will now change the firewall handling to get rid of that packet loss,
but I also need a refresher on how TCP works:
I thought TCP would not be disturbed by a lost packet, but would
automatically resend that packet until an ACK is received; and I thought
this happens below the application, so in practice the application
CANNOT go weird from a lost packet...
Is there any reason why this would not be true on a localhost connection?
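
To spell out the mental model I am asking about, here is a toy
stop-and-wait sketch in Python. It is NOT how a real TCP stack works
(real TCP uses sliding windows, timers and cumulative ACKs); the function
and variable names are made up. It only illustrates the expectation that a
lost segment costs a retransmission but never changes the bytes the
application reads.

```python
def deliver(segments, drop_first_try):
    """Toy model: resend each segment until it is acknowledged.

    segments: payloads (bytes) in send order.
    drop_first_try: set of segment indices whose first transmission is lost.
    Returns (byte stream seen by the application, number of transmissions).
    """
    received = bytearray()
    transmissions = 0
    for seq, payload in enumerate(segments):
        attempt = 0
        while True:
            transmissions += 1
            lost = attempt == 0 and seq in drop_first_try
            attempt += 1
            if lost:
                continue          # no ACK comes back, so the sender retries
            received += payload   # receiver accepts the segment and ACKs it
            break
    return bytes(received), transmissions


SEGMENTS = [b"row1\n", b"row2\n", b"row3\n"]
CLEAN = deliver(SEGMENTS, set())   # no loss
LOSSY = deliver(SEGMENTS, {1})     # first copy of the second segment is lost

print(CLEAN)
print(LOSSY)
```

In this model both runs hand the same byte stream to the application; the
lossy run just needs one more transmission. That is the behaviour I
expected from the real thing.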
rgds,
PMc