Re: [libpqxx-general] SIGPIPE received when connection is lost because the server is down

Leandro Lucarella Wed, 15 Nov 2006 09:03:14 -0800

Jeroen T. Vermeulen wrote:
> On Wed, November 15, 2006 07:03, Leandro Lucarella wrote:
> 
>> 1) The first problem I found is that when I stop the postgresql server
>> my program receives a SIGPIPE when doing, for example, a
>> transaction.exec(). I was expecting an exception.
> 
> I gather you're running the backend and your program on the same machine,


Not always. I use it both on the same machine and on another one, with
the same problem in both cases.

> otherwise you shouldn't see SIGPIPE at all.  The way it normally works is
> this:
> 
> 1. Your backend goes down, dropping its end of the connecting socket.
> 
> 2. The C API, libpq, gets an error return code on the next attempt to use
> the socket and handles it by noting that the connection has died.

No, the C API, libpq, does not use MSG_NOSIGNAL when send()ing and
recv()ing (I've checked the source code), so when the other end of the
connection goes down, a SIGPIPE signal is sent to the process.
The only way (that I know of) libpq could return an error code when this
happens is by adding the MSG_NOSIGNAL flag to the send() and recv() calls.
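
To illustrate the effect of the flag (which Linux spells MSG_NOSIGNAL),
here is a minimal sketch, assuming a Linux-like system. It does not use
libpq at all: a local socketpair() stands in for the client/server
connection, and send_to_closed_peer() is a made-up name for this example.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Sends one byte on a socket whose peer has already closed its end.
// Returns the errno value reported by send(), or 0 if send() succeeded.
// Assumption: MSG_NOSIGNAL is available (it is on Linux, not everywhere).
int send_to_closed_peer()
{
    int fds[2];
    const int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, fds);
    assert(rc == 0 && "socketpair() failed");
    (void)rc;

    close(fds[1]);  // simulate the server dropping its end

    const char byte = 'x';
    errno = 0;
    // With MSG_NOSIGNAL the kernel does not raise SIGPIPE for this call;
    // the failure is reported through the return value and errno only.
    const ssize_t n = send(fds[0], &byte, 1, MSG_NOSIGNAL);
    const int err = (n < 0) ? errno : 0;

    close(fds[0]);
    return err;
}
```

On such a system the call fails with EPIPE and the process keeps running.
Without the flag, and with SIGPIPE at its default disposition, the same
send() would terminate the process.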

> 3. When libpqxx sees this, it throws broken_connection.  It doesn't
> involve itself with signals at all, which helps portability and reduces
> the risk of interfering with your program.

Of course it is not libpqxx that raises the signal; it is the OS itself.

> So the place to look for detailed documentation on signal handling is the
> libpq documentation.  But I guess this is an issue that the libpqxx docs
> should at least mention.

Agreed. I've checked the FAQ's Troubleshooting section[1] (question
"Why does my program crash when it fails to connect to the database?")
and it doesn't mention this issue.

[1] http://thaiopensource.org/development/libpqxx/wiki/FaqTroubleshooting

> I haven't tried killing the backend while a libpq/libpqxx client was
> locally connected, so I haven't run across the SIGPIPE.

I insist this has nothing to do with running locally or remotely (at 
least if "backend" is what I think it is, the postgresql server, but 
maybe I'm wrong, I'm new to postgresql).

> As long as you
> don't let it terminate your program, however (just set it to SIG_IGN, for
> example) you should get the exception you were expecting.

Yes, that's what I plan to do, but I wanted to check whether there is a
more elegant solution, such as telling libpqxx to tell libpq to use
MSG_NOSIGNAL =), and/or to check whether this is a known issue and to
collect others' experience.
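
For reference, the SIG_IGN workaround can be sketched like this. It is
only an illustration under assumptions: a plain pipe with no reader
stands in for the dead server, and write_or_throw() is a hypothetical
helper, not part of libpq or libpqxx. It shows that once SIGPIPE is
ignored, the failed write surfaces as an ordinary error that can be
turned into an exception.

```cpp
#include <cassert>
#include <cerrno>
#include <csignal>
#include <cstring>
#include <stdexcept>
#include <string>
#include <unistd.h>

// Hypothetical helper: writes to fd and reports failure as an exception.
void write_or_throw(int fd, const void *data, std::size_t len)
{
    assert(fd >= 0);
    if (write(fd, data, len) < 0) {
        const int err = errno;  // save errno before anything can change it
        throw std::runtime_error(
            std::string("connection lost: ") + std::strerror(err));
    }
}

// Returns true if writing to a pipe with no reader produced an exception
// instead of killing the process.
bool broken_pipe_becomes_exception()
{
    // Process-wide setting: every SIGPIPE is ignored from here on.
    std::signal(SIGPIPE, SIG_IGN);

    int fds[2];
    if (pipe(fds) != 0)
        return false;
    close(fds[0]);  // no reader left: stands in for the dead server

    bool thrown = false;
    try {
        write_or_throw(fds[1], "x", 1);
    } catch (const std::runtime_error &) {
        thrown = true;  // write() failed with EPIPE, no signal delivered
    }
    close(fds[1]);
    return thrown;
}
```

The drawback is the one discussed in this thread: the setting affects the
whole process, so it is the main program, not a library, that should
normally make this call.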

>> 3) When the connection is lost because of a network problem, the
>> libpqxx methods (like transaction.exec()) keep waiting way too long
>> for the connection to be reestablished and then fail after a long time
>> with SIGPIPE again (but without the "FATAL" error message).
> 
> This can be a symptom of two known problems:
> 
> 1. There used to be a bug in libpq where only "broken pipe" was recognized
> as terminating a connection, but there's a separate error code for
> timeouts.
> 
> This one was actually discovered as a result of another libpqxx user
> running into the long timeout, so libpqxx could possibly be doing
> something to make it worse.  There certainly is a lot of retry logic in
> there.  On the other hand it was libpqxx's error handling that made it
> possible to pinpoint the problem, so perhaps that is the reason it wasn't
> fixed before.
> 
> IIRC the bug was fixed in updates of all supported major versions around
> the time 8.1 came out.

I'm using postgresql 8.1 and libpqxx 2.6.8. Can I rule out this
possibility?

> 2. Your OS may simply be taking a long time to give up on a network
> connection.  There's nothing I can do about it, but you can.  See below.

Well, there is. A lot of programs use a keep-alive to test the
connection, bypassing the long TCP timeouts. It's a hack, I know, but it
is all an application layer can do with TCP =)

>> I know none of these problems are really libpqxx's: 1) is because
>> libpq doesn't use the MSG_NOSIGNAL flag when send()ing or recv()ing
>> data with the socket (I know this is probably a feature, not a bug, but
>> I think it would be great and much more C++-friendly if you could raise
>> an exception instead of catching a signal).
> 
> It would, but libpqxx already throws the exception and all your program
> should need to do is stop the program from terminating when the signal
> arrives.  I think it's really up to the main program to decide what to do
> about signals.  If every library it links in feels free to mess with
> signal handling, where does it end?

It is not exactly "messing with signal handling", in the sense that you
don't even need to install a signal handler to avoid this; you just have
to add a flag to send() and recv(). The library provides a layer of
abstraction, and I don't care whether the connection is down because of
a SIGPIPE or something else. I don't even care whether the connection
uses TCP, a message queue or shared memory to talk to the server. All I
care about is that the connection is lost, and this should be reported
with an exception no matter what method you are using to talk to the
server.

But I know this is a hard topic to agree on, and anyway it is not a
libpqxx issue (or at least not one libpqxx could fix without support in
libpq).

> So I think the best I can do about this is to document it.

Fair enough.

>> 2) is again libpq's fault, but
>> is there any way to tell libpq to be quiet?
> 
> Sure.  Just create a nonnoticer object and pass an auto_ptr referencing it
> to the connection's set_noticer() function.

Great! Thanks.

>> 3) I guess it is just TCP's
>> fault for being so badass and waiting that long, but what about a keep-alive +
>> TTL to try to figure out when the connection is lost in a shorter time
>> (like the "connect_timeout" parameter in the connection string, which
>> works only when connecting, but not when doing a query for example).
> 
> There are ways of changing how your kernel sees timeouts without messing
> with the IP packets, but they'll be OS-dependent. See ip(7) and tcp(7).
> 
> 
>> I'm open to suggestions, both workarounds for my code and enhancements
>> to libpqxx/libpq.
> 
> Three recommendations: set SIGPIPE to SIG_IGN; ensure your libpq is up to
> date; and if you still have the slow timeouts after that, mess with your
> networking stack (very carefully of course) to make it give up faster.

So the keep-alive solution is ruled out? I don't like to mess around
with the general TCP configuration because postgres is not the only
service on the machine, and the other services don't need such short
timeouts.
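
To make the idea concrete: on Linux, keep-alive can be tuned for a single
socket rather than for the whole machine. The sketch below assumes Linux
(TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are not portable), and
enable_keepalive() and keepalive_idle() are made-up names. libpq exposes
the connection's descriptor through PQsocket(); how to reach it from a
libpqxx connection is not covered here.

```cpp
#include <cassert>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

// Enables TCP keep-alive on one socket only (Linux-specific options).
//   idle_s:     seconds of idleness before the first probe is sent
//   interval_s: seconds between probes
//   probes:     unanswered probes before the connection is dropped
// Returns true if every option was set successfully.
bool enable_keepalive(int fd, int idle_s, int interval_s, int probes)
{
    assert(fd >= 0);
    const int on = 1;
    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) == 0
        && setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                      &idle_s, sizeof idle_s) == 0
        && setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
                      &interval_s, sizeof interval_s) == 0
        && setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,
                      &probes, sizeof probes) == 0;
}

// Reads back TCP_KEEPIDLE so the setting can be checked; -1 on error.
int keepalive_idle(int fd)
{
    int value = 0;
    socklen_t len = sizeof value;
    if (getsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &value, &len) != 0)
        return -1;
    return value;
}
```

With, say, enable_keepalive(fd, 30, 5, 3), a dead peer on an idle
connection would be noticed after roughly 30 + 3 * 5 seconds. Note that
keep-alive probes only cover idle connections; while sent data is still
waiting to be acknowledged, the normal retransmission timeout applies.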

Thanks for your time.

-- 
Leandro Lucarella
Integratech S.A.
4571-5252
