Hi Mark,

On Thu, Feb 23, 2012 at 03:48:08PM +0000, Mark Brooks wrote:
> Thanks Willy, We have re-tested the replication across haproxy
> reload/restart and it appears it was working as you suggested. So
> apologies there.
You don't have to apologize, you might have encountered a real bug
which only appears once in a while. As I often say, reporting uncertain
bugs is better than nothing; at the very least it can prompt other
people to report a "me too".

Also, at Exceliance during some preliminary native-SSL tests, one of
our engineers noticed a bug which could possibly affect peers
replication after some error scenarios occur. It looks like if some
errors happen on the connection after a full replication, the next
connections will not necessarily restart the replication. It might be
what you observed. The fix has been pushed into the master tree and I'm
planning a -dev8 next week since enough fixes have stacked up there.

> We have seen that when restarting or reloading, the table syncs
> between 2 processes on the same box and also when it syncs to a
> remote peer, but the persistence timeout counter is reset to the
> maximum value and not carried with it.
> Is it possible to request that the persistence timeout counters
> sync across this restart/reload?

No, timers are not exchanged, only the server ID. A number of other
things would need to be synced too (eg: counters, etc...) but that's
still quite difficult to do, so for now sessions are refreshed upon
synchronization just as if there had been activity on them.

> It has however raised another question - how best to clear the tables
> on all appliances at the same time.

I unfortunately have no solution to this problem right now and I know
for sure that it can be annoying sometimes. It's not even
haproxy-specific, it's a general problem of how to make a piece of
information disappear from a global system when it's replicated in
real time and you can only destroy it on a single node at a time. Some
solutions would possibly involve sending deletion orders to other
nodes or just updating their expiration timers, I don't know for now.
I think it will be easier, or at least less critical, once the
expiration timers are shared!

(...)
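In the meantime, tables can at least be flushed node by node through
the admin stats socket. A rough sketch (the socket path and the table
name "bk_web" are just examples, adjust them to your setup):

```shell
# Flush all entries of the stick-table attached to backend "bk_web"
# on this node (does not propagate to peers):
echo "clear table bk_web" | socat stdio /var/run/haproxy.sock

# Check what remains in the table afterwards:
echo "show table bk_web" | socat stdio /var/run/haproxy.sock
```

This only acts on the local node, which is exactly the limitation
discussed above: entries still present on a peer can be pushed back.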
> The only thing we have been able to come up with so far is to put
> each of the backend servers in maintenance mode first so they stop
> accepting new connections, then clear the tables, then bring them
> back on-line again.

I think you could proceed differently: break the replication between
the nodes, clear all tables, then reopen replication. At least that
would not block user access nor traffic.

Regards,
Willy
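PS: for what it's worth, the procedure above could be scripted per
node roughly like this. Everything here is a hypothetical sketch: the
config paths, pid file and table name are examples, and "breaking
replication" is assumed to mean reloading with a config whose peers
section has been removed or commented out:

```shell
# 1. Reload with a config that has no peers section, so the node
#    stops exchanging table updates (soft reload, no traffic loss):
haproxy -f /etc/haproxy/haproxy-nopeers.cfg -p /var/run/haproxy.pid \
        -sf $(cat /var/run/haproxy.pid)

# 2. With replication broken, flush the table on this node:
echo "clear table bk_web" | socat stdio /var/run/haproxy.sock

# 3. Once every node has been cleared, reload the original config
#    to re-enable replication between the now-empty tables:
haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid \
        -sf $(cat /var/run/haproxy.pid)
```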

