On 12/24/2009 at 03:34 AM, Jiaju Zhang <[email protected]> wrote: 
> Hi all, 
>  
> Thank you very much for all the suggestions and comments. Here are 
> some further thoughts on this feature. Also RFC :) 
>  
> My original thought was to write a simple implementation of tickle 
> ACK to cater to most users' needs, but I now realize that real HA 
> deployments are complicated, and I'm not sure the solution we've 
> discussed above can really achieve that goal. 
> 
> One problem: will the users who really need this feature generally 
> have shared storage or a configured DRBD?

Anyone doing HA file-level storage (NFS, Samba, etc.) will by
definition have something suitable available (although they'll
want to configure it such that the tickle directory is not
exported/shared to clients).

I expect the same to be true for HA MySQL (your databases are on
a shared/drbd filesystem).

There may not be anything suitable available for HA block-level
storage; if all you're doing is exporting iSCSI targets, you may not
have a shared filesystem to speak of, just a whole lot of opaque
blobs that look like block devices.

Likewise, an HA database that is backed by a raw block device
rather than a filesystem won't give us what we need (do we have
any of these?  What does Oracle do?)

Same for VMs; they could be on a shared filesystem (useful), or
just on block devices (not useful).

Open question: of the above examples, how many need tickle ACKs?
File and block level storage can certainly benefit.  Presumably
databases where clients generally open persistent connections
benefit.  What about VMs?
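
Aside, for anyone skimming the thread: the "tickle" itself is just a
forged bare ACK with a bogus sequence number, sent to the client as if
from the cluster IP.  The client's reply then hits the new node's TCP
stack, which knows nothing of the connection and answers with an RST,
so the client notices the old connection is dead and reconnects.  As I
understand it, this is essentially the trick ctdb uses for Samba.  A
rough, untested Python sketch (the helper and all addresses/ports are
illustrative only; the raw socket needs root):

  import socket
  import struct

  def tcp_checksum(src_ip, dst_ip, tcp_hdr):
      # One's-complement sum over the IPv4 pseudo-header + TCP header.
      pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
                + struct.pack("!BBH", 0, socket.IPPROTO_TCP, len(tcp_hdr)))
      data = pseudo + tcp_hdr
      if len(data) % 2:
          data += b"\x00"
      total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
      total = (total >> 16) + (total & 0xffff)
      total += total >> 16
      return ~total & 0xffff

  def send_tickle(src_ip, src_port, dst_ip, dst_port):
      # Bare ACK, seq/ack deliberately bogus (0), no payload.
      hdr = struct.pack("!HHIIBBHHH",
                        src_port, dst_port,
                        0, 0,        # bogus seq / ack
                        5 << 4,      # data offset: 5 words, no options
                        0x10,        # flags: ACK only
                        0,           # window
                        0,           # checksum (filled in below)
                        0)           # urgent pointer
      csum = tcp_checksum(src_ip, dst_ip, hdr)
      hdr = hdr[:16] + struct.pack("!H", csum) + hdr[18:]
      # The kernel builds the IP header; bind() pins the source address
      # to the cluster IP so the checksum stays valid.
      s = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_TCP)
      s.bind((src_ip, 0))
      s.sendto(hdr, (dst_ip, 0))
      s.close()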

Does anyone else have any other examples?

> If most users do, I think we can go forward. If not, we could 
> consider implementing this by calling the Corosync/OpenAIS API to 
> sync the TCP connection information across the cluster. That is a 
> little expensive, since we only want to sync information related to 
> this feature. And if a single service group (I mean not a clone 
> scenario like a cluster IP) is the most common use case, we needn't 
> send the TCP connection information to all the other nodes, since 
> only the one node that takes over the IP will use it. But certainly 
> the Corosync API or an OpenAIS service can do the job, especially in 
> the scenario where the user doesn't have cluster-visible storage. 

IMO a solution that doesn't rely on shared storage is preferable.

> The other problem is that we monitor the established TCP connections 
> at a fixed interval, so it is not very precise: things may have 
> changed within an interval. An event-driven mechanism would be ideal: 
> when the TCP connections change, the kernel notifies user space, and 
> a daemon in user space then handles that info. It seems tcp_diag can 
> provide this function on the kernel side; we just need to write the 
> user-space program that talks to tcp_diag. (Is that so?) I'm going to 
> investigate this, and if it is feasible I'd like to implement it. If 
> anyone knows more about tcp_diag, has other ideas about how to 
> implement the event-driven mechanism, or thinks there's no need to 
> try this, please comment :) 
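
For what it's worth: the kernel side of tcp_diag is the inet_diag
netlink interface, and it's dump-on-request rather than event-driven,
i.e. you still poll it; whether the kernel can push change events is
exactly the open question.  An untested sketch of dumping established
TCP connections from user space, assuming the inet_diag_req_v2 /
SOCK_DIAG_BY_FAMILY request format from linux/inet_diag.h on newer
kernels (NLMSG_ERROR handling omitted):

  import socket
  import struct

  NETLINK_SOCK_DIAG = 4      # netlink family for socket diagnostics
  SOCK_DIAG_BY_FAMILY = 20   # request type
  NLM_F_REQUEST = 0x01
  NLM_F_DUMP = 0x300
  NLMSG_DONE = 3
  TCPF_ESTABLISHED = 1 << 1  # bit for TCP_ESTABLISHED (state 1)

  def dump_established():
      sk = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW,
                         NETLINK_SOCK_DIAG)
      # struct inet_diag_req_v2: family, protocol, ext, pad, states,
      # plus a zeroed 48-byte inet_diag_sockid (i.e. no filtering).
      req = struct.pack("=BBBBI48s", socket.AF_INET, socket.IPPROTO_TCP,
                        0, 0, TCPF_ESTABLISHED, b"\x00" * 48)
      # struct nlmsghdr: len, type, flags, seq, pid.
      nlh = struct.pack("=IHHII", 16 + len(req), SOCK_DIAG_BY_FAMILY,
                        NLM_F_REQUEST | NLM_F_DUMP, 1, 0)
      sk.sendto(nlh + req, (0, 0))   # pid 0 == the kernel

      conns, done = [], False
      while not done:
          data = sk.recv(65535)
          off = 0
          while off < len(data):
              msg_len, msg_type = struct.unpack_from("=IH", data, off)
              if msg_type == NLMSG_DONE:
                  done = True
                  break
              # inet_diag_msg follows the 16-byte nlmsghdr; its sockid
              # (ports, then addresses) starts 4 bytes in.
              sport, dport = struct.unpack_from("!HH", data, off + 20)
              src = socket.inet_ntoa(data[off + 24:off + 28])
              dst = socket.inet_ntoa(data[off + 40:off + 44])
              conns.append((src, sport, dst, dport))
              off += (msg_len + 3) & ~3   # NLMSG_ALIGN
      sk.close()
      return conns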
>  
> So there is another way to implement this feature (OpenAIS API + 
> tcp_diag); it is a little more complicated, but should be more 
> precise, right? Still, I would like to implement the simple way 
> first, in the hope that it meets most users' needs. I'd really 
> appreciate your input (especially on what most production 
> environments look like, and what most admins are complaining about). 

I can't say much about tcp_diag beyond that rough sketch (I haven't
looked at it closely), but I'd like to comment on a couple of things
in your earlier email:

On 12/24/2009 at 01:55 AM, Jiaju Zhang <[email protected]> wrote: 
> For cluster IP clones, if one node dies, you need some point at which 
> the other nodes do the tickle. In the original patch, the tickle is 
> invoked when the IPaddr2 RA starts, but for clones no further 
> resource group will start when a node dies. So I think it can be 
> added to the monitor operation. That is not event-driven, but many 
> things are neither event-driven nor precise, so I think this should 
> be acceptable. The second thing is how you know who is dead; you can 
> easily get this info via Pacemaker, but the "hostname" part should 
> also be kept, since we use it to differentiate the info from 
> different nodes. The last thing is who should do the tickle; I think 
> I can let the DC do it, or have every other live node do it as well. 

I'm really not sure how best to apply this to cluster IP clones.
Only one node should do the tickle though.

> Another important thing I think we should address is whether the 
> tickle feature should be added to the IPaddr2 RA at all. When you 
> deploy your HA solution, sometimes you configure the application 
> service to start after IPaddr2, but sometimes you configure IPaddr2 
> as the first resource and start the application afterwards. In the 
> latter case, if you tickle ACK when IPaddr2 starts but the real 
> service application hasn't started yet, the user may see an error 
> like "Port is not reachable", which is poor usability. So we may need 
> to do the tickle only once the application is ready. One simple 
> implementation is to put the tickle feature in a separate RA and add 
> it last in the service group when you deploy. Does this make sense? 
> If yes, I'll implement it :) 

Yes, this is a good point.  It may be that we actually want to do
something like this:

  start:
    1) add iptables rule to drop incoming packets to IP address
    2) bring up IP address 
    3) bring up HA service (database, storage, web server, whatever)
    4) remove iptables blocking rule
    5) perform tickle ack

  stop (reverse of above, but fewer steps necessary):
    1) add iptables rule to drop incoming packets to IP address
    2) stop HA service
    3) bring down IP address

In the "start" case, I can imagine the IPaddr2 RA doing steps 1 and 2,
whatever existing RA(s) doing step 3, then a separate "tickle" RA doing
steps 4 and 5.  Likewise in reverse for stop.  Without something like
this, there's at least two windows of opportunity where clients are
either refused, or see the connection close (between steps 2 & 3 during
"start", and any time after step 2 in "stop" when doing a clean migrate
from one node to another).
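
A very rough sketch of what that start/stop sequence could look like
(shelling out via Python purely for illustration; the address, prefix,
interface, and init script names are all made up, and error handling /
rollback is omitted):

  import subprocess

  CLUSTER_IP = "192.168.122.100"   # hypothetical cluster IP
  PREFIX = "24"
  NIC = "eth0"

  def run(cmd):
      subprocess.check_call(cmd.split())

  def start():
      # 1) drop incoming packets to the IP while we bring things up
      run("iptables -I INPUT -d %s -j DROP" % CLUSTER_IP)
      # 2) bring up the IP address
      run("ip addr add %s/%s dev %s" % (CLUSTER_IP, PREFIX, NIC))
      # 3) bring up the HA service (placeholder init script)
      run("/etc/init.d/my-ha-service start")
      # 4) remove the blocking rule; clients can connect again
      run("iptables -D INPUT -d %s -j DROP" % CLUSTER_IP)
      # 5) tickle the previously recorded connections (see the
      #    send_tickle() sketch earlier in this mail)

  def stop():
      # 1) block incoming packets to the IP
      run("iptables -I INPUT -d %s -j DROP" % CLUSTER_IP)
      # 2) stop the HA service
      run("/etc/init.d/my-ha-service stop")
      # 3) bring down the IP address
      run("ip addr del %s/%s dev %s" % (CLUSTER_IP, PREFIX, NIC))
      # housekeeping: drop the now-redundant blocking rule
      run("iptables -D INPUT -d %s -j DROP" % CLUSTER_IP)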

Regards,

Tim


-- 
Tim Serong <[email protected]>
Senior Clustering Engineer, Novell Inc.

