Hi Klaus,

Thanks, the Linux-HA setup is fairly complete, I think. The problem is the Lustre timeout: the clients do not try the "other" one fast enough. Those are the Lustre timeout(s) I want to change, along with some recovery options.
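For reference, a minimal sketch of where this timeout is exposed, assuming Lustre 1.6 and a filesystem named "fsname" (a placeholder). The value 30 is purely illustrative, not a recommendation, and a value that is too low can lead to spurious evictions under load:

```sh
# On any client or server: show the current Lustre RPC timeout, in seconds.
cat /proc/sys/lustre/timeout

# On the MGS node: set the timeout for the whole filesystem (persistent).
# "fsname" and 30 are placeholders.
lctl conf_param fsname.sys.timeout=30
```

As far as I understand, both the time a client waits before trying the failover NID and the length of the server recovery window are derived from this value, so it is probably the first setting to look at.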
OT: Where can I read more about "recovery" in Lustre? I've heard words like replay/recovery in some discussions here, and I'm not sure I know 100% what these really mean (from a Lustre point of view; the words themselves are crystal clear ;-)). It seems like my manual is too old.

Regards,
Timh

2008/9/25 Klaus Steden <[EMAIL PROTECTED]>:
>
> Hi Timh,
>
> If you're using Linux-HA, you can configure how quickly failover takes
> place. I have mine set to 90 seconds before the primary is marked dead and
> the secondary takes over.
>
> When this occurs, any Lustre transactions not yet in flight will block until
> the ones that were in progress at the time of the failure have either had a
> chance to complete or have timed out.
>
> I'm not sure how to modify Lustre-specific settings for recovery time,
> though.
>
> cheers,
> Klaus
>
>
> On 9/25/08 1:54 PM, "Timh Bergström" <[EMAIL PROTECTED]> did etch on
> stone tablets:
>
>> To follow up on this matter, I've currently set ha/drbd as suggested,
>> formatted the OSTs with double mgsserver directives and also mounted
>> with double addresses on the clients, as
>> [EMAIL PROTECTED]:[EMAIL PROTECTED]:/fsname -
>> though, if I fail mgs/mdt 1 it does not recover (in a reasonable time),
>> what kinds of tuning/settings will affect this?
>>
>> //Timh
>>
>> 2008/9/23 Timh Bergström <[EMAIL PROTECTED]>:
>>> Thank you, that's the path I've taken from the last message on this
>>> list, as I misunderstood some of the drbd/ha setups before. However,
>>> using 4 mgsnode "paths", is that recommended or should I use one
>>> mgspath per node and use the other as some sort of manual failover?
>>>
>>> Regards,
>>> Timh
>>>
>>> 2008/9/23 Kevin Van Maren <[EMAIL PROTECTED]>:
>>>> Note that you do not normally use IP takeover with Lustre/Heartbeat: you
>>>> set the failover IP addresses with the mkfs.lustre command, and Lustre
>>>> reconnects to the _other_ address when it is disconnected.
>>>>
>>>> In your case, you would have 2 fixed addresses for each node (w/o heartbeat
>>>> - do NOT use the heartbeat virtual IP addresses), and specify both those
>>>> failover NIDs (rather than just 1).
>>>>
>>>> Lustre1.6 is a bit different from a lot of HA/Heartbeat users: Lustre
>>>> _knows_ about the multiple paths/addresses, and simply requires Heartbeat
>>>> to ensure it is mounted on exactly one node in the failover pair: it does
>>>> NOT rely on IP takeover for HA.
>>>>
>>>> Kevin Van Maren
>>>>
>>>>
>>>> Timh Bergström wrote:
>>>>>
>>>>> 2008/9/23 Brian J. Murrell <[EMAIL PROTECTED]>:
>>>>>
>>>>>> On Tue, 2008-09-23 at 15:06 +0200, Timh Bergström wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>
>>>>> Hi again, and thanks for the quick reply!
>>>>>
>>>>>>>
>>>>>>> My (current) modprobe:
>>>>>>>
>>>>>>> options lnet networks=tcp0(eth0)10.4.21.50,tcp1(eth1)10.4.22.50
>>>>>>>
>>>>>>
>>>>>> This syntax is incorrect. For some examples of multi-homed
>>>>>> configurations see the manual at
>>>>>>
>>>>>> http://manual.lustre.org/manual/LustreManual16_HTML/MoreComplicatedConfigurations.html#50642998_20213
>>>>>>
>>>>>
>>>>> Yes, that's the link I've been consulting, perhaps I'm not looking hard
>>>>> enough.
>>>>>
>>>>>>>
>>>>>>> These are the errors I get:
>>>>>>> LustreError: 10f-e: Error parsing
>>>>>>> 'networks="tcp0(eth0)10.4.21.50,tcp1(eth1)10.4.22.50"'
>>>>>>>
>>>>>>
>>>>>> When you specify "networks" because you specify the interfaces to use,
>>>>>> you don't need to specify the ip address. I think you are confusing the
>>>>>> networks and ipnets options.
>>>>>>
>>>>>
>>>>> The problem here exactly is that the physical interfaces are there, but
>>>>> not with the IP addresses I want the mdt to "listen" on - the "NIDs",
>>>>> they are added later through heartbeat as aliases (IPaddr2::10.4.21.50
>>>>> IPaddr2::10.4.22.50), but before mounting the mdt-resource (drbd).
>>>>>
>>>>>>>
>>>>>>> LustreError: 110-0: here...............................|---------|
>>>>>>> LustreError: 4527:0:(events.c:707:ptlrpc_init_portals()) network initialisation failed
>>>>>>> (along with a bunch of errors since this module does not load)
>>>>>>> I've tried with tcp0(eth0:0) which fails with about the same error,
>>>>>>> I've tried tcp0(eth0,eth1) which gives me the wrong addresses (machine
>>>>>>> ones) but works.
>>>>>>>
>>>>>>
>>>>>> What is the topology exactly? Are there two nics or one nic with two
>>>>>> addresses? Are the two nics on the same physical network or separate
>>>>>> physical networks?
>>>>>>
>>>>>
>>>>> eth0 and eth1 are physical interfaces, they have statically assigned
>>>>> IPs (for management, supervision etc), heartbeat then adds addresses
>>>>> to these two interfaces if the node is "primary".
>>>>>
>>>>> If it matters - eth0 and eth1 have separate physical paths to
>>>>> everything, this is because we want to survive a physical failure on the
>>>>> network before failing over to another physical server.
>>>>>
>>>>> As I read the manual, I format my OSTs with more than one --mgsnode
>>>>> option, which in turn will make the OST "know" about both paths to
>>>>> the MDS/MGS server(s). As in, if first MGS does not work (physical
>>>>> network failure on side A) - try second (physical side B).
>>>>>
>>>>> What we healthcheck on is the data/disks/server hardware which will
>>>>> tell heartbeat to fail over to server 2 which takes over network path
>>>>> A and network path B (on 10.4.[21,22].50), and the OSTs/clients
>>>>> should continue working without noticing.
>>>>>
>>>>>>
>>>>>> b.
>>>>>>
>>>>>> _______________________________________________
>>>>>> Lustre-discuss mailing list
>>>>>> [email protected]
>>>>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>>>>>
>>>
>>> --
>>> Timh Bergström
>>> System Administrator
>>> Diino AB - www.diino.com
>>> :wq
>>>

--
Timh Bergström
System Administrator
Diino AB - www.diino.com
:wq
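
For reference, the advice quoted in this thread (interface names only in the lnet "networks" option; fixed per-node addresses given as failover NIDs instead of Heartbeat virtual IPs) might look roughly like the sketch below. The .51/.52 addresses, the device path and the mount point are made-up placeholders, not values from this thread, and the exact options should be checked against the manual for the Lustre version in use:

```sh
# /etc/modprobe.conf line on each server (a config line, not a shell command).
# Interfaces are named without IP addresses:
#   options lnet networks=tcp0(eth0),tcp1(eth1)

# Format an OST so that it knows both MGS nodes. NIDs belonging to the same
# node are separated by commas; each --mgsnode names one node of the pair.
mkfs.lustre --fsname=fsname --ost \
    --mgsnode=10.4.21.51@tcp0,10.4.22.51@tcp1 \
    --mgsnode=10.4.21.52@tcp0,10.4.22.52@tcp1 \
    /dev/sdX

# Client mount: the two MGS nodes are separated by a colon, so the client
# can try the second node when the first one does not answer.
mount -t lustre \
    10.4.21.51@tcp0,10.4.22.51@tcp1:10.4.21.52@tcp0,10.4.22.52@tcp1:/fsname \
    /mnt/fsname
```

With this layout Heartbeat only has to make sure the MGS/MDT is mounted on exactly one of the two nodes; no address needs to move between them.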
