On Thu, Sep 26, 2002 at 08:49:07AM -0400, David Johnston wrote:
> On Wed, 2002-09-25 at 19:42, Hans Ekbrand wrote:
> > On Wed, Sep 25, 2002 at 02:16:40PM -0400, David Johnston wrote:
> > > On Wed, 2002-09-25 at 05:19, Tom Lisjac wrote:
> > [...]
> > > > I'd like to set these labs up in other schools but the single point
> > > > of failure and lack of scalability makes me nervous
> 
> > > Tom,
> > > I would run dhcpd on one machine, rsync the dhcpd.leases file from
> > > the server to the second machine every hour or so, and use the HA
> > > heartbeat to start the second machine's dhcpd whenever the primary
> > > failed.
> > The point of alternating the dhcp servers is not that they themselves
> > put load on their respective boxes, but to make the servers share the
> > login sessions, since the server that offers the
> > IP/kernel/NFS-root will also be the server that the workstation
> > queries for a login prompt.
> AH.  I completely missed that possibility.  Is there another way to
> accomplish this?

If you mean to load-balance the login sessions (at the XDMCP level), then
yes, I think so: edit rc.local and make the workstations use X -broadcast
rather than X -query ip.of.application.server.

I haven't tested how fair the load-balancing will be, though.
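
Roughly like this (a sketch only; the path to the X server and the exact
startup script differ between ltsp versions, and ip.of.application.server
is just the placeholder from above):

  # in the workstation's rc.local (or wherever X is started),
  # instead of querying one fixed application server:
  #   /usr/X11R6/bin/X -query ip.of.application.server
  # let the workstation take whichever server answers the XDMCP broadcast:
  /usr/X11R6/bin/X -broadcast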

> > > To divide the load, you can set up specialized servers.  One runs your
> > > window managers, another runs all the browsers, another runs your office
> > > suite.  In this setup, it's possible for one app to be unavailable while
> > > everything else continues to work.
> > 
> > This is not what the OP wanted. "If one server goes down, I'd like the
> > lab to simply slow down... not stop."
> I realize that.  I was trying to point out that linux-ha won't get us
> what the OP wanted, but that it is possible to mitigate the risks.  I'm
> sorry I wasn't clear; re-reading the message, I think I accidentally
> edited out part of what I was getting at, which is that we aren't really
> ready for what the OP wants.  

I think we are ;-)

> > With the exception that the users currently logged in to the server
> > that goes down will lose their sessions, a reboot of the workstations
> > should give a login prompt to the server that is left. (Depending on
> > the backup/sync routines used, some data in ~/ can be lost, but only
> > (some of the) changes done in the lost session; users who want better
> > crash recovery than that are simply not realistic, but they might be
> > frequent in a "local elementary school" ;-)
> > 
> > > As an alternative, you can use the linux-ha heartbeat software to set up
> > > a fallback server.  If the primary server goes down, the workstations
> > > will all fail but they will be able to sign into the fallback server
> > > almost immediately.  For this to work, you have to use something like
> > > NAS so that losing a server doesn't mean losing access to the
> > > data. 
> > 
> > What is NAS?
> NAS is "Network Attached Storage".  You set up a file server (NFS or
> SMB) that is only a file server, and that only communicates with your
> LTSP servers.  This is based on the principle (or is it just a hope?)
> that a single-purpose machine is less likely to fail than a
> multi-purpose machine.  This machine will hold users' files.  The LTSP
> servers mount /home from the NAS.  This way, if an LTSP server goes
> down, the files are still available and you don't need rsync, et cetera.

A. You don't NEED NAS for something like that to work, and I would
actually try to avoid it.

B. That a single-purpose machine is less likely to fail is only true
when it comes to the software, but the risks I am concentrating on are
hardware failures (primarily hard disks, but also PSUs, and of course
there can be other hardware failures). Some hardware failures might be
easy to fix (e.g. replacing a network cable), but if you/the administrator
is not available, you might end up with many angry users.

> > What do you think of my suggestion of a check in
> > Xstartup whether ~/ is in sync with the other server, combined with a
> > logout script that syncs and leaves a file in ~/ that says that ~/ is
> > in sync?
> The problem I see with your Xstartup idea is that the only time your
> sync. script is necessary is when a server has gone down, which is also
> the only time when the sync. script cannot access the data it needs.

No, it is also needed if the user ends the session in a way that doesn't
trigger the logout sync script, e.g. when the user:

* just pushes the power button on the terminal
* issues kill -9 on the window manager
* presses ctrl-alt-backspace
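
To make the idea concrete, here is a rough sketch of the two hooks I
have in mind (the peer server name and the marker file are only
placeholders; where the hooks run, and as which user, depends on the
display manager setup, and rsync over ssh would need passwordless keys):

  #!/bin/sh
  # logout hook (e.g. run at the end of the session) -- hypothetical sketch
  PEER=server2.example.org                  # placeholder: the other ltsp server
  rsync -a --delete "$HOME/" "$PEER:$HOME/" \
      && touch "$HOME/.in-sync"             # marker: home dir synced cleanly

  #!/bin/sh
  # Xstartup fragment -- hypothetical sketch
  if [ ! -e "$HOME/.in-sync" ]; then
      # the last session did not end with a clean sync; warn the user
      xmessage "Warning: your home directory may not be in sync with the other server." &
  fi
  rm -f "$HOME/.in-sync"                    # recreated only at a clean logout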

> If you must keep two machines in sync, you have to do it in real time
> (or close to it).  One way is to cross-connect the SCSI chains of the
> two servers; each server has two disks, its own and its brother's
> failover.  When one server detects that its brother is down, it mounts
> its brother's failover disk and does its brother's job until the failed
> machine recovers.
> 
> However, neither NAS nor interconnected SCSI addresses the issue of
> incomplete file updates (ie, a server or client crash in the middle of a
> file update).  Current thinking seems to be that this cannot be
> addressed at the hardware or O/S level; it must be addressed at the
> application level.

I think it is a good thing NOT to sync in real time, since if the
server you are working on is having hardware problems you might end up
with corrupted data, and that is something you don't want synced.

> For example, when Galeon starts up after a Galeon crash, it recovers my
> previous session as bookmarks.  This is part of Galeon.
> 
> As a better example, any decent database server can take a series of
> transactions and only commit them if the complete series is successful. 
> However, for this to work, the DB frontend (ie, the application) has to
> tell the server that a given series of transactions are interdependent.
> 
> We have to depend on the applications (Open Office, Galeon, etc) to do
> this.

Agreed.

> > > If the data are rapidly changing and critical, you can use AFS or
> > > shared-scsi disks.
> > 
> > For "a local elementary school" that might be an overkill ;-)
> 
> I think you're right.  I would like to ask the group to discuss the
> possibility that what OP wants is overkill for a local elementary
> school, as well.  I went down the same road the OP is going down
> (eliminating all single points of failure) for a business client, and
> ran into several potential solutions and a lot of dead ends.

I just want to be assured that if one ltsp-server goes down (by
hardware failure) users can continue to use their terminals (though
their current session will be lost). By giving the users access to the
sync script (e.g. via a menu entry or a button on the desktop), they can
even be assured that their latest saved work will not be lost even if
their current server goes down hard.

That could be an important feature of, and a selling point for, the
ltsp concept.
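
Something along these lines could sit behind that menu entry or button
(again just a sketch; the host name is a placeholder, and you would need
ssh keys or an rsync daemon so it can run without asking for a password):

  #!/bin/sh
  # sync-now -- hypothetical "save my work on the other server" button
  PEER=server2.example.org                  # placeholder: the other ltsp server
  if rsync -a --delete "$HOME/" "$PEER:$HOME/"; then
      touch "$HOME/.in-sync"
      xmessage "Your files are now stored on both servers." &
  else
      xmessage "Sync failed -- please tell the administrator." &
  fi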

>  I have
> since tried a different tack; instead of trying to eliminate the
> possibility of failure, I'm trying to minimize the effects of a
> failure.  In other words, I can't promise that a workstation won't
> crash, but if it does you should be back at work in under a minute.
> 
> I think it would be great if this discussion gets us around some of
> these dead ends I found.

What were they?

-- 

Hans Ekbrand
