I think that server should be able to handle 10 clients or more, based on my experience. It'll depend somewhat on what applications you're running, but for basic office stuff 10 clients should be no problem.
Check /etc/ltsp/dhcpd.conf to make sure that you have more than 5 addresses
in your dynamic pool. Look for the line that starts with "range".

Lastly, are you sure you're using a switch and not a hub? A hub can cause
the kind of problems you're seeing. I'd bet money that this is your problem.

-Rob

Mark David Dumlao wrote:
> Hello ltsp-discuss list!
>
> I'm setting up a thin client laboratory for a computer school in the
> Philippines, in the Visayas region, and I'm encountering some
> difficulties with getting the clients to boot up.
>
> Problem statement:
> I have a lab of 20 units booting off of a lower-middle-class desktop
> converted into an Ubuntu LTSP server with increased RAM and disk space.
> The server is able to boot all of the clients individually without
> hitches, recording individual boot times of 1-3 minutes in isolation.
> However, when 4 or 5 clients start up, the rest of the lab is unable to
> start up or even get IP addresses. Also, when booting several units at a
> time, the boot process sometimes hangs for 3-5 minutes before entering
> the login screen.
>
> Details:
> I am running Ubuntu 8.04 LTS (Hardy Heron), and using it to boot up an
> entire lab of second-hand, beat-up Pentium III units. My server specs
> are as follows:
>
> Server:
> processor: Pentium 4, 2.4GHz, 512 KB cache
> RAM: 2GB DDR SDRAM
> disks: 2x 80GB PATA hard disks -> software RAID
>   md0 = 10 GB root partition
>   2 GB swap per hard disk
>   md1 = 55 GB home partition
> NICs: 2x 100 Mb/s Ethernet
>   eth1 - 192.168.1.8 (facing the school network)
>   eth2 - 192.168.11.254 (facing the lab network [thin clients])
>
> As you may have noticed, it's not a brand-new server, but rather one of
> the units that was lying around that I happened to notice and say,
> "hey, that could totally work." The only thing new about it is the RAM
> and the hard disks, which are second-hand.
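[For reference, a dhcpd.conf subnet stanza for the 192.168.11.0/24 lab
network in this message might look roughly like the following. The exact
pool boundaries and boot filename are illustrative assumptions; the point
is that the "range" line must cover comfortably more than 20 leases:]

```conf
subnet 192.168.11.0 netmask 255.255.255.0 {
    # Pool large enough for the whole lab, with headroom:
    range 192.168.11.20 192.168.11.250;
    option routers 192.168.11.254;
    option domain-name-servers 192.168.11.254;
    option broadcast-address 192.168.11.255;
    # TFTP server and boot file handed to network-booting clients:
    next-server 192.168.11.254;
    filename "/ltsp/i386/pxelinux.0";
}
```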
> The disks are used to run a software RAID1 setup, because I was under
> the impression that RAID1 could theoretically increase my disk read
> speeds to about double.
>
> I did a fresh install of Ubuntu 8.04.1 on the server. Ubuntu's installer
> has an LTSP server install mode. It puts LTSP5 on the server, and
> creates boot disks for the clients. The clients are using a universal
> multidriver Etherboot boot diskette image which I downloaded from the
> net, called eb_on_hd. The diskette is available from this site:
> http://etherboot.anadex.de/
>
> It's really neat. It also works for Unattended (the network install
> system mentioned below).
>
> As for the clients, this laboratory consists of 20 refurbished Pentium
> III units. The most RAM any of them has is 128MB; some of them have
> only 96MB. About 19 of them have Fast Ethernet (100 Mbps) cards,
> although one of them has an old 10BaseT card. Their existing hard disks
> have Windows XP installed, and when troubles happen, I also use my
> server to reinstall their XP using the unattended network install
> method from this site:
> http://unattended.sourceforge.net/
>
> None of the units have a PXE boot ROM built into their boards or NICs,
> which is why I use the Etherboot image from above.
>
> Here's a quick ASCII diagram of the network setup:
>
> { Internet } <--> [school router] <--> [main switch] <--> [mars server]
> <--> [lab switch] <--> {lab units}
>
> The server, which we called mars, functions not only as an LTSP server,
> but also as a stand-in firewall/router for the units. That way, while
> I'm still playing around with the LTSP stuff, there is no service
> interruption in the labs, which are still using virus-laden XPs for
> class :(
>
> In theory, all of the units seem to be working fine.
> Individually, I can get the units to boot into LTSP, with boot times
> from 1-3 minutes, and I can prove that their performance on the server
> is noticeably better than their local performance. (It's a miracle to
> me that we could get that happening on our junktop, but that's what my
> few beta-testing users say.) However, a problem occurs when I try to
> boot multiple units: the first four or five units boot up fine, but the
> sixth unit onward seems unable to even get an IP address. Etherboot
> loads, gets to the part where it asks for an IP, and keeps returning
> "No IP address" after that point. Under Windows, the units are normally
> able to get an IP address from the server, which is why I suspect the
> problem only arises after LTSP starts to boot. Anyway, I have to cover
> all points when troubleshooting this, because the owner of the school
> wants me to replicate the setup to another lab, and to another branch
> with 4 labs. So here's what I've covered so far:
>
> Unless I mention otherwise, testing commands are run by booting one lab
> unit into thin client mode and running local tests on the server.
>
> 0) server capacity
> My initial reaction was that the server was unable to handle the load.
> Because of that, I turned on the GNOME System Monitor application to
> track system stats while I boot up. Everything goes as expected: the
> RAM usage prior to logins is negligible; ditto for CPU usage. The
> network spikes up every time Etherboot starts downloading the file -
> also expected. However, after I boot up five units, I get _NO_ signs
> from the server that it is hitting peak usage - CPU, memory, hard disk
> use, and network are all under 20% - and I highly suspect that the
> server is in fact able to handle more than 5 units.
>
> At one point, to make sure that everything was fine, I booted up 20
> units one at a time, with perhaps 2-3 minute intervals between booting
> each one.
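[As a lighter-weight alternative to watching the GNOME applet, a snapshot
like the following can be logged before and after each boot batch. It is
only a sketch, reading straight from /proc, so it also works over ssh or
on a console with no X running:]

```shell
#!/bin/sh
# snapshot: print load averages and free memory on one line, read
# directly from /proc so it can run from any console or ssh session.
snapshot() {
    printf 'load: %s  ' "$(cut -d' ' -f1-3 /proc/loadavg)"
    awk '/^MemFree:/ { print "free:", $2, $3 }' /proc/meminfo
}

snapshot
```

[Run it once per batch and the numbers line up with the boot timeline,
which is easier to compare afterwards than a scrolling graph.]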
> I was able to get the entire laboratory up and running, but the start
> time - about an hour or more - was unacceptable in practice. I took it
> at face value that once the Ubuntu login screen appears, the unit is
> considered as good as working. The laboratory is meant to be used for
> basic office and internet classes, and one of the teachers has been
> eager to express interest in using the server for class. However, I've
> been telling him to hold off for a week now while I was hunting for my
> big bug.
>
> 1) endpoints and cabling
> I have a laptop that I used to test the network connections at every
> endpoint. It has a PXE ROM built in. My test was to plug in the laptop
> at each endpoint and boot to thin client from there. Every endpoint
> seems to work fine, so I cannot blame the cabling.
>
> 2) network capacity
> When all of the units are booted into Windows, they can browse the
> internet together. It would seem that LTSP doesn't abuse the network as
> badly as I thought it would. However, I have not rigorously tested
> network capacity as of yet.
>
> Since my problem appears every time I boot up about five or so units, I
> tried the following test: I boot up five units, then when Etherboot
> tries to boot but is unable to get an IP address, I remove that client
> from the network and connect my laptop in its place. My theory is that
> my laptop should be able to _acquire_ a network address at the very
> least. If it is able to, and is able to download the TFTP image, then
> there is nothing wrong with the network, and I can blame the client
> unit or the Etherboot OS. However, when I get to that point, I am
> unable to get an IP address on my laptop either, meaning the whole
> network is unable to handle it for some reason.
>
> This bugged me, so I checked up on the server to find out what the
> problem was.
> I ran the following commands to check whether the server was seeing the
> requests:
>
> tail -f /var/log/messages
> tail -f /var/log/daemon.log
> sudo tcpdump -ni eth2
>
> Also, I connected my laptop to the lab network and ran tcpdump on it.
> Since DHCP requests are broadcasts, even if I'm on a switch, I should
> still catch them.
>
> messages shows me what appear to be bootup messages from the thin
> clients. However, I noticed something strange: the messages kept coming
> even after I turned off the clients. It appears to me that writing to
> disk was significantly delayed.
>
> daemon.log and tcpdump show me the activity of the DHCP server.
> Strangely enough, even though Etherboot reports that it's not getting
> any DHCP offers ("No IP address"), the server clearly reports receiving
> the DHCP request and sending back a DHCP offer. This happens for _every
> unit_ after I hit my 4- or 5-client limit. I could also see from my
> laptop that the DHCP offers were being sent back - but for some reason,
> either Etherboot on the thin clients is not getting them, or Etherboot
> is rejecting them. After I hit my unit limit, no other unit can
> apparently get a DHCP offer - not even Windows logons.
>
> To isolate the issue, just in case some rogue unit was saturating my
> network, I shut down the entire lab and powered on the units one by
> one. First, to Windows: they can all use the internet and get IPs just
> fine. Then I shut down the entire lab again and power on one by one -
> this time to network-booted Linux, in batches of 3 or 4 - and the
> network stops handing out IP addresses after around the first batch.
> After a long wait, maybe 20 or so minutes, one batch of units will be
> able to get IPs again.
>
> 3) buggy software servers / setups
> The servers I think I need to monitor are:
> dhcp3-server
> tftpd-hpa
> openssh
>
> I don't know what settings to tweak or configure in dhcp3-server under
> /etc/ltsp to make it handle more clients, or how to optimize it.
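[One quick check on the dhcp3-server side is whether the lease database
itself stalls around the 4-5 client mark. Counting lease records before
and after a boot batch shows whether dhcpd is still committing leases to
disk; the path below is the usual Ubuntu 8.04 dhcp3-server location, so
adjust it if yours differs:]

```shell
#!/bin/sh
# count_leases: count "lease" records in dhcpd's lease database.
# /var/lib/dhcp3/dhcpd.leases is the Ubuntu 8.04 dhcp3-server default;
# pass another path as $1 to override. Prints 0 if the file is missing.
count_leases() {
    f=${1:-/var/lib/dhcp3/dhcpd.leases}
    [ -r "$f" ] || { echo 0; return; }
    grep -c '^lease ' "$f" || true
}
```

[If the count stops growing while tcpdump still shows offers leaving the
server, the stall is downstream of dhcpd rather than in it.]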
> As far as I'm concerned, dhcp3-server is the one with the problem, and
> I'm not even getting to the tftpd-hpa (downloading pxelinux.0...) part
> yet.
>
> I don't know what to check to determine the performance or throughput
> being used by nbd. I disabled NBD_SWAP in my lts.conf at
> /var/lib/tftpd/ltsp/i386/lts.conf, just to lower network utilization.
> But the root device is still nbd anyway. Would NFS give me better
> performance?
>
> I also noticed that running sudo commands took a very long time, so I
> read up on some stuff on ubuntuforums that seemed to point to it being
> a hostname-related issue. Sure enough, commands under su ran faster
> than under sudo. My hosts file contains the following:
>
> 127.0.0.1 localhost
> 127.0.1.1 mars.schoolname.local. mars
> 192.168.1.8 mars.schoolname.local. mars
> 192.168.11.254 mars.schoolname.local. mars
>
> To help out with my local network, I also installed bind9 on my server
> as a caching nameserver. I noticed, however, that bind9 _also stops
> responding after I've booted about 3-5 units_ - subsequent runs of
> "dig www.google.com @127.0.0.1" give me "no server could be contacted".
> I wasn't sure what was causing the symptom. I did, however, switch to
> powerdns, and that made the symptom disappear.
>
> bind9 was one of the programs being controlled by AppArmor. I don't
> know how to use AppArmor, but I've encountered SELinux woes before -
> and nearly every doc for every server I've read goes the lazy way and
> says "if SELinux causes problems, just disable it", rather than telling
> you how to configure SELinux to get the software to work. I assumed the
> same holds true for AppArmor, so I disabled AppArmor. My problem with
> bind9 didn't go away - it still refuses to respond to requests after I
> boot a few thin clients.
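[Slow sudo is classically a hostname-lookup stall, and the trailing dots
in those hosts entries are worth suspecting: glibc matches /etc/hosts
names by plain string comparison, so "mars.schoolname.local." is not the
same key as "mars.schoolname.local". A hedged cleanup - same addresses,
dots dropped - would be:]

```conf
127.0.0.1       localhost
127.0.1.1       mars.schoolname.local mars
192.168.1.8     mars.schoolname.local mars
192.168.11.254  mars.schoolname.local mars
```

[Whether this alone fixes the sudo delay is untested here; timing
"getent hosts mars" before and after is a quick way to confirm.]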
> I am assuming something about my thin client setup causes it to hang.
>
> I mean bind9 itself hangs, not my network. I am running dig from mars
> itself, talking to the loopback interface, and it does not respond even
> to itself. However, it does respond _before_ I boot up a few units, so
> it isn't a firewall issue. Speaking of firewalls, I installed shorewall
> on the server. It NATs the lab network (192.168.11.0/24) to the school
> network (192.168.1.0/24), and I set only weak default policies on
> shorewall. powerdns (pdns) does not hang in the same way that bind
> does.
>
> 4) disk performance
> I mentioned before that the following things "hang":
> a) DHCP requests don't get in
> b) bind9 doesn't answer queries
> c) sudo commands take a very long time
> d) clients don't boot for a long time
>
> I noticed that sometimes, when (c) resolves, it resolves at
> coincidentally the same time that (b) bind starts answering again,
> that (a) clients start getting IPs again, and that clients continue
> booting.
>
> I also noticed that sometimes the network graph from GNOME's system
> monitor applet very closely follows the hard disk usage graph. I
> forget how to monitor hard disk usage "by hand", though.
>
> I also remember that my system logs appear to be delayed: I appear to
> be getting CPU messages from a thin client even after I reset the
> unit.
>
> The three symptoms above lead me to think there is a hard disk caching
> problem: I suspect that the system is waiting for such-and-such
> process to finish before updating stuff on disk. Because there are
> numerous writes to disk, when the kernel starts waiting on
> input/output, random processes that wait for writes to finish - bind9,
> sudo logging, etc. - all hang or wait. (The first note was also the
> reason why I thought that I _had_ to fix bind. Interestingly, after I
> replaced bind with pdns, I was unable to observe my sudo problem any
> more.)
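[For monitoring disk write-back "by hand", the kernel's own counters are
enough. A sketch (Linux-only, values in kB) that samples how much written
data is still waiting to be flushed - a number that climbs and then
plateaus during a boot batch would support the write-back-stall theory:]

```shell
#!/bin/sh
# dirty_pages: show how much written data the kernel is still holding in
# memory (Dirty) and actively flushing to disk (Writeback), straight
# from /proc/meminfo. Run it in a loop (e.g. watch -n1) while booting
# clients to see whether write-back pressure tracks the hangs.
dirty_pages() {
    awk '/^(Dirty|Writeback):/ { print $1, $2, $3 }' /proc/meminfo
}

dirty_pages
```

[vmstat 1 and iostat -x 1 (from the sysstat package) give per-device
detail if they happen to be installed.]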
> This hard disk caching issue could be caused by a few things:
> a) a bad RAID setup
> b) nbd magic
> c) some unoptimized kernel scheduler option
> d) something I don't know
> e) maybe it isn't a hard disk caching issue after all
>
> a) bad RAID setup
> My understanding of RAID1 is that you get slightly worse write
> performance (which is fine by me) but nearly n-tupled read
> performance, where n is the number of disks. I have 2 disks. I don't
> see how it could be causing me any trouble; however, when I saw that
> syslog writes were being delayed, I worried a bit. I think it's a
> pretty sane thing to think it would help, but I don't know if it could
> be causing me write problems.
>
> Tomorrow, when I get back to my lab, I'll make sure that the hard
> disks are on different channels, although I'm sure I already told my
> assistant to put them on different channels, and I'm not sure if my
> hard disks are even fast enough to hit half the channel capacity (I
> haven't checked the disk speeds or the channel capacity). Is channel
> capacity an issue?
>
> I can determine if RAID is causing an issue by going into degraded
> mode. I'll mark all of one disk's partitions as faulty and remove
> them. I already did that for the root partition, but I haven't
> degraded my home partition just yet. So far, no noticeable change
> after degrading my disk.
>
> b) nbd magic
> nbd, to me, is magic. I don't know what to look for, or how to
> minimize its impact, or if it's even causing a noticeable impact. I
> just know that sometimes my network usage closely traces my hard disk
> usage. Sometimes it doesn't.
>
> c) kernel scheduler
> I'm not sure what to look up, but I read before that there is some
> kernel magic you can do that has to do with how it schedules disk
> writes. I'm not sure what the best choice for such an option would be,
> but I think that my current setting - where some services become
> completely unresponsive - is probably not the right one.
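[On the kernel-scheduler lead: the tunable in question is most likely the
block-layer I/O scheduler (the "elevator"). A sketch for inspecting it -
the bracketed name is the active one, and switching it is a runtime echo
with no reboot needed. Device names below are illustrative:]

```shell
#!/bin/sh
# show_schedulers: print the active I/O scheduler for each block device.
# The bracketed entry in the output is the one in use, e.g.
# "noop anticipatory deadline [cfq]" on a 2.6.24-era kernel.
show_schedulers() {
    for q in /sys/block/*/queue/scheduler; do
        [ -r "$q" ] || continue
        dev=${q#/sys/block/}
        printf '%s: %s\n' "${dev%%/*}" "$(cat "$q")"
    done
}

show_schedulers
# To try a different scheduler on, say, hda (hypothetical device name):
#   echo deadline | sudo tee /sys/block/hda/queue/scheduler
```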
> I'd have thought, though, that the default kernel from Ubuntu's LTSP
> install mode would use the right one.
>
> d) something else
> Help on this?
>
> e) it's not a disk caching issue at all
> Help on that?
>
> ===
> So anyway, time for a
> RESUMMARY:
> I am encountering network problems in my lab - which manifest as
> clients being unable to acquire network addresses - when starting up
> more than 4 thin clients, and I am unable to determine the cause.
>
> My leads so far:
>
> 0 - My server can't really handle it
> If so, I'd like to find out what resource, specifically, is causing
> the issue, so that I know what to buy.
> (And yeah, I'd like to get these guys to buy _at least 1 new unit_,
> but they're waiting on my responses.)
>
> 1 - My network can't handle it
> Switch? Cabling? Traffic? Bandwidth?
> This is suspect, because I'm sure other people have gotten LTSP to
> work on a 100 Mbps segment.
>
> 2 - Kernel scheduling settings
> Where do I look, and what does it look like? :)
>
> 3 - Etherboot is dumb
> Maybe Etherboot is just discarding DHCP offers because it isn't very
> smart. How much intelligence can you fit on a floppy disk anyhow?
> But this is suspect, because other clients besides Etherboot are also
> unable to get IP addresses.
>
> 4 - Faulty disks
> I guess I'll find out tomorrow after I alternate setting the fail
> option on the partitions. But I don't think it's that.
>
> 5 - Disk channel
> I'll make sure tomorrow that my RAID disks are on different channels.
> But I think they already are, and I think that even if they were on
> the same channel, the performance wouldn't be THAT bad.
>
> 6 - AppArmor stuff
> I think I've killed it, because I disabled it from my boot profile by
> changing its rcS.d script from S37 to K37. But what if I missed
> something? I dunno.
>
> 7 - dhcp3 bug
> This probably should have been the first one I listed. Is there a good
> replacement server for dhcp3 that I can test out?
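[On the dhcp3 replacement question: dnsmasq is a common lightweight
alternative that can take over DHCP, TFTP, and caching DNS in a single
daemon, which would also sidestep the bind9 hangs. A sketch of
/etc/dnsmasq.conf for the lab interface; the addresses mirror the
192.168.11.0/24 setup above, while the boot filename and TFTP root are
assumptions to adjust to the local install:]

```conf
# /etc/dnsmasq.conf -- DHCP + TFTP + caching DNS for the lab (eth2)
interface=eth2
# Pool covers all 20 clients with headroom; 8-hour leases:
dhcp-range=192.168.11.20,192.168.11.250,8h
# Boot file handed to network-booting clients:
dhcp-boot=/ltsp/i386/pxelinux.0
enable-tftp
# Adjust to wherever the LTSP boot files actually live:
tftp-root=/var/lib/tftpboot
```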
> 8 - something else
> You tell me :)
>
> ===
> If it sounds interesting enough to you, tell me if there's anything
> missing from the info and I'd gladly fill in. I've been bonking my
> head over this for the past few days now, and I can't keep "blaming
> the network" lol.

-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_____________________________________________________________________
Ltsp-discuss mailing list. To un-subscribe, or change prefs, goto:
https://lists.sourceforge.net/lists/listinfo/ltsp-discuss
For additional LTSP help, try #ltsp channel on irc.freenode.net
