Hello ltsp-discuss list!

I'm setting up a thin client laboratory for a computer school in the
Philippines, in the Visayas region, and I'm encountering some difficulties
with getting the clients to boot up.

Problem statement:
I have a lab of 20 units booting off of a lower-middle class desktop
converted into an Ubuntu LTSP server with increased RAM and disk space. The
server is able to boot each of the clients individually without a hitch,
with individual boot times of 1-3 minutes in isolation. However, once 4
or 5 clients are up, the rest of the lab is unable to start up or even get
IP addresses. Also, when booting several units at a time, the boot process
sometimes hangs for 3-5 minutes before reaching the login screen.

Details:
I am running Ubuntu 8.04 LTS (Hardy Heron), and using it to boot up an
entire lab of second-hand, beat-up Pentium III units. My server specs are
as follows:

Server:
processor: Pentium 4, 2.4 GHz, 512 KB cache
RAM: 2 GB DDR SDRAM
disks: 2x 80 GB PATA hard disks -> software RAID
md0 = 10 GB root partition
2 GB swap per hard disk
md1 = 55 GB home partition
NICs: 2x 100 Mb/s Ethernet
eth1 - 192.168.1.8 (facing the school network)
eth2 - 192.168.11.254 (facing the lab network [thin clients])

As you may have noticed, it's not a brand new server, but rather one of the
units that was lying around, which I happened to notice and say "hey, that
could totally work." The only parts added for this job are the RAM and the
hard disks, and even the disks are second hand. The disks run a software
RAID1 setup, because I was under the impression that RAID1 could
theoretically roughly double my disk read speeds.

I did a fresh install of Ubuntu 8.04.1 on the server. Ubuntu's installer has
an LTSP server install mode, which puts LTSP5 on the server and creates the
boot files for the clients. The clients themselves boot from a universal
multi-driver Etherboot diskette image called eb_on_hd, which I downloaded
from this site:
http://etherboot.anadex.de/

It's really neat. It also works for unattended (the Windows network-install
system mentioned below).

As for the clients, this laboratory consists of 20 refurbished Pentium III
units. The most RAM any of them has is 128 MB; some have only 96 MB.
About 19 of them have Fast Ethernet (100 Mb/s) cards, although one has an
old 10BaseT card. Their existing hard disks have Windows XP installed, and
when trouble strikes, I also use my server to reinstall their XP via the
unattended network-install method from this site:
http://unattended.sourceforge.net/

None of the units have a PXE boot ROM built into their boards or NICs, which
is why I use the Etherboot image mentioned above.

Here's a quick ASCII diagram of the network setup:

{ Internet } <--> [school router] <--> [main switch] <--> [mars server] <-->
[lab switch] <--> {lab units}

The server, which we called mars, functions not only as an LTSP server, but
also as a stand-in firewall router for the units. That way, while I'm still
playing around with the LTSP stuff, there is no service interruption in the
labs, which are still using virus-laden XPs for class :(

Individually, all of the units seem to work fine. I can get each one to boot
into LTSP, with boot times of 1-3 minutes, and their performance on the
server is noticeably better than their local performance (it's a miracle to
me that we could get that happening on our junktop, but that's what my few
beta-testing users say). However, a problem occurs when I try to boot
multiple units: the first four or five boot up fine, but the sixth unit
onwards seems unable to even get an IP address. Etherboot loads, gets to the
part where it asks for an IP, and from that point keeps returning "No IP
address". Under Windows, the same units are normally able to get an IP
address from the server, which is why I suspect the problem only arises once
LTSP starts booting. Anyway, I have to cover all the bases when
troubleshooting this, because the owner of the school wants me to replicate
the setup to another lab, and to another branch with 4 labs. So here's what
I've covered so far:

Unless I mention otherwise, testing commands are run by booting one lab unit
into thin client mode, and running local tests on the server.

0) server capacity
My initial reaction was that the server is unable to handle the load.
Because of that, I turned on the GNOME system monitor application to track
system stats while I boot up. Everything goes as expected: the RAM usage
prior to logins is negligible. Ditto for CPU usage. The network spikes up
every time etherboot starts downloading the file - also expected. However,
after I boot up five units, I get _NO_ signs from the server that it is
hitting peak usage - CPU, memory, hard disk use, network are all under 20% -
and I highly suspect that the server is in fact able to handle more than 5
units.
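
In case it helps, the same stats can be watched from a terminal with
standard tools (just a sketch, nothing LTSP-specific):

uptime                          # load average; a value stuck well above 1-2 on a single-CPU box hints at processes blocked on I/O
free -m                         # RAM vs. cache usage, in MB
watch -n 2 cat /proc/net/dev    # raw per-interface byte counters, for a rough feel of eth2 throughput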

At one point, to make sure that everything was fine, I booted up all 20
units one at a time, with 2-3 minute intervals between each. I was able to
get the entire laboratory up and running, but the start-up time - about an
hour or more - is unacceptable for day-to-day use. I took it at face value
that once the Ubuntu login screen appears, the unit is as good as working.
The laboratory is meant to be used for basic office and internet classes,
and one of the teachers has been eager to use the server for class. However,
I've been telling him to hold off for a week now while I hunt for my big bug.

1) endpoints and cabling.
I have a laptop with a built-in PXE ROM that I used to test the network
connection at every endpoint: plug the laptop in at that endpoint and boot
it as a thin client from there. Every endpoint seems to work fine, so I
cannot blame the cabling.

2) network capacity
When all of the units are booted into Windows, they can browse the internet
together, so it would seem that LTSP doesn't abuse the network as badly as I
expected. However, I have not rigorously tested network capacity yet.
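
If I want to test raw throughput properly, one option would be iperf,
assuming I can install the package on the server and on my laptop (not on
the thin clients, obviously):

iperf -s                  # on mars
iperf -c 192.168.11.254   # on the laptop, plugged into the lab switch

That would at least tell me whether the lab segment can actually sustain
something close to 100 Mb/s.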

Since my problem appears every time I boot up about five or so units, I
tried the following test: I boot up five units, and when the next client's
Etherboot fails to get an IP address, I remove that client from the network
and connect my laptop in its place. My theory is that my laptop should at
the very least be able to _acquire_ a network address. If it can, and can
download the tftp image, then there is nothing wrong with the network, and I
can blame the client unit or the Etherboot image. However, when I get to
that point I am unable to get an IP address on my laptop either, meaning the
whole network is unable to handle it for some reason.

This bugged me, so I checked up on the server to find out what the problem
was. I ran the following commands to check whether the server was receiving
the requests:
tail -f /var/log/messages
tail -f /var/log/daemon.log
sudo tcpdump -ni eth2

Also, I connected my laptop to the lab network, and ran tcpdump on it. Since
DHCP requests are broadcasts, even if I'm on a switch, I should still catch
them.
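
To cut the capture down to just the DHCP/BOOTP traffic (ports 67 and 68),
something like this works on both mars and the laptop (adjust the interface
name on the laptop):

sudo tcpdump -ni eth2 -e port 67 or port 68

The -e flag also prints MAC addresses, which makes it easier to check
whether the offer for a given client's MAC actually goes out on the wire.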

/var/log/messages shows what appear to be boot-up messages from the thin
clients. However, I noticed something strange: the messages kept coming in
even after I had turned the clients off. It looks to me like writing to disk
is being significantly delayed.

daemon.log and tcpdump show me the activity of the DHCP server. Strangely
enough, even though Etherboot reports that it's not getting any DHCP offers
("No IP address"), the server clearly logs receiving the DHCP request and
sending back a DHCP offer. This happens for _every unit_ after I hit my 4-
or 5-client limit. I could also see from my laptop that the DHCP offers are
being sent back - but for some reason Etherboot on the thin clients either
isn't receiving them or is rejecting them. After I hit my unit limit, no
other unit can apparently make use of a DHCP offer, not even units booting
into Windows.

To isolate the issue, just in case some rogue unit was saturating my
network, I shut down the entire lab and powered the units on one by one.
First into Windows: they can all use the internet and get IPs just fine.
Then I shut down the entire lab again and powered them on in batches of 3 or
4 - this time into network-booted Linux - and the network stops handing out
IP addresses after roughly the first batch. After a long wait, maybe 20 or
so minutes, one more batch of units is able to get IPs again.

3) buggy software servers / setups
The servers I think I need to monitor are:
dhcp3-server
tftpd-hpa
openssh

I don't know what settings to tweak in dhcp3-server's config under /etc/ltsp
to make it handle more clients or otherwise optimize it. As far as I can
tell, dhcp3-server is the one with the problem, since I'm not even getting
to the tftpd-hpa (downloading pxelinux.0...) part yet.
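
For reference, the LTSP dhcp config on Hardy lives in /etc/ltsp/dhcpd.conf,
and the stock file looks roughly like this once adjusted to my lab subnet
(a sketch from memory, not a copy of my actual file, so treat the details as
approximate):

subnet 192.168.11.0 netmask 255.255.255.0 {
    range 192.168.11.20 192.168.11.250;
    option subnet-mask 255.255.255.0;
    option broadcast-address 192.168.11.255;
    option routers 192.168.11.254;
    option root-path "/opt/ltsp/i386";
    filename "/ltsp/i386/pxelinux.0";
}

A /24 range like that shouldn't run out of leases at 5 clients, which is
another reason I doubt the config itself is the limit.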

I don't know what to check to determine the performance or throughput being
used by NBD. I disabled NBD_SWAP in my lts.conf at
/var/lib/tftpd/ltsp/i386/lts.conf, just to reduce network utilization. But
the root device is still NBD anyway. Would NFS give me better performance?
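
The closest I've come to "measuring" NBD is just watching the TCP
connections from the lab subnet on the server - a rough sketch with standard
tools, since I don't know NBD's port numbers offhand:

sudo netstat -tnp | grep 192.168.11.    # established connections from the thin clients, with the owning process
sudo iftop -i eth2                      # live per-connection bandwidth on the lab NIC (needs the iftop package)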

I also noticed that running sudo commands took a very long time, so I read
up on some threads on ubuntuforums that seemed to point to a
hostname-resolution issue. Sure enough, commands run under su were faster
than under sudo. My hosts file contains the following:
127.0.0.1 localhost
127.0.1.1 mars.schoolname.local. mars
192.168.1.8 mars.schoolname.local. mars
192.168.11.254 mars.schoolname.local. mars
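
A quick test I can run while things are hung, to see whether it's really
name lookups that are slow (just standard tools):

time getent hosts mars.schoolname.local   # goes through the normal resolver path, /etc/hosts first
time sudo id                              # once the password is cached, this mostly measures sudo's own lookup/logging overhead

If getent comes back instantly but sudo still takes ages, that would point
away from hostnames and back toward the disk/logging theory.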

To help out with my local network, I also installed bind9 on my server as a
caching nameserver. I noticed, however, that bind9 _also_ stops responding
after I've booted about 3-5 units - subsequent runs of "dig www.google.com.
@127.0.0.1" tell me no server could be contacted. I wasn't sure what was
causing the symptom, but switching to powerdns made it disappear.

bind9 was one of the programs being confined by apparmor. I don't know how
to use apparmor, but I've encountered selinux woes before - and nearly every
doc for every server I've read takes the lazy way out and says "if selinux
causes problems, just disable it", rather than telling you how to configure
selinux to get the software to work. I assumed the same holds true for
apparmor, so I disabled apparmor. My problem with bind9 didn't go away - it
still refuses to respond to requests after I boot a few thin clients. I'm
assuming something about my thin client setup causes it to hang.
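
For what it's worth, here's how I plan to double-check that apparmor is
really out of the picture (apparmor_status ships with the apparmor package
on Hardy, if I remember right):

sudo apparmor_status    # lists profiles still loaded, and whether they're in enforce or complain mode
ls /etc/apparmor.d/     # the bind profile, usr.sbin.named, lives here if installed

Profiles that were already loaded stay loaded until they're unloaded or the
machine reboots, so disabling the init script only takes effect on the next
boot.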

To be clear, bind9 itself hangs, not my network. I am running dig from mars
itself, talking to the loopback interface, and it is not responding even to
itself. It does respond _before_ I boot up a few units, though, so it isn't
a firewall issue. Speaking of firewalls, I installed shorewall on the
server. It NATs the lab network (192.168.11.0/24) to the school network
(192.168.1.0/24), and I set only weak default policies on shorewall.
powerdns (pdns) does not hang in the same way that bind does.

4) disk performance
I mentioned before that the following things "hang":
a) DHCP requests don't get through
b) bind9 doesn't answer queries
c) sudo commands take a very long time
d) clients don't boot for a long time

I noticed that when (c) resolves, it often does so at exactly the same time
that (b) bind starts answering again, (a) clients start getting IPs again,
and (d) clients continue booting.

I also noticed that sometimes the network graph from GNOME's system monitor
applet very closely follows the hard disk usage graph. I forget how to
monitor hard disk usage "by hand", though.
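
For monitoring the disks "by hand", this is what I intend to try (standard
tools; iostat comes from the sysstat package):

vmstat 2            # "wa" column = % of CPU time stuck waiting on I/O, "bo" = blocks written out per interval
iostat -x 2         # per-disk utilisation and average wait times
cat /proc/mdstat    # quick check that the md arrays aren't quietly resyncing in the background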

I also remember that my system logs appear to be delayed: I appear to be
getting CPU messages from a thin client even after I reset the unit.

The three symptoms above lead me to think there is a disk write/caching
problem: I suspect that the system is waiting on some process to finish
before updating things on disk. Because there are numerous writes to disk,
when the kernel starts waiting on I/O, any process that waits for a write to
finish - bind9, sudo's logging, etc. - hangs or stalls. (This was also the
reason why I thought I _had_ to fix bind. Interestingly, after I replaced
bind with pdns, I was unable to observe my sudo problem any more.)

This hard disk caching issue can be caused by a few things:
a) a bad RAID setup
b) nbd magic stuffs
c) some unoptimized kernel scheduler option
d) something I don't know
e) maybe it isn't some hard disk caching issue after all

a) bad RAID setup
My understanding of RAID1 is that you get slightly worse write performance
(which is fine by me) but nearly n-times the read performance, where n is
the number of disks; I have 2 disks. I don't see how it could be causing me
trouble, but when I saw that syslog writes were being delayed, I worried a
bit. I think it's a pretty sane thing to expect it to help, but I don't know
whether it could be causing me write problems.

Tomorrow, when I get back to my lab, I'll make sure that the hard disks are
on different IDE channels, although I'm fairly sure I already told my
assistant to put them on different channels, and I'm not sure my hard disks
are even fast enough to hit half the channel capacity (I haven't checked the
disk speeds or the channel capacity). Is channel capacity an issue?
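
To get a rough number for the disk speeds themselves, hdparm should do (a
quick sketch; use /dev/hda and /dev/hdb instead if the old IDE driver names
are in use):

sudo hdparm -tT /dev/sda    # -T = cached (memory) read speed, -t = buffered disk read speed
sudo hdparm -tT /dev/sdb

If both disks read at, say, 40-60 MB/s, then even sharing one ATA/100
channel (about 100 MB/s) shouldn't be a killer on its own.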

I can determine whether RAID is causing an issue by going into degraded
mode: mark all of one disk's partitions as faulty and remove them from the
arrays. I already did that for the root partition, but I haven't degraded my
home partition just yet. So far, no noticeable change after degrading the
root array.
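
For the record, the degrade test itself is just mdadm's fail/remove.
Sketching it for the home array, assuming the second disk's member partition
is /dev/sdb3 - adjust to whatever mdadm --detail actually shows:

sudo mdadm --detail /dev/md1                              # confirm which partitions are in the array
sudo mdadm /dev/md1 --fail /dev/sdb3 --remove /dev/sdb3
cat /proc/mdstat                                          # md1 should now show as degraded, e.g. [U_]

Re-adding the partition later with "mdadm /dev/md1 --add /dev/sdb3" triggers
a resync, which itself hammers the disks, so that's better done after hours.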

b) nbd magic stuffs
nbd, to me, is magic. I don't know what to look for or how to minimize its
impact, or if it's even causing a noticeable impact. I just know that
sometimes, my network usage closely traces my hard disk usage. Sometimes it
doesn't.

c) kernel scheduler something
I'm not sure what to look up, but I've read that there is some kernel tuning
you can do around how it schedules disk writes (the I/O scheduler). I'm not
sure what the best choice would be, but I think that my current setting -
under which some services become completely unresponsive - is probably not
the right one. I'd have thought, though, that the default kernel from
Ubuntu's LTSP install mode would already use a sensible one.
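
In case it helps anyone point me in the right direction: the per-disk I/O
scheduler can be read and changed on the fly through sysfs (a sketch using
sda as an example; Hardy's default is cfq):

cat /sys/block/sda/queue/scheduler                      # prints something like: noop anticipatory deadline [cfq]
echo deadline | sudo tee /sys/block/sda/queue/scheduler

There are also the vm.dirty_ratio / vm.dirty_background_ratio sysctls, which
control how much dirty data the kernel buffers before forcing writeback -
that sounds related to my "everything stalls on writes" symptom:

sysctl vm.dirty_ratio vm.dirty_background_ratio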

d) something else
help on this?

e) it's not a disk caching issue at all
help on that?
===
So anyway, time for a
RESUMMARY:
I am encountering network problems in my lab - manifesting as clients being
unable to acquire network addresses - whenever I start up more than 4 thin
clients, and I am unable to determine the cause.

my leads so far:
0 - My server can't really handle it
If so, I'd like to find out what resource, specifically, is causing the
issue, so that I know what to buy
(and yeah I'd like to get these guys to buy _at least 1 new unit_ but
they're waiting on my responses)

1 - My network can't handle it
switch? cabling? traffic? bandwidth?
This lead is suspect, though, because I'm sure other people have gotten LTSP
to work on a 100 Mb/s segment.

2 - Kernel Scheduling Settings
where do I look and what does it look like :)

3 - Etherboot is dumb
maybe Etherboot is just discarding the DHCP offers because it isn't very
smart. How much intelligence can you fit on a floppy disk anyhow?
But this lead is suspect too, because other clients besides Etherboot are
also unable to get IP addresses.

4 - faulty disks
I guess I'll find out tomorrow after I alternate setting the fail option on
the partitions. But I don't think it's that.

5 - disk channel
I'll make sure tomorrow that my RAID disks are on different channels. But I
think they already are, and I think that even if they were on the same
channel, the performance wouldn't be THAT bad.

6 - apparmor stuff
I think I've killed it: I disabled it at boot by renaming its rcS.d script
from S37 to K37. But what if I missed something? I dunno

7 - dhcp3 bug
Probably should have been the first lead I listed. Is there a good
replacement for dhcp3-server that I can test out? (See the dnsmasq sketch
after this list.)

8 - something else
you tell me :)
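
On lead 7: the replacement I keep seeing recommended is dnsmasq, which can
do DHCP, TFTP and caching DNS in one daemon. A hypothetical minimal config
for my lab subnet would look something like this (untested on my setup, so
treat it as a sketch; point tftp-root at wherever tftpd-hpa's root actually
is on this box):

# /etc/dnsmasq.conf
interface=eth2
dhcp-range=192.168.11.20,192.168.11.250,12h
dhcp-boot=/ltsp/i386/pxelinux.0
enable-tftp
tftp-root=/var/lib/tftpboot

I'd have to stop dhcp3-server and tftpd-hpa first so they don't fight over
ports 67 and 69.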

===
If it sounds interesting enough to you, tell me if there's anything missing
from the info and I'll gladly fill it in. I've been bonking my head against
this for the past few days now and I can't keep "blaming the network" lol.