First things first: I would isolate a segment of the network and start from a true one-server, one-client LTSP setup (Server --> crossover cable --> Client). Then put a switch between the two and add terminals gradually until you get the lockup. That makes it much easier to pin down the source of the lockups. And since you have 2 servers, you can do this on one of them while everyone else keeps using the present infrastructure. Depending on how long it usually takes for a lockup to occur, you should be able to isolate the source of the problem quickly and efficiently.
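The add-one-terminal-at-a-time procedure can be sketched in a few lines of Python. This is only an illustration of the bookkeeping: `find_first_bad_terminal` and the `locks_up` callback are hypothetical names, and `locks_up` stands in for a person running the given set of terminals for long enough and noting whether the server locked up.

```python
from typing import Callable, List, Optional, Sequence


def find_first_bad_terminal(
    terminals: Sequence[str],
    locks_up: Callable[[List[str]], bool],
) -> Optional[str]:
    """Attach terminals one at a time and return the one whose
    addition first coincides with a lockup, or None if none does.

    `locks_up` is a hypothetical callback: it receives the list of
    terminals currently attached and reports whether the server
    locked up during the observation period.
    """
    attached: List[str] = []
    for terminal in terminals:
        attached.append(terminal)      # grow the setup by one terminal
        if locks_up(list(attached)):   # observe this configuration
            return terminal
    return None


# Example with a made-up observation function: pretend the lockups
# only appear once the (hypothetical) terminal "ws3" is attached.
suspect = find_first_bad_terminal(
    ["ws1", "ws2", "ws3", "ws4"],
    lambda attached: "ws3" in attached,
)
```

Because the lockups are random, a clean run at one step does not prove that step is innocent, so each configuration should be left running for at least as long as it usually takes for a lockup to show up.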
Eric
On Tue, 2004-09-21 at 07:37, Ricardo Araújo wrote:
Well, a little update on the problem. I haven't yet been able to run memtest all the way through; I ran it once and no problems showed. I have also experimented with swapping memory modules (I have 2x512MB), letting the server run with only 512MB at a time, but the problem remains (random lockups).

The situation is now as follows: I decided to split the machines between two servers. What was once a backup server is now serving 4 clients that were previously on the main server. I was hoping that this would put less pressure on the servers and things would run smoothly. But now I have two servers locking up, in exactly the same way, though never at the same time.

So I'd better give an update on the topology of the network, as that might be important information after all. All 10 LTSP clients are connected through 100 Mb/s switches. On the same network there are about 7 machines still running Windows, without LTSP. A router handles DHCP for the Windows machines and provides Internet access for them and for the servers. Two servers run LTSP, one serving 6 clients and the other 4. Both provide DHCP for their own clients and get their IP addresses from the router (all fixed). One server also provides an intranet web interface.

Probably the main concern is that we have 3 DHCP servers running. The servers get their IP addresses from the router and the LTSP clients get theirs from the servers. No special configuration was done to make the DHCP servers respond only to certain machines: the router responds to everyone, and each LTSP server responds only to the machines it serves. At first I expected that to be a problem, since a client might sometimes get its IP address from the router and sometimes from its server, but somehow all clients get their addresses only from the servers, as they are supposed to. Even if they didn't, that should only cause the clients to malfunction, not the servers to lock up.

Anyway, the problem now is that BOTH servers lock up randomly.
They never lock up at the same time, which must say something about the nature of the problem. Also, I don't believe it is a hacker problem: I tried switching off the Internet connection and made sure no spurious connections were being made to the network (that is easy, since all the clients can be seen from a single room). Running "top" shows no difference in server load from what would be expected before the servers lock up. It is indeed a tricky problem. If both servers locked up at the same time, things would be easier. But the fact that they don't, and that the lockups seem completely random, is quite mysterious to me. Thanks for the help so far. I hope we can find a solution and that this problem doesn't remain one of those great LTSP mysteries... []s Ricardo.
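Since watching "top" by eye only shows the load while someone is looking, one way to check the load right before a lockup is to keep a small log on disk that survives the freeze. The following is a minimal sketch, not a tested diagnostic for this setup: the function names, the log path, and the 10-second interval are all assumptions chosen for illustration, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os
import time


def append_load_sample(path: str) -> str:
    """Append one timestamped load-average line to `path` and
    force it to disk, returning the line that was written."""
    one, five, fifteen = os.getloadavg()  # 1, 5 and 15 minute averages
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    line = f"{stamp} load={one:.2f} {five:.2f} {fifteen:.2f}\n"
    with open(path, "a") as log:
        log.write(line)
        log.flush()
        # fsync so the most recent sample is not lost in a hard lockup
        os.fsync(log.fileno())
    return line


def run(path: str = "/var/tmp/ltsp-load.log", interval_s: float = 10.0) -> None:
    """Sample the load forever; the path and interval are illustrative."""
    while True:
        append_load_sample(path)
        time.sleep(interval_s)
```

Starting `run()` on each server and reading the last few lines of the log after a reboot would show whether the load really was normal in the seconds before the freeze, and the last timestamp gives the approximate time of each lockup for comparing the two servers.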
-- Eric Thibodeau <[EMAIL PROTECTED]>
