Erwan Velu <erwanalia...@gmail.com> writes:
> [...]
>
> Hey Eric,
>
> Thanks a lot for all your input.
>
> I'll try to answer all your questions.
>
> 1°) Why do we need a particular behavior
> My "cluster" is in a very particular network & power industrial environment.
> Those low-power nodes are, yes, booted at the same time and they do have to
> load some content from a central server.
I was thinking from my experience with high performance computing.  Your
case is very similar, just with lower-end hardware.  That changes the
assumptions about what is available a little bit, but I don't think it
should change the on-the-wire behavior.

> The network bandwidth is really low (100mbit) and each system has to
> load some content to boot as we are running diskless at that point.

You speak of 100mbit and you speak of collisions.  Is this 100mbit going
through a switch?  I don't recall many ways 100mbit can actually be a
shared fabric; I think all of that was back in 10mbit type connections.
There is a bit of a buffering advantage to going through a real switch,
but I don't expect it makes much of a real-world difference when every
machine is talking to the same server.

> The last point is our "main" server is very, very light, so I'd like to
> not load it too much.

At least for dhcp the load should not be much, but let me play with some
numbers.  A 1500-byte frame on the wire takes roughly 1538 bytes, what
with preamble, ethernet header, checksum and interframe gap.  At 100mbit
you can get roughly 8127 full-sized packets per second, and in practice
you should be sending noticeably smaller packets.  For a dhcp transaction
you need 4 packets: a request, a reply, and a bidirectional ack.  For 720
clients that is 2880 packets, well within the 8k you can send per second.

To keep server load down (as in load average) you might want multicast or
an event-oriented server instead of a process/thread-per-client model,
but otherwise I really don't see a problem.  A 100mbit stream is tiny.

> 2°) Why I'm using a 30sec delay
> While computing a random sleep to avoid collisions, I have to ensure
> that few systems aren't in the same time-stamp.

Having systems with exactly the same time-stamp is a problem if that is
your only input to a random number generator, but beyond that I don't see
why having the same time-stamp is a concern.  Any reasonable protocol
should have additional differentiators besides time.

> Let's say a random 15s delay with 720 systems booting at the same time:
> if the random generator produces values that are too close together
> (let's imagine an average distance of 7sec between systems), we'll face
> the collision problems.

Collision problems?  I would think a few systems doing exponential
backoff in the face of collisions or dropped packets will give you the
delay you are looking for, naturally taking things out of lockstep
without the assumption that everyone is in lockstep at the beginning of
time.

> So my first guess is 30sec is enough to avoid too many systems trying
> to download stuff at the same time.
> This value will surely be changed as we gain experience in our
> environment.

Honestly I think that is a silly way to look at things.

> 3°) Adding more randomization
> Agreed, we have some improvements to do in the random() call, and I
> think it would be pretty effective to use part of the MAC address to
> generate the seed.
> What do you think about using the cmos time too?

In the situations I have dealt with it is the cmos time that is in
lockstep, because of clocks being synchronized with ntp.  So while I
don't think mixing the cmos time in is wrong, I also don't think it is
particularly interesting.
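For what it's worth, the sort of seeding I have in mind is roughly the
following.  This is only a sketch using plain libc calls, not gpxe's
actual random code, and the mac[] array just stands in for however the
driver hands you the hardware address:

	/*
	 * Sketch only: fold the low three octets of the MAC address
	 * (the per-device part) into the PRNG seed, so that machines
	 * whose clocks are in lockstep still draw different sequences.
	 */
	#include <stdint.h>
	#include <stdlib.h>
	#include <time.h>

	void seed_random_from_mac(const uint8_t mac[6])
	{
		unsigned int seed = (unsigned int) time(NULL);

		/* The vendor prefix (mac[0..2]) is the same for every NIC
		 * in the batch; the device part (mac[3..5]) differs. */
		seed ^= ((unsigned int) mac[3] << 16) |
			((unsigned int) mac[4] << 8)  |
			 (unsigned int) mac[5];

		srand(seed);
	}

With the per-device octets mixed in, two machines whose clocks are in
perfect lockstep still end up with different random sequences.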
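And to be concrete about the exponential backoff I mentioned above, the
behaviour I would expect from a retry timer looks roughly like this.
Again just a sketch with made-up constants, not any existing gpxe code:

	/*
	 * Sketch only: on each failed attempt the base timeout doubles
	 * up to a cap, and a random jitter of +/- 25% is added so that
	 * two clients which collided once do not back off in step.
	 */
	#include <stdio.h>
	#include <stdlib.h>

	#define MIN_TIMEOUT_MS  1000
	#define MAX_RETRY_SHIFT 4	/* cap the doubling at 16 seconds */

	unsigned int backoff_timeout_ms(unsigned int retries)
	{
		unsigned int timeout, jitter;

		if (retries > MAX_RETRY_SHIFT)
			retries = MAX_RETRY_SHIFT;
		timeout = MIN_TIMEOUT_MS << retries; /* 1s, 2s, 4s, 8s, 16s */

		/* jitter uniformly in [-timeout/4, +timeout/4] */
		jitter = rand() % (timeout / 2 + 1);
		return timeout - timeout / 4 + jitter;
	}

	int main(void)
	{
		unsigned int i;

		srand(42);	/* in real life, seed from the MAC as above */
		for (i = 0; i < 5; i++)
			printf("retry %u: wait %u ms\n", i, backoff_timeout_ms(i));
		return 0;
	}

The +/- 25% spread means two clients that collided once are unlikely to
retransmit in step on the next attempt, which is the de-synchronization
the fixed 30sec sleep is trying to buy up front.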
> 4°) GPXE integration
> I perfectly understand my need isn't common at all and doesn't have to
> be integrated as-is in gPXE.
> That said, this patch would have been made with a default value of
> MAX_RANDOM_SLEEP_TIME set to 0 to disable this behavior.

The initial delay is not common, and I would argue that the initial delay
is almost certainly unnecessary and a little bit wrong.  It is just a
case of inserting a magic delay somewhere and hoping that makes things
work.  That is almost always the sign of a bug, and of impending trouble
when the world acts differently than your delay assumes.

The problem of congestion is common, and all of the protocols are
specified for what happens in a congested network.  This looks to me like
all that is needed is simply bug-fixing of the congestion handling rather
than any special handling.  With that bug-fixing everyone will benefit.

It looks to me like there is a bug in src/net/retry.c, because it does
not add any jitter when it is performing exponential backoff.  This
leaves the possibility that two or more machines could cause packet drops
and collisions by backing off exactly in step.

Other than not introducing jitter in the backoff, it looks like the
current gpxe implementation should handle what you are doing just fine.

Eric