On Tue, Jun 7, 2011 at 10:37 AM, George Bosilca <bosi...@eecs.utk.edu> wrote:

> On Jun 7, 2011, at 12:14, Ralph Castain wrote:
>
> > But the epoch is process-unique - i.e., it is the number of times that
> > this specific process has been started, which differs per proc since we
> > don't restart all the procs every time one fails.
>
> Yes the epoch is per process, but it is distributed among all participants.
> The difficulty here is to make sure the global view of the processes
> converges toward a common value of the epoch for each process.

Sounds racy...is it actually necessary to have a global agreement on epoch?
Per my other note, perhaps we really need a primer on this epoch concept.

> > So if I look at the epoch of the proc sending me a message, I really
> > can't check it against my own value as the comparison is meaningless. All I
> > really can do is check to see if it changed from the last time I heard from
> > that proc, which would tell me that the proc has been restarted in the
> > interim.
>
> I fail to understand your statement here. However, comparing message epoch
> is critical to ensure the correct behavior. It ensures we do not react on
> old messages (that were floating in the system for some obscure reasons),
> and that we have the right contact information for a specific peer (on the
> correct epoch).

Again, maybe we need a better understanding of what you mean by epoch -
clearly, there is misunderstanding of what you are proposing to do. I'm
leery of anything that requires a general consensus as it creates a lot of
race conditions - might work under certain circumstances, but we've been
burned by that approach too many times.

> george.
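
To make the purely local check concrete, here is a minimal sketch in C of comparing a message's epoch against the last epoch heard from that same peer, with no global agreement involved. It is only an illustration under stated assumptions: epochs are taken to start at 1 and grow by one per restart, 0 is used as "never heard from", and every name in it (epoch_table_t, check_epoch, MAX_PEERS, the status values) is hypothetical and not taken from the Open MPI code base.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PEERS 8  /* hypothetical table size, for illustration only */

/* Possible outcomes of checking an incoming message (hypothetical names). */
typedef enum {
    MSG_CURRENT,        /* epoch matches what we last saw, or first contact */
    MSG_STALE,          /* message comes from an older incarnation: drop it */
    MSG_PEER_RESTARTED  /* epoch advanced: the peer was restarted meanwhile */
} msg_status_t;

/* Last epoch heard from each peer; 0 means "never heard from"
 * (assumes real epochs start at 1). */
typedef struct {
    uint32_t last_epoch[MAX_PEERS];
} epoch_table_t;

/* Compare a message's epoch against the last epoch seen from that same
 * peer. Only local state is consulted; the receiver's own epoch is never
 * part of the comparison. */
static msg_status_t check_epoch(epoch_table_t *t, int peer, uint32_t msg_epoch)
{
    assert(peer >= 0 && peer < MAX_PEERS);

    uint32_t known = t->last_epoch[peer];

    if (msg_epoch < known) {
        return MSG_STALE;              /* old message still floating around */
    }
    if (msg_epoch > known) {
        t->last_epoch[peer] = msg_epoch;
        /* First contact is not reported as a restart. */
        return known == 0 ? MSG_CURRENT : MSG_PEER_RESTARTED;
    }
    return MSG_CURRENT;
}
```

With a zero-initialized table, a first message from peer 3 at epoch 1 is reported as current, a later one at epoch 2 as a restart, and a late arrival at epoch 1 after that as stale. Whether this local view is sufficient, or whether all processes also need to converge on a common epoch per peer, is exactly the open question in this thread.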