I talked with a guy who is doing parallel filesystem work, and according to him 80% of all filesystem operations when running an HPC job are for checkpointing (not that much for restart). I just don't see how checkpointing can scale, knowing how bad the parallel filesystems are.
Lucho

On Fri, Apr 17, 2009 at 4:15 PM, ron minnich <rminn...@gmail.com> wrote:
> if you want to look at checkpointing, it's worth going back to look at
> Condor, because they made it really work. There are a few interesting
> issues that you need to get right. You can't make it 50% of the way
> there; that's not useful. You have to hit all the bits -- open /tmp
> files, sockets, all of it. It's easy to get about 90% of it but the
> last bits are a real headache. Nothing that's come along since has
> really done the job (although various efforts claim to, you have to
> read the fine print).
>
> ron