On Fri, Apr 17, 2009 at 7:01 PM, ron minnich <rminn...@gmail.com> wrote:
> On Fri, Apr 17, 2009 at 3:35 PM, J.R. Mauro <jrm8...@gmail.com> wrote:
>
>> Amen. Linux is currently having a seriously hard time getting C/R
>> working properly, just because of the issues you mention. The second
>> you mix in non-local resources, things get pear-shaped.
>
> it's not just non-local. It's local too.
>
> you are on a node. you open /etc/hosts. You C/R to another node with
> /etc/hosts open. What's that mean?
>
> You are on a node. you open a file in a ramdisk. Other programs have
> it open too. You are watching each other's writes. You C/R to another
> node with the file open. What's that mean?
>
> You are on a node. You have a pipe to a process on that node. You C/R
> to another node. Are you still talking at the end?
>
> And on and on. It's quite easy to get this stuff wrong. But true C/R
> requires that you get it right. The only system that would get this
> stuff mostly right that I ever used was Condor. (and, well the Apollo
> I think got it too, but that was a ways back).
>
> ron
>
>

Yeah, the problem's bigger than I thought (not surprising since I
didn't think much about it). I'm having a hard time figuring out how
Condor handles these issues. All I can see from the documentation is
that it gives you warnings.

I can imagine that a lot of the problems stemming from open files
could be resolved by first attempting to import the process's
namespace as it was at checkpoint time and, if that fails, falling
back to cached copies of the files made at checkpoint, which could be
merged back later.
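
A minimal sketch of that fallback order, in Python only for brevity.
Everything named here is hypothetical: checkpoint_open_files,
restore_open, and the import_namespace callable are stand-ins, not
anything Condor or Plan 9 provides. It assumes import_namespace
returns a usable local path or raises OSError, and it leaves the
"merge later" step out entirely.

```python
import os
import shutil


def checkpoint_open_files(paths, cache_dir):
    """At checkpoint time, copy each open file into cache_dir.

    Returns a dict mapping the original path to its cached copy.
    """
    os.makedirs(cache_dir, exist_ok=True)
    cached = {}
    for i, path in enumerate(paths):
        # Prefix with an index so two files with the same basename
        # do not overwrite each other in the cache.
        dest = os.path.join(cache_dir, f"{i}-{os.path.basename(path)}")
        shutil.copy2(path, dest)
        cached[path] = dest
    return cached


def restore_open(path, cached, import_namespace):
    """On restart, reopen path, preferring the original namespace.

    import_namespace is a hypothetical callable: given the original
    path, it returns a local path through which the original file is
    reachable, or raises OSError if the old node cannot be imported.
    Returns (file object, "imported" or "cached").
    """
    try:
        return open(import_namespace(path), "r+b"), "imported"
    except OSError:
        # Import failed: fall back to the copy taken at checkpoint.
        return open(cached[path], "r+b"), "cached"
```

The second return value records which route was taken, so that writes
made against a cached copy can be found again for a later merge. It
does nothing for shared ramdisk files or pipes, where a private copy
changes what the program observes.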

But this still has the 90% problem you mentioned.
