Hi Eliot,

Sorry for the delay in replying. I was hoping to get the bug fix done in
time for the recent DMTCP release (2.4.0-rc1), but I ran out of time.
You can see the developers' current progress here:
  https://github.com/dmtcp/dmtcp/issues/40
It is best to start reading from the last comment and skip the early ones.
I'm still sorting out the best way to implement the proposal discussed
at the end of issue #40 above.

Best,
- Gene

On Tue, Mar 17, 2015 at 10:37:15PM -0400, Eliot Moss wrote:
> On 3/16/2015 10:36 PM, Eliot Moss wrote:
>
> > The error output is:
> >
> > [45000] ERROR at sysvipc.cpp:775 in postRestart; REASON='JASSERT(_realId != -1) failed'
> >      (strerror((*__errno_location ()))) = File exists
> > java (45000): Terminating...
>
> The previously reported issue printed:
>
> > [42000] ERROR at sysvipc.cpp:775 in postRestart; REASON='JASSERT(_realId != -1) failed'
> >      (strerror((*__errno_location ()))) = No such file or directory
> > java (42000): Terminating...
>
> So both are at the same line of code. They do not have to do with
> files, per se, but with semaphores and shared memory segments. I
> noticed that the protocol on restart mentions a node-wide file. That
> may explain why I can avoid the 'File exists' case by running on
> another node, and also why, in the 'No such file' case, I can solve
> the bad behavior by running on the same node. Of course this assumes
> that the file in question persists somewhere. (Where? Has to do with
> PROTECTED_LIFEBOAT_FD.)
>
> Well, that is as far as I got today picking the code apart. Hope this
> helps. I feel stuck now in that I get one failure or the other, though
> not on the same runs. I don't feel confident to fire off the 10,000 or
> so jobs I am waiting to execute, so can't progress well until I
> resolve this ...
>
> Regards -- Eliot
>
> _______________________________________________
> Dmtcp-forum mailing list
> Dmtcp-forum@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
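
A note on the two strerror() strings in the quoted logs: "File exists" and "No such file or directory" are the standard texts for EEXIST and ENOENT, and the Linux man pages for shmget(2) and semget(2) document both as errors about System V IPC keys rather than files. The thread does not say which call fails at sysvipc.cpp:775, so treating it as a shmget/semget-style lookup is an assumption here. The small Python sketch below only maps the log text back to the errno name and the man-page meaning; the helper name explain_sysv_error is made up for illustration and is not part of DMTCP.

```python
import errno
import os

# Meanings taken from the Linux shmget(2)/semget(2) man pages.
# Assumption: the call failing at sysvipc.cpp:775 is one of these "get"
# calls; the thread itself does not confirm which call it is.
SYSV_GET_ERRORS = {
    errno.EEXIST: "IPC_CREAT | IPC_EXCL was requested, but an object "
                  "with that key already exists",
    errno.ENOENT: "no object exists for that key and IPC_CREAT was "
                  "not requested",
}


def explain_sysv_error(message):
    """Map a strerror() string from the log to (errno name, meaning).

    Hypothetical helper for illustration only. Returns None when the
    message is not one of the two errors discussed in this thread.
    """
    for code, meaning in SYSV_GET_ERRORS.items():
        if os.strerror(code) == message:
            return errno.errorcode[code], meaning
    return None


if __name__ == "__main__":
    for text in ("File exists", "No such file or directory"):
        print(text, "->", explain_sysv_error(text))
```

Read this way, the first log would mean the restart tried to create an IPC object whose key was already present on that node, and the second would mean it looked up a key that was not there. That is consistent with the node-dependent behavior described in the quoted message, but it is an interpretation, not a confirmed diagnosis.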