How much, if any, testing have you guys done with programs compiled under
gfortran? Specifically, I'm running gfortran 4.4.7 included in RHEL6/
CentOS6.

More specifically, I'm observing some bizarre behavior with file handling.
I'm using --checkpoint-open-files, and also a version of the pathvirt
plugin I've rebased onto 2.4.4 (
https://github.com/kwharrigan/dmtcp/tree/pathvirt-v4-244).  The offending
path is not located in an area I am virtualizing.

On restart I've got some code doing I/O processing in Fortran.  The specific
case is an OPEN on an existing file (status='old').  Before this is done,
an INQUIRE check is done to make sure the file actually exists on the
filesystem, which it does.  I've verified this separately at the operating
system level while gdb was stopped right at the OPEN line.

The OPEN call ends up throwing an error code 5004, which is
LIBERROR_ALREADY_OPEN.

This particular error indicates that the file is already open under a
different Fortran unit number (opening the same file under a second unit
number is illegal in Fortran).  Note that prior to restart, I see no
evidence of this filename actually being used in an OPEN call anywhere.
When the error is thrown, I call FNUM to get the POSIX file descriptor for
that unit, and while in the debugger I look in /proc/<pid>/fd for the
restarted process.  The filename corresponding to the fd does *not* match
the filename which triggers the 5004; it is a totally different filename.
Confusion ensues.
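
For anyone who wants to repeat the fd-to-filename lookup from code rather
than from a shell, here is a minimal C sketch.  It assumes Linux with /proc
mounted.  The function names fd_to_path and kernel_name_for are made up for
this example and are not part of libgfortran or dmtcp.  It resolves a
descriptor of the calling process through /proc/self/fd; when inspecting
another process from the debugger, the equivalent link is
/proc/<pid>/fd/<fd>.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy the name the kernel reports for `fd` into
 * `buf`.  Returns 0 on success, -1 on failure.  Linux-specific. */
int fd_to_path(int fd, char *buf, size_t len)
{
    char link[64];
    ssize_t n;

    assert(buf != NULL);
    if (len == 0)
        return -1;

    snprintf(link, sizeof link, "/proc/self/fd/%d", fd);

    /* readlink() does not NUL-terminate, so leave room and do it here. */
    n = readlink(link, buf, len - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return 0;
}

/* Hypothetical helper: open `path` read-only and report the name the
 * kernel associates with the resulting descriptor. */
int kernel_name_for(const char *path, char *buf, size_t len)
{
    int fd = open(path, O_RDONLY);
    int rc;

    if (fd < 0)
        return -1;
    rc = fd_to_path(fd, buf, len);
    close(fd);
    return rc;
}
```

The name reported this way is whatever the kernel recorded when the
descriptor was opened, so it can legitimately differ from the string a
program later passes to OPEN.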

So in summary, I've got an OPEN call telling me a filename is already open
under some unit number, but when I pull the fd for that unit, the filename
corresponding to that fd in /proc/<pid>/fd does not match the filename I'm
attempting to open.

I'm not convinced this is an error in dmtcp, though I'm wondering if
something like the filename is being remapped behind the scenes, as some of
what I am observing makes absolutely no sense.

A tour through libgfortran/io for the adventurous (
https://gcc.gnu.org/viewcvs/gcc/trunk/libgfortran/io/)

The error is thrown in open.c around line 493.  This means that "find_file,"
which looks through the treap storing all the unit data and examines each
unit's file descriptor, found a match.

See line 1215 of unix.c in the same folder.  This is what actually
determines if there is already a unit with that filename.  It is using
fstat and comparing device and inode number.  Perhaps in the restarted
process we can't trust these checks anymore?  Maybe something hasn't been
properly virtualized?
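
To make the device/inode comparison concrete, here is a minimal C sketch of
that kind of check.  It is not the libgfortran code.  The function names
path_matches_fd and looks_already_open are made up for this example.  It
stats the path, fstats the descriptor, and compares st_dev and st_ino.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if `path` and the open descriptor `fd` refer to the same
 * file (same device and inode), 0 if they differ, and -1 if either
 * stat call fails. */
int path_matches_fd(const char *path, int fd)
{
    struct stat by_name, by_fd;

    assert(path != NULL);
    if (stat(path, &by_name) != 0 || fstat(fd, &by_fd) != 0)
        return -1;

    return by_name.st_dev == by_fd.st_dev &&
           by_name.st_ino == by_fd.st_ino;
}

/* Hypothetical helper: hold `held` open read-only, then ask whether
 * `candidate` would be treated as the same file under this check. */
int looks_already_open(const char *candidate, const char *held)
{
    int fd = open(held, O_RDONLY);
    int result;

    if (fd < 0)
        return -1;
    result = path_matches_fd(candidate, fd);
    close(fd);
    return result;
}
```

A check of this shape never looks at names, so two different path strings
compare equal whenever they resolve to the same (st_dev, st_ino) pair; for
example, looks_already_open("/tmp", "/tmp/.") returns 1.  Whether something
like that is what happens in the restarted process is exactly the open
question.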

Any thoughts would be appreciated.  I'll keep digging through the weeds.


-- 
-Kyle
_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
