Hi Kyle,
We usually test with the NAS benchmarks, which include Fortran code. However,
we have almost never used the path-virt plugin in those cases. Could you provide a
small test case that we can use to reproduce the bug locally? We can then
potentially add it to the test suite as well.
Having said that, I can certainly imagine problems in how
ckpt-open-files interacts with the path-virt plugin, since the plugin is
fairly recent and we have tested it only in very limited scenarios.
Kapil
On Thu, Apr 28, 2016 at 5:17 PM, Kyle Harrigan <kwharri...@gmail.com> wrote:
> How much, if any, testing have you guys done with programs compiled under
> gfortran? Specifically, I'm running gfortran 4.4.7 included in RHEL6/
> CentOS6.
>
> More specifically, I'm observing some bizarre behavior with file
> handling. I'm using --checkpoint-open-files, and also a version of the
> pathvirt plugin I've rebased onto 2.4.4 (
> https://github.com/kwharrigan/dmtcp/tree/pathvirt-v4-244). The offending
> path is not located in an area I am virtualizing.
>
> On restart I've got some code doing I/O processing in Fortran. The
> specific case is an OPEN on an existing file (status=old). Before this
> is done, an INQUIRE check is done to make sure the file actually exists on
> the filesystem, which it does. I've verified this on the operating system
> separately while running gdb stopped right at the OPEN line.
>
> The OPEN call ends up throwing an error code 5004, which is
> LIBERROR_ALREADY_OPEN.
>
> This particular error indicates that the file is already open under a
> different Fortran unit number (Fortran does not allow the same file to be
> open under two unit numbers at once). Note that prior to restart, I see no
> evidence of this filename actually being used in an open call anywhere.
> When the error is thrown, I call FNUM to get the POSIX file descriptor, and
> while in the debugger I look in /proc/<pid>/fd for the restarted process.
> The filename corresponding to the fd does *not* match the filename that
> triggers the 5004; it is a totally different filename. Confusion ensues.
>
> So in summary, I've got an OPEN call telling me a filename is already open
> under some unit number. I pull the fd for that unit, and the filename
> corresponding to that fd in /proc/<pid>/fd does not match the filename I'm
> attempting to open.
>
> I'm not convinced this is an error in dmtcp, though I'm wondering if
> something like the filename is being remapped behind the scenes, as some of
> what I am observing makes absolutely no sense.
>
> A tour through libgfortran/io for the adventurous (
> https://gcc.gnu.org/viewcvs/gcc/trunk/libgfortran/io/)
>
> The error is thrown in open.c around line 493. This means that
> "find_file," which looks through the treap storing all the unit data, found
> a matching unit by way of its file descriptor.
>
> See line 1215 of unix.c in the same folder. This is what actually
> determines if there is already a unit with that filename. It is using
> fstat and comparing device and inode number. Perhaps in the restarted
> process we can't trust these checks anymore? Maybe something hasn't been
> properly virtualized?
>
> Any thoughts would be appreciated. I'll keep digging through the weeds.
>
>
> --
> -Kyle
>
>
> ------------------------------------------------------------------------------
> Find and fix application performance issues faster with Applications
> Manager
> Applications Manager provides deep performance insights into multiple
> tiers of
> your business applications. It resolves application problems quickly and
> reduces your MTTR. Get your free trial!
> https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
> _______________________________________________
> Dmtcp-forum mailing list
> Dmtcp-forum@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>
>
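The device/inode comparison described in the quoted message can be sketched
in C roughly as follows. This is only a minimal illustration of that kind of
check, not code taken from libgfortran or DMTCP. The helper names
same_file_as_fd and fd_target are hypothetical, and fd_target assumes a Linux
/proc filesystem.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if `path` and the open descriptor `fd`
 * refer to the same (device, inode) pair, 0 if they differ, and -1 if
 * either stat call fails. */
int same_file_as_fd(const char *path, int fd)
{
    struct stat by_name, by_fd;

    assert(path != NULL);
    if (stat(path, &by_name) != 0 || fstat(fd, &by_fd) != 0)
        return -1;
    return by_name.st_dev == by_fd.st_dev &&
           by_name.st_ino == by_fd.st_ino;
}

/* Hypothetical helper: copies the name the kernel reports for `fd` (the
 * target of the /proc/self/fd/N symlink) into buf, truncating if needed.
 * Returns 0 on success and -1 on failure. Linux-specific. */
int fd_target(int fd, char *buf, size_t buflen)
{
    char link[64];
    ssize_t n;

    if (buf == NULL || buflen == 0)
        return -1;
    snprintf(link, sizeof link, "/proc/self/fd/%d", fd);
    n = readlink(link, buf, buflen - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';   /* readlink does not NUL-terminate */
    return 0;
}
```

Calling same_file_as_fd with the filename passed to OPEN and the descriptor
returned by FNUM, both before checkpoint and after restart, would show whether
the (device, inode) pair matches even though fd_target reports a different
name for that descriptor.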