Kapil,
Thanks.
I'll attempt a run tomorrow without pathvirt. Probably should have done this
already...but I didn't suspect an issue given the problematic paths are not in
the list of paths to be virtualized. Of course, we never suspect issues where
they are often actually hiding. :-)
Also, I really need pathvirt so I've probably been biased.
I'll be in touch.
-Kyle
> On Apr 28, 2016, at 7:53 PM, Kapil Arya <kapil.arya...@gmail.com> wrote:
>
> Hi Kyle,
>
> We usually test with NAS benchmarks that include fortran code. However, we
> almost never used the path-virt plugin in those cases. Could you provide a
> small test case that we can use to reproduce the bug locally? We can then
> potentially add it to the test suite as well.
>
> Having said that, I can certainly imagine issues in how ckpt-open-files
> interacts with the path-virt plugin, since the path-virt plugin is fairly
> recent and we have tested it only in very limited scenarios.
>
> Kapil
>
>
>> On Thu, Apr 28, 2016 at 5:17 PM, Kyle Harrigan <kwharri...@gmail.com> wrote:
>> How much, if any, testing have you guys done with programs compiled under
>> gfortran? Specifically, I'm running gfortran 4.4.7 included in RHEL6/
>> CentOS6.
>>
>> More specifically, I'm observing some bizarre behavior with file handling.
>> I'm using --checkpoint-open-files, and also a version of the pathvirt plugin
>> I've rebased onto 2.4.4
>> (https://github.com/kwharrigan/dmtcp/tree/pathvirt-v4-244). The offending
>> path is not located in an area I am virtualizing.
>>
>> On restart I've got some code doing I/O processing in Fortran. The specific
>> case is an OPEN on an existing file (status=old). Before this is done, an
>> INQUIRE check is done to make sure the file actually exists on the
>> filesystem, which it does. I've verified this on the operating system
>> separately while gdb was stopped right at the OPEN line.
>>
>> The OPEN call ends up throwing an error code 5004, which is
>> LIBERROR_ALREADY_OPEN.
>>
>> This particular error indicates that the file is already open under a
>> different Fortran unit number (it is illegal to open the same file with a
>> different unit number under Fortran). Note that prior to restart, I see no
>> evidence of this filename actually being used in an OPEN call anywhere.
>> When it is thrown, I do an FNUM to get the POSIX file descriptor, and while
>> in the debugger I look in /proc/<pid>/fd for the restarted process. The
>> filename corresponding to the fd does *not* match the filename which
>> triggers the 5004; it is a totally different filename. Confusion ensues.
>>
>> So in summary, I've got an OPEN call telling me a filename is already open
>> with some unit number, I pull the fd for that unit, and the filename
>> corresponding to that fd in /proc/<pid>/fd does not match the filename I'm
>> attempting to open.
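
For anyone who wants to repeat the descriptor-to-filename check from inside the process rather than from a shell, here is a minimal sketch in C. It assumes Linux procfs, and the helper names fd_to_path and fd_names_path are made up for illustration; they are not part of DMTCP or libgfortran.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Resolve an open descriptor of the calling process to the path the kernel
 * reports for it. Returns 0 on success, -1 on failure or possible truncation.
 * fd_to_path is a hypothetical helper name used only in this sketch. */
int fd_to_path(int fd, char *buf, size_t len)
{
    char link[64];
    ssize_t n;

    assert(buf != NULL && len > 1);
    snprintf(link, sizeof link, "/proc/self/fd/%d", fd);
    n = readlink(link, buf, len - 1);   /* readlink does not NUL-terminate */
    if (n < 0 || (size_t)n >= len - 1)  /* error, or the result may be cut off */
        return -1;
    buf[n] = '\0';
    return 0;
}

/* Returns 1 if the kernel-reported path for fd equals path, 0 if it does not,
 * -1 if the descriptor could not be resolved. This is a plain string
 * comparison, so symlinks and relative paths are not normalized. */
int fd_names_path(int fd, const char *path)
{
    char resolved[4096];

    assert(path != NULL);
    if (fd_to_path(fd, resolved, sizeof resolved) != 0)
        return -1;
    return strcmp(resolved, path) == 0;
}
```

Calling fd_names_path with the descriptor returned by FNUM and the filename passed to OPEN would reproduce the mismatch described above, assuming the Fortran side hands the descriptor to a C routine.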
>>
>> I'm not convinced this is an error in dmtcp, though I'm wondering if
>> something like the filename is being remapped behind the scenes, as some of
>> what I am observing makes absolutely no sense.
>>
>> A tour through libgfortran/io for the adventurous
>> (https://gcc.gnu.org/viewcvs/gcc/trunk/libgfortran/io/)
>>
>> The error is thrown in open.c around line 493. This means "find_file,"
>> which looks through the treap storing all the unit data, found a unit
>> already associated with this file.
>>
>> See line 1215 of unix.c in the same folder. This is what actually
>> determines if there is already a unit with that filename. It is using fstat
>> and comparing device and inode number. Perhaps in the restarted process we
>> can't trust these checks anymore? Maybe something hasn't been properly
>> virtualized?
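
The device-and-inode comparison described above can be tried in isolation. The following is a rough sketch in C, not the libgfortran code itself; same_file and print_identity are hypothetical names chosen for this example. Running it on the suspect descriptor and the path being opened, before checkpoint and again after restart, would show whether the two really do share an identity.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Returns 1 if fd and path refer to the same file (same device and inode),
 * 0 if they differ, -1 if either stat call fails.
 * same_file is a hypothetical helper name used only in this sketch. */
int same_file(int fd, const char *path)
{
    struct stat by_fd, by_path;

    assert(path != NULL);
    if (fstat(fd, &by_fd) != 0 || stat(path, &by_path) != 0)
        return -1;
    return by_fd.st_dev == by_path.st_dev && by_fd.st_ino == by_path.st_ino;
}

/* Print the (device, inode) pair for a descriptor so the values can be
 * compared by eye across a checkpoint and restart. Returns 0 on success. */
int print_identity(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    printf("fd %d: dev=%llu ino=%llu\n", fd,
           (unsigned long long)st.st_dev, (unsigned long long)st.st_ino);
    return 0;
}
```

If same_file reports a match for two files that have different names on disk, that would point at the identity check rather than at the filename handling.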
>>
>> Any thoughts would be appreciated. I'll keep digging through the weeds.
>>
>>
>> --
>> -Kyle
>>
>> ------------------------------------------------------------------------------
>> Find and fix application performance issues faster with Applications Manager
>> Applications Manager provides deep performance insights into multiple tiers of
>> your business applications. It resolves application problems quickly and
>> reduces your MTTR. Get your free trial!
>> https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
>> _______________________________________________
>> Dmtcp-forum mailing list
>> Dmtcp-forum@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>