On Tue, Jun 7, 2011 at 5:21 PM, Dimitar Pachov <[email protected]> wrote:
> Hello,
>
> Just a quick update after a few short tests we (my colleague and I) quickly
> did. First, using
>
> "*You can emulate this yourself by calling "sleep 10s" before mdrun and
> see if that's long enough to solve the latency issue in your case.*"
>
> doesn't work, for a few reasons: mainly because it doesn't seem to be a
> latency issue, but also because the load on a node is not affected by "sleep".
>
> However, you can reproduce the behavior I have observed pretty easily. It
> seems to be related to the values of the pointers to the *xtc, *trr, *edr,
> etc. files written at the end of the checkpoint file after abrupt crashes,
> AND to the frequency of access (opening) of those files. How to test:
>
> 1. In your input *mdp file set a high frequency of saving coordinates to,
> say, the *xtc file (every 10 steps, for example) and a low frequency for the
> *trr file (every 10,000 steps, for example).
> 2. Run GROMACS (mdrun -s run.tpr -v -cpi -deffnm run).
> 3. Kill the run abruptly shortly after that (say, after 10-100 steps).
> 4. You should have a few frames written in the *xtc file, and only one frame
> (the first) in the *trr file. The *cpt file should have non-zero values of
> "file_offset_low" for all of these files (the pointers have been updated).
> 5. Restart GROMACS (mdrun -s run.tpr -v -cpi -deffnm run).
> 6. Kill the run abruptly shortly after that (say, after 10-100 steps). Note
> that the write interval for the *trr file has not been reached.
> 7. You should have a few additional frames written in the *xtc file, while
> the *trr file will still have only one frame (the first). The *cpt file has
> again updated all the "file_offset_low" pointer values, BUT the pointer to
> the *trr file has acquired a value of 0. Obviously, we already know what
> will happen if we restart again from this last *cpt file.
> 8. Restart GROMACS (mdrun -s run.tpr -v -cpi -deffnm run).
> 9. Kill it.
> 10. The *trr file now has size zero.
>
> Therefore, if a run is killed before the files are accessed for writing
> (which depends on the chosen frequencies), the file offset values recorded
> in the *cpt file do not seem to be updated accordingly, and hence a new
> restart inevitably leads to overwritten output files.
>
> Do you think this is fixable?

Thanks a lot for searching for a reproducible case.

What file-system and operating system are you using? If it is a network
file-system, can you reproduce it on a non-network file-system? If not, what
is the OS on the client and on the server, and what are the network
file-system and the underlying file-system on the server?

Thanks,
Roland
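A rough way to script the reproduction steps quoted above (a sketch only: it
assumes a prepared run.tpr built from an *mdp with nstxtcout = 10 and
nstxout = 10000, a serial/threaded mdrun, a gmxdump that supports the -cp
option, and that ~30 s of wall time corresponds to the 10-100 steps mentioned;
adjust for your system):

=========================
#!/bin/sh
# sketch: reproduce the offset problem from steps 1-10 above

run_and_kill() {
    mdrun -s run.tpr -cpi -deffnm run -v &
    pid=$!
    sleep 30          # enough for a few xtc writes, well short of a trr write
    kill -9 $pid      # abrupt kill, as in steps 3, 6 and 9
    wait $pid 2>/dev/null
}

run_and_kill
gmxdump -cp run.cpt | grep file_offset   # expect non-zero offsets (step 4)

run_and_kill
gmxdump -cp run.cpt | grep file_offset   # trr offset reportedly drops to 0 (step 7)

run_and_kill
ls -l run.trr                            # trr truncated to size zero (step 10)
=========================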
> On Sun, Jun 5, 2011 at 6:20 PM, Roland Schulz <[email protected]> wrote:
>
>> Two comments about the discussion:
>>
>> 1) I agree that buffered output (kernel buffers - not application buffers)
>> should not affect I/O. If it does, it should be filed as a bug against the
>> OS. Maybe someone can write a short test application which tries to
>> reproduce this idea: writing to a file from one node and, immediately after
>> the test program is killed on that node, writing to it from some other node.
>>
>> 2) We lock files, but only the log file. The idea is that we only need to
>> guarantee that the set of files is only accessed by one application. This
>> seems safe, but in case someone sees a way in which the trajectory could be
>> opened without the log file being opened, please file a bug.
>>
>> Roland
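On point 1, a very crude test along those lines might look like the following
(node names, the shared path, and the timings are placeholders; it assumes a
shared/network file-system mounted on both nodes and passwordless ssh):

=========================
#!/bin/sh
# one node writes and is killed abruptly; a second node then writes right away
FILE=/shared/scratch/buffer_test.dat     # placeholder path on the shared FS
NODE_A=node01                            # placeholder node names
NODE_B=node02

# writer on node A: append records with no explicit fsync, then kill -9 it
# (the count only needs to be large enough to outlive the sleep)
ssh "$NODE_A" "seq 1 100000000 >> $FILE & WPID=\$!; sleep 5; kill -9 \$WPID"

# immediately afterwards, node B appends a marker and reports the size it sees
ssh "$NODE_B" "echo MARKER-FROM-B >> $FILE; stat -c %s $FILE"

# back on node A: same size? is the marker at the end, or did records from
# the killed writer get lost around it?
ssh "$NODE_A" "stat -c %s $FILE; tail -c 64 $FILE"
=========================

And on point 2, the scheme described there (one advisory lock on the log file
guarding the whole set of output files) is roughly the shell equivalent of
(illustration only, not the actual GROMACS code):

=========================
(
  flock -n 9 || { echo "run.log already locked by another mdrun" >&2; exit 1; }
  # ... only now is it safe to open/append run.xtc, run.trr, run.edr ...
) 9>>run.log
=========================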
>> On Sun, Jun 5, 2011 at 10:13 AM, Mark Abraham <[email protected]> wrote:
>>
>>> On 5/06/2011 11:08 PM, Francesco Oteri wrote:
>>>
>>> Dear Dimitar,
>>> I'm following the debate regarding:
>>>
>>> The point was not "why" I was getting the restarts, but the fact itself
>>> that I was getting restarts close in time, as I stated in my first post.
>>> I actually also don't know whether jobs are deleted or suspended. I've
>>> thought that a job returned back to the queue will basically start from
>>> the beginning when later moved to an empty slot ... so I don't understand
>>> the difference from that perspective.
>>>
>>> In the second mail you say:
>>>
>>> Submitted by:
>>> ========================
>>> ii=1
>>> ifmpi="mpirun -np $NSLOTS"
>>> --------
>>> if [ ! -f run${ii}-i.tpr ]; then
>>>   cp run${ii}.tpr run${ii}-i.tpr
>>>   tpbconv -s run${ii}-i.tpr -until 200000 -o run${ii}.tpr
>>> fi
>>>
>>> k=`ls md-${ii}*.out | wc -l`
>>> outfile="md-${ii}-$k.out"
>>> if [[ -f run${ii}.cpt ]]; then
>>>   $ifmpi `which mdrun` -s run${ii}.tpr -cpi run${ii}.cpt -v \
>>>     -deffnm run${ii} -npme 0 > $outfile 2>&1
>>> fi
>>> =========================
>>>
>>> If I understand correctly, you are submitting the SERIAL mdrun. This means
>>> that multiple instances of mdrun are running at the same time. Each
>>> instance of mdrun is an INDEPENDENT instance. Therefore checkpoint files,
>>> one for each instance (i.e. one for each CPU), are written at the same
>>> time.
>>>
>>> Good thought, but Dimitar's stdout excerpts from early in the thread do
>>> indicate the presence of multiple execution threads. Dynamic load
>>> balancing gets turned on, and the DD is 4x2x1 for his 8 processors.
>>> Conventionally, and by default in the installation process, the
>>> MPI-enabled binaries get an "_mpi" suffix, but it isn't enforced - or
>>> enforceable :-)
>>>
>>> Mark
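For what it's worth, a quick way to check which binary such a submit script
actually resolves, and whether the resulting run really was parallel, might
be something like this (the log-file name and the exact wording of the DD
line are assumptions based on a 4.5.x log):

=========================
# which mdrun does the script pick up, and is there an MPI-suffixed build?
which mdrun
which mdrun_mpi 2>/dev/null || echo "no mdrun_mpi in PATH"

# a parallel run prints a domain decomposition grid (e.g. 4 x 2 x 1) in its
# log file; a serial run does not
grep -m1 "Domain decomposition grid" run1.log
=========================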
> --
> =====================================================
> *Dimitar V Pachov*
>
> PhD Physics
> Postdoctoral Fellow
> HHMI & Biochemistry Department     Phone: (781) 736-2326
> Brandeis University, MS 057        Email: [email protected]
> =====================================================

--
ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
865-241-1537, ORNL PO BOX 2008 MS6309

