Hi William,

Does the directory named in the error message (`savedFilePath`)
exist? If so, could you please retry after defining the
`STAMPEDE_LUSTRE_FIX` macro in src/util_misc.cpp? Add
`#define STAMPEDE_LUSTRE_FIX 1` near the top of the file (for example,
after all the includes) and rebuild. We saw a related issue on
Stampede, and I wonder whether what you are seeing is the same thing.
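
If it helps with the first question, here is a minimal standalone sketch
for checking the directory, ideally run on the same node where the job
runs. It is not DMTCP code: `checkDir` and `DirStatus` are hypothetical
names of mine, and it relies only on the POSIX `stat` and `access` calls.
Pass it the full path reported as `savedFilePath`.

```cpp
#include <string>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper, not part of DMTCP: classifies the state of `path`.
enum class DirStatus { StatFailed, NotADirectory, NotWritable, Ok };

DirStatus checkDir(const std::string &path) {
  struct stat st;
  // stat() fails if the path is missing or a parent is not reachable.
  if (stat(path.c_str(), &st) != 0) return DirStatus::StatFailed;
  // The path exists but is a regular file, device, etc.
  if (!S_ISDIR(st.st_mode)) return DirStatus::NotADirectory;
  // Creating entries inside a directory needs write and search permission.
  if (access(path.c_str(), W_OK | X_OK) != 0) return DirStatus::NotWritable;
  return DirStatus::Ok;
}
```

A result other than `DirStatus::Ok` would point at the path itself; if
it does return `Ok`, the macro above is the next thing to try.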

Thanks,
Rohan

On Tue, Apr 26, 2016 at 11:34:41AM -0700, William Fox wrote:
> I am running dmtcp on an HPC system and am hoping to run a distributed application
> across several nodes and checkpoint/restart it.
> 
> When I run with the --rm plugin in the torque environment the application
> runs into an error (I believe directly after dmtcp_launch --rm):
> [40000] ERROR at fileconnection.cpp:619 in preCkpt;
> REASON='JASSERT(Util::createDirectoryTree(savedFilePath)) failed'
>      savedFilePath = /oasis/scratch/<checkpoint_directory>
> Message: Unable to create directory in File Path
> python2.7 (40000): Terminating...
> 
> If I run without --rm then the host names are not adapted on a restart and
> the application fails.
> 
> Running a helloworld counting program that says the host files every 60
> seconds via openmpi runs smoothly with --rm and --infiniband plugins on the
> same system.  I have tried to track down similar errors in the forums but
> failed to find instances of distributed systems within the HPC
> environment.  The checkpoints occur in the same directory for both the
> openmpi and distributed application so it is not the folder permissions.
> 
> Any ideas?

> _______________________________________________
> Dmtcp-forum mailing list
> Dmtcp-forum@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
