Hi

I have built deal.II on the Nvidia Jetson Nano cluster from Picocluster.

It has 5 nodes (pc[0-4]); I am using pc0 as the head/login node to launch 
the application.

The executable is on an NFS filesystem that is shared between all the nodes.

The following works:

mpirun --host pc1 --mca btl_tcp_if_include 192.168.0.0/24 --mca btl tcp,self \
  /nfs/systems/dealii/head-bost_1_70_0/examples/step-69/step-69.release

However, attempting to run it on more than one host fails:

mpirun --host pc1,pc2 --mca btl_tcp_if_include 192.168.0.0/24 --mca btl tcp,self \
  /nfs/systems/dealii/head-bost_1_70_0/examples/step-69/step-69.release

It seems to fail consistently when writing the checkpoint file(s).

Are there special flags I need to set up for some form of parallel I/O 
that may be happening?
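
One thing that could be ruled out before looking at MPI-IO flags is whether every node can create files in the directory where the checkpoint is written. The following is only a minimal sketch of such a check: the function name check_writable is hypothetical, and the directory to test is an assumption, since I have not confirmed where step-69 writes its checkpoint (presumably the working directory of the job).

```shell
#!/bin/sh
# check_writable DIR (hypothetical helper, not part of deal.II or Open MPI)
# Tries to create and then delete a small probe file in DIR.
# Prints "<node>: writable DIR" on success,
# or "<node>: NOT writable DIR" and returns 1 on failure.
check_writable() {
    dir="$1"
    node="$(uname -n)"
    # Include the node name and PID so probes from different nodes
    # sharing the same NFS directory do not collide.
    probe="$dir/.probe.$node.$$"
    # The subshell keeps a failed redirection from aborting the caller.
    if ( : > "$probe" ) 2>/dev/null; then
        rm -f "$probe"
        echo "$node: writable $dir"
    else
        echo "$node: NOT writable $dir"
        return 1
    fi
}
```

Saved as a script that ends with a call such as check_writable "$PWD", this could be launched with the same mpirun --host pc1,pc2 options as above in place of the step-69 executable. If every node reports the directory as writable, plain file creation on the NFS share is probably not the cause and the failure is more likely specific to the MPI-IO layer.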

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

    ####################################################
    #########                                  #########
    #########      Cycle  000040  (0.5%)       #########
    #########      at time t = 0.01975410      #########
    #########                                  #########
    ####################################################


    ####################################################
    #########                                  #########
    #########      checkpoint computation      #########
    #########                                  #########
    #########                                  #########
    ####################################################

[pc1:07535] mca_sharedfp_individual_file_open: Error during datafile file open
[pc2:07602] mca_sharedfp_individual_file_open: Error during datafile file open


----------------------------------------------------
Exception on processing: 
---------------------------------------------------------
TimerOutput objects finalize timed values printed to the
screen by communicating over MPI in their destructors.
Since an exception is currently uncaught, this
synchronization (and subsequent output) will be skipped
to avoid a possible deadlock.
---------------------------------------------------------


----------------------------------------------------
Exception on processing: 

--------------------------------------------------------
An error occurred in line <1412> of file 
</home/picocluster/projects/dealii/dealii/dealii_git/source/distributed/tria_base.cc>
 
in function
    void dealii::parallel::DistributedTriangulationBase<dim, 
spacedim>::DataTransfer::save(unsigned int, unsigned int, const string&) 
const [with int dim = 2; int spacedim = 2; std::__cxx11::string = 
std::__cxx11::basic_string<char>]
The violated condition was: 
    ierr == MPI_SUCCESS
Additional information: 
    deal.II encountered an error while calling an MPI function.
    The description of the error provided by MPI is "MPI_ERR_FILE: invalid
    file".
    The numerical value of the original error code is 30.

Stacktrace:
-----------
#0  /nfs/systems/dealii/head-bost_1_70_0/lib/libdeal_II.so.10.0.0-pre: 
dealii::parallel::DistributedTriangulationBase<2, 
2>::DataTransfer::save(unsigned int, unsigned int, 
std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> > const&) const
#1  /nfs/systems/dealii/head-bost_1_70_0/lib/libdeal_II.so.10.0.0-pre: 
dealii::parallel::DistributedTriangulationBase<2, 
2>::save_attached_data(unsigned int, unsigned int, 
std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> > const&) const
#2  /nfs/systems/dealii/head-bost_1_70_0/lib/libdeal_II.so.10.0.0-pre: 
dealii::parallel::distributed::Triangulation<2, 
2>::save(std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> > const&) const
#3  /nfs/systems/dealii/head-bost_1_70_0/examples/step-69/step-69.release: 
Step69::MainLoop<2>::checkpoint(std::array<dealii::LinearAlgebra::distributed::Vector<double,
 
dealii::MemorySpace::Host>, 4ul> const&, std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> > const&, double, unsigned int)
#4  /nfs/systems/dealii/head-bost_1_70_0/examples/step-69/step-69.release: 
Step69::MainLoop<2>::run()
#5  /nfs/systems/dealii/head-bost_1_70_0/examples/step-69/step-69.release: 
main
--------------------------------------------------------

Aborting!
----------------------------------------------------

--------------------------------------------------------
An error occurred in line <1412> of file 
</home/picocluster/projects/dealii/dealii/dealii_git/source/distributed/tria_base.cc>
 
in function
    void dealii::parallel::DistributedTriangulationBase<dim, 
spacedim>::DataTransfer::save(unsigned int, unsigned int, const string&) 
const [with int dim = 2; int spacedim = 2; std::__cxx11::string = 
std::__cxx11::basic_string<char>]
The violated condition was: 
    ierr == MPI_SUCCESS
Additional information: 
    deal.II encountered an error while calling an MPI function.
    The description of the error provided by MPI is "MPI_ERR_FILE: invalid
    file".
    The numerical value of the original error code is 30.

Stacktrace:
-----------
#0  /nfs/systems/dealii/head-bost_1_70_0/lib/libdeal_II.so.10.0.0-pre: 
dealii::parallel::DistributedTriangulationBase<2, 
2>::DataTransfer::save(unsigned int, unsigned int, 
std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> > const&) const
#1  /nfs/systems/dealii/head-bost_1_70_0/lib/libdeal_II.so.10.0.0-pre: 
dealii::parallel::DistributedTriangulationBase<2, 
2>::save_attached_data(unsigned int, unsigned int, 
std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> > const&) const
#2  /nfs/systems/dealii/head-bost_1_70_0/lib/libdeal_II.so.10.0.0-pre: 
dealii::parallel::distributed::Triangulation<2, 
2>::save(std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> > const&) const
#3  /nfs/systems/dealii/head-bost_1_70_0/examples/step-69/step-69.release: 
Step69::MainLoop<2>::checkpoint(std::array<dealii::LinearAlgebra::distributed::Vector<double,
 
dealii::MemorySpace::Host>, 4ul> const&, std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> > const&, double, unsigned int)
#4  /nfs/systems/dealii/head-bost_1_70_0/examples/step-69/step-69.release: 
Step69::MainLoop<2>::run()
#5  /nfs/systems/dealii/head-bost_1_70_0/examples/step-69/step-69.release: 
main
--------------------------------------------------------

Aborting!
----------------------------------------------------
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, 
thus causing
the job to be terminated. The first process to do so was:

  Process name: [[8910,1],1]
  Exit code:    1
--------------------------------------------------------------------------
