Dear All,

I have just created a public Git repository containing the part of the code that causes the problem:
https://bitbucket.org/FatemehChe/reaction_diffusion/src/master/

In short, this code is part of a PDE-constrained optimization problem, and this function is called at each iteration of an iterative solver. As I mentioned before, it randomly fails with the error "Segmentation Violation, probably memory access out of range". To reproduce the error, I run this function 10k times; you can see the error in the file called result/mono.

I am able to run it on my local machine (macOS, 2.5 GHz Intel Core i7) with 4 cores without any problem, but whenever I run the project on the cluster with more cores, it gives me that error after some iterations.

To run the project on the cluster:
- make
- sh benchmark

Thanks in advance.

Best regards,
Fatemeh

>
> ---------- Forwarded message ---------
> From: John Peterson <jwpeter...@gmail.com>
> Date: Tue, Nov 13, 2018 at 4:07 PM
> Subject: Re: [Libmesh]: In parallel error :Segmentation Violation, probably memory access out of range
> To: Fatemeh Chegini Salzmann <chegini.fat...@gmail.com>
>
>
> Hi Fatemeh,
>
> Two requests:
>
> 1.) please provide your entire application code if possible, we can't debug code snippets
> 2.) please send help requests like this to libmesh-us...@lists.sf.net or open an issue on GitHub. That way you will be more likely to get help from people with more time and/or insight than me.
>
> --
> John
>
>
> On Tue, Nov 13, 2018 at 3:38 AM Fatemeh Chegini Salzmann <chegini.fat...@gmail.com> wrote:
> Dear John,
>
> I have a couple of transient PDEs that need to be solved at each time iteration, where the solution of one PDE is used to solve the next PDE.
> In short, the time loop looks like this:
> // ----------------------------------------------------------------------
> system_gateVariable.time = 0;
> system_diffusion.time = 0;
> system_reaction.time = 0;
> unsigned int t_step;
> for (t_step=1; t_step<(T); t_step++)
> {
>   // Advance the time of all three systems.
>   system_gateVariable.time += dt_T;
>   system_diffusion.time += dt_T;
>   system_reaction.time += dt_T;
>
>   // Store the current solutions as the old solutions for this step.
>   *system_gateVariable.old_local_solution = *system_gateVariable.current_local_solution;
>   *system_reaction.old_local_solution = *system_reaction.current_local_solution;
>   *system_diffusion.old_local_solution = *system_diffusion.current_local_solution;
>
>   equation_systems.get_system("gateVariable").solve();
>   equation_systems.get_system("reaction").solve();
>
>   // The diffusion solve starts from the reaction solution.
>   *system_diffusion.old_local_solution = *system_reaction.current_local_solution;
>   equation_systems.get_system("diffusion").solve();
> }
> // ----------------------------------------------------------------------
>
>
> In the assembly function, I call old_solution whenever I need to read the solution of each PDE, as follows:
> system.old_solution(dof_indices[l]);
>
> I get this error when I run the project in parallel mode: "Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range".
> I even tried calling close() after solving each equation, e.g. equation_systems.get_system("reaction").solution->close();, which didn't help.
> I don't know how to deal with this problem.
>
> I would appreciate any help/suggestion.
>
> Thanks in advance,
> Fatemeh
>
> FYI:
> [1]PETSC ERROR: ------------------------------------------------------------------------
> [1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
> [1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [1]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> [1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
> [1]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
> [1]PETSC ERROR: to get more information on the crash.
> [1]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
> [1]PETSC ERROR: Signal received
> [1]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
> [1]PETSC ERROR: Petsc Release Version 3.8.3, Dec, 09, 2017
> [1]PETSC ERROR: /home/chegini/inverseProblem/LibmeshCode_InParallel_Cluster/./assemble on a arch-linux2-c-opt named icsnode17 by chegini Tue Nov 13 10:56:57 2018
> [1]PETSC ERROR: Configure options --prefix=/apps/petsc/3.8.3 --download-hypre=1 --with-ssl=0 --with-debugging=no --with-pic=1 --with-shared-libraries=1 --with-cc=mpicc --with-cxx=mpicxx --with-fc=mpif90 --download-fblaslapack=1 --download-metis=1 --download-parmetis=1 --download-superlu_dist=1 --download-mumps=1 --download-scalapack=1 --CC=mpicc --CXX=mpicxx --FC=mpif90 --F77=mpif77 --F90=mpif90 --CFLAGS="-fPIC -fopenmp" --CXXFLAGS="-fPIC -fopenmp" --FFLAGS="-fPIC -fopenmp" --FCFLAGS="-fPIC -fopenmp" --F90FLAGS="-fPIC -fopenmp" --F77FLAGS="-fPIC -fopenmp" PETSC_DIR=/apps/petsc/3.8.3/src/petsc-3.8.3
> [1]PETSC ERROR: #1 User provided function() line 0 in unknown file
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
> with errorcode 59.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> In: PMI_Abort(59, N/A)
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> slurmstepd: *** STEP 924595.0 ON icsnode17 CANCELLED AT 2018-11-13T11:00:55 ***
> srun: error: icsnode17: tasks 0,3,5-16,18-19: Killed
> srun: error: icsnode17: task 1: Exited with exit code 59
> srun: error: icsnode17: tasks 2,4,17: Killed

_______________________________________________
Libmesh-users mailing list
Libmesh-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/libmesh-users