On Apr 19, 2014, at 12:11 AM, TAY wee-beng <[email protected]> wrote:
> On 19/4/2014 12:10 PM, Barry Smith wrote:
>> On Apr 18, 2014, at 9:57 PM, TAY wee-beng <[email protected]> wrote:
>>
>>> On 19/4/2014 3:53 AM, Barry Smith wrote:
>>>> Hmm,
>>>>
>>>>    Interface DMDAVecGetArrayF90
>>>>      Subroutine DMDAVecGetArrayF903(da1, v, d1, ierr)
>>>>        USE_DM_HIDE
>>>>        DM_HIDE da1
>>>>        VEC_HIDE v
>>>>        PetscScalar, pointer :: d1(:,:,:)
>>>>        PetscErrorCode ierr
>>>>      End Subroutine
>>>>
>>>> So d1 is an F90 POINTER. But your subroutine seems to be treating it
>>>> as a "plain old Fortran array"?
>>>>
>>>>    real(8), intent(inout) :: u(:,:,:), v(:,:,:), w(:,:,:)
>
> Hi,
>
> So d1 is a pointer, and it's different if I declare it as a "plain old
> Fortran array"? Because I declare it as a Fortran array and it works without
> any problem if I only call DMDAVecGetArrayF90 and DMDAVecRestoreArrayF90
> with "u".
>
> But if I call DMDAVecGetArrayF90 and DMDAVecRestoreArrayF90 with "u", "v"
> and "w", errors start to happen. I wonder why...
>
> Also, suppose I call:
>
>    call DMDAVecGetArrayF90(da_u,u_local,u_array,ierr)
>    call DMDAVecGetArrayF90(da_v,v_local,v_array,ierr)
>    call DMDAVecGetArrayF90(da_w,w_local,w_array,ierr)
>
>    u_array ....
>    v_array ....  etc.
>
> Now, to restore the arrays, does the order in which they are restored matter?

   No, it should not matter. If it matters, that is a sign that memory has been
written to incorrectly earlier in the code.

> As in w, then v and u?
>
>    call DMDAVecRestoreArrayF90(da_w,w_local,w_array,ierr)
>    call DMDAVecRestoreArrayF90(da_v,v_local,v_array,ierr)
>    call DMDAVecRestoreArrayF90(da_u,u_local,u_array,ierr)
>
> thanks
>
>>>> Note also that the beginning and end indices of u, v, w are different for
>>>> each process; see for example
>>>> http://www.mcs.anl.gov/petsc/petsc-3.4/src/dm/examples/tutorials/ex11f90.F
>>>> (and they do not start at 1). This is how to get the loop bounds.
>>>
>>> Hi,
>>>
>>> In my case, I fixed u, v, w such that their indices are the same. I also
>>> checked using DMDAGetCorners and DMDAGetGhostCorners. Now the problem lies
>>> in my subroutine treating it as a "plain old Fortran array".
>>>
>>> If I declare them as pointers, their indices follow the C 0-start
>>> convention, is that so?
>>
>> Not really. It is that on each process you need to access them with the
>> indices indicated by DMDAGetCorners() for global vectors and
>> DMDAGetGhostCorners() for local vectors. So really C or Fortran doesn't
>> make any difference.
>>
>>> So my problem now is that in my old MPI code, u(i,j,k) follows the Fortran
>>> 1-start convention. Is there some way to manipulate things so that I do
>>> not have to change my u(i,j,k) to u(i-1,j-1,k-1)?
>>
>> If your code wishes to access them with indices plus one from the values
>> returned by DMDAGetCorners() for global vectors and DMDAGetGhostCorners()
>> for local vectors, then you need to manually subtract off the 1.
>>
>> Barry
>>
>>> Thanks.
>>>
>>>> Barry
>>>>
>>>> On Apr 18, 2014, at 10:58 AM, TAY wee-beng <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I tried to pinpoint the problem. I reduced my job size so that I can run
>>>>> on 1 processor. I tried using valgrind, but perhaps because I'm using the
>>>>> optimized version, it didn't catch the error, other than saying
>>>>> "Segmentation fault (core dumped)".
>>>>>
>>>>> However, by re-writing my code, I found out a few things:
>>>>>
>>>>> 1. If I write my code this way:
>>>>>
>>>>>    call DMDAVecGetArrayF90(da_u,u_local,u_array,ierr)
>>>>>    call DMDAVecGetArrayF90(da_v,v_local,v_array,ierr)
>>>>>    call DMDAVecGetArrayF90(da_w,w_local,w_array,ierr)
>>>>>
>>>>>    u_array = ....
>>>>>    v_array = ....
>>>>>    w_array = ....
>>>>>
>>>>>    call DMDAVecRestoreArrayF90(da_w,w_local,w_array,ierr)
>>>>>    call DMDAVecRestoreArrayF90(da_v,v_local,v_array,ierr)
>>>>>    call DMDAVecRestoreArrayF90(da_u,u_local,u_array,ierr)
>>>>>
>>>>> the code runs fine.
>>>>>
>>>>> 2. If I write my code this way:
>>>>>
>>>>>    call DMDAVecGetArrayF90(da_u,u_local,u_array,ierr)
>>>>>    call DMDAVecGetArrayF90(da_v,v_local,v_array,ierr)
>>>>>    call DMDAVecGetArrayF90(da_w,w_local,w_array,ierr)
>>>>>
>>>>>    call uvw_array_change(u_array,v_array,w_array)  -> this subroutine
>>>>>    does the same modification as the above
>>>>>
>>>>>    call DMDAVecRestoreArrayF90(da_w,w_local,w_array,ierr)
>>>>>    call DMDAVecRestoreArrayF90(da_v,v_local,v_array,ierr)
>>>>>    call DMDAVecRestoreArrayF90(da_u,u_local,u_array,ierr)  -> error
>>>>>
>>>>> where the subroutine is:
>>>>>
>>>>>    subroutine uvw_array_change(u,v,w)
>>>>>
>>>>>       real(8), intent(inout) :: u(:,:,:), v(:,:,:), w(:,:,:)
>>>>>
>>>>>       u ...
>>>>>       v ...
>>>>>       w ...
>>>>>
>>>>>    end subroutine uvw_array_change
>>>>>
>>>>> the above gives an error at:
>>>>>
>>>>>    call DMDAVecRestoreArrayF90(da_u,u_local,u_array,ierr)
>>>>>
>>>>> 3. Same as above, except I change the order of the last 3 lines to:
>>>>>
>>>>>    call DMDAVecRestoreArrayF90(da_u,u_local,u_array,ierr)
>>>>>    call DMDAVecRestoreArrayF90(da_v,v_local,v_array,ierr)
>>>>>    call DMDAVecRestoreArrayF90(da_w,w_local,w_array,ierr)
>>>>>
>>>>> so they are now in reversed order. Now it works.
>>>>>
>>>>> 4. Same as 2 or 3, except the subroutine is changed to:
>>>>>
>>>>>    subroutine uvw_array_change(u,v,w)
>>>>>
>>>>>       real(8), intent(inout) :: u(start_indices(1):end_indices(1),start_indices(2):end_indices(2),start_indices(3):end_indices(3))
>>>>>       real(8), intent(inout) :: v(start_indices(1):end_indices(1),start_indices(2):end_indices(2),start_indices(3):end_indices(3))
>>>>>       real(8), intent(inout) :: w(start_indices(1):end_indices(1),start_indices(2):end_indices(2),start_indices(3):end_indices(3))
>>>>>
>>>>>       u ...
>>>>>       v ...
>>>>>       w ...
>>>>>
>>>>>    end subroutine uvw_array_change
>>>>>
>>>>> The start_indices and end_indices simply shift the 0-based indices of the
>>>>> C convention to the 1-based indices of the Fortran convention. This is
>>>>> necessary in my case because most of my code starts array counting at 1,
>>>>> hence the "trick".
>>>>>
>>>>> However, now, no matter which order the DMDAVecRestoreArrayF90 calls are
>>>>> in (as in 2 or 3), an error occurs at
>>>>> "call DMDAVecRestoreArrayF90(da_v,v_local,v_array,ierr)".
>>>>>
>>>>> So did I violate something and cause memory corruption with the trick
>>>>> above? But I can't think of any way other than the "trick" to continue
>>>>> using the 1-based index convention.
>>>>>
>>>>> Thank you.
>>>>>
>>>>> Yours sincerely,
>>>>>
>>>>> TAY wee-beng
>>>>>
>>>>> On 15/4/2014 8:00 PM, Barry Smith wrote:
>>>>>> Try running under valgrind:
>>>>>> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>>>>>
>>>>>> On Apr 14, 2014, at 9:47 PM, TAY wee-beng <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Barry,
>>>>>>>
>>>>>>> As I mentioned earlier, the code works fine in PETSc debug mode but
>>>>>>> fails in non-debug mode.
>>>>>>>
>>>>>>> I have attached my code.
>>>>>>>
>>>>>>> Thank you.
>>>>>>>
>>>>>>> Yours sincerely,
>>>>>>>
>>>>>>> TAY wee-beng
>>>>>>>
>>>>>>> On 15/4/2014 2:26 AM, Barry Smith wrote:
>>>>>>>> Please send the code that creates da_w and the declarations of
>>>>>>>> w_array.
>>>>>>>>
>>>>>>>> Barry
>>>>>>>>
>>>>>>>> On Apr 14, 2014, at 9:40 AM, TAY wee-beng <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Barry,
>>>>>>>>>
>>>>>>>>> I'm not too sure how to do it. I'm running MPI, so I run:
>>>>>>>>>
>>>>>>>>>    mpirun -n 4 ./a.out -start_in_debugger
>>>>>>>>>
>>>>>>>>> I got the message below. Before the gdb windows appear (through X11),
>>>>>>>>> the program aborts.
>>>>>>>>>
>>>>>>>>> I also tried running on another cluster and it worked. I also tried
>>>>>>>>> the current cluster in debug mode and it worked too.
>>>>>>>>>
>>>>>>>>> mpirun -n 4 ./a.out -start_in_debugger
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> An MPI process has executed an operation involving a call to the
>>>>>>>>> "fork()" system call to create a child process. Open MPI is currently
>>>>>>>>> operating in a condition that could result in memory corruption or
>>>>>>>>> other system errors; your MPI job may hang, crash, or produce silent
>>>>>>>>> data corruption. The use of fork() (or system() or other calls that
>>>>>>>>> create child processes) is strongly discouraged.
>>>>>>>>>
>>>>>>>>> The process that invoked fork was:
>>>>>>>>>
>>>>>>>>>   Local host:          n12-76 (PID 20235)
>>>>>>>>>   MPI_COMM_WORLD rank: 2
>>>>>>>>>
>>>>>>>>> If you are *absolutely sure* that your application will successfully
>>>>>>>>> and correctly survive a call to fork(), you may disable this warning
>>>>>>>>> by setting the mpi_warn_on_fork MCA parameter to 0.
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> [2]PETSC ERROR: PETSC: Attaching gdb to ./a.out of pid 20235 on display localhost:50.0 on machine n12-76
>>>>>>>>> [0]PETSC ERROR: PETSC: Attaching gdb to ./a.out of pid 20233 on display localhost:50.0 on machine n12-76
>>>>>>>>> [1]PETSC ERROR: PETSC: Attaching gdb to ./a.out of pid 20234 on display localhost:50.0 on machine n12-76
>>>>>>>>> [3]PETSC ERROR: PETSC: Attaching gdb to ./a.out of pid 20236 on display localhost:50.0 on machine n12-76
>>>>>>>>> [n12-76:20232] 3 more processes have sent help message help-mpi-runtime.txt / mpi_init:warn-fork
>>>>>>>>> [n12-76:20232] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>>>>>>
>>>>>>>>> ....
>>>>>>>>>
>>>>>>>>> 1
>>>>>>>>> [1]PETSC ERROR: ------------------------------------------------------------------------
>>>>>>>>> [1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
>>>>>>>>> [1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>>>>>>>>> [1]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>>>>>>>> [1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
>>>>>>>>> [1]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
>>>>>>>>> [1]PETSC ERROR: to get more information on the crash.
>>>>>>>>> [1]PETSC ERROR: User provided function() line 0 in unknown directory unknown file (null)
>>>>>>>>> [3]PETSC ERROR: ------------------------------------------------------------------------
>>>>>>>>> [3]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
>>>>>>>>> [3]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>>>>>>>>> [3]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>>>>>>>> [3]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
>>>>>>>>> [3]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
>>>>>>>>> [3]PETSC ERROR: to get more information on the crash.
>>>>>>>>> [3]PETSC ERROR: User provided function() line 0 in unknown directory unknown file (null)
>>>>>>>>>
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>> Thank you.
>>>>>>>>>
>>>>>>>>> Yours sincerely,
>>>>>>>>>
>>>>>>>>> TAY wee-beng
>>>>>>>>>
>>>>>>>>> On 14/4/2014 9:05 PM, Barry Smith wrote:
>>>>>>>>>> Because IO doesn't always get flushed immediately, it may not be
>>>>>>>>>> hanging at this point. It is better to use the option
>>>>>>>>>> -start_in_debugger, then type cont in each debugger window. Then,
>>>>>>>>>> when you think it is "hanging", do a control-C in each debugger
>>>>>>>>>> window and type where to see where each process is. You can also
>>>>>>>>>> look around in the debugger at variables to see why it is "hanging"
>>>>>>>>>> at that point.
>>>>>>>>>>
>>>>>>>>>> Barry
>>>>>>>>>>
>>>>>>>>>> These routines don't have any parallel communication in them, so
>>>>>>>>>> they are unlikely to hang.
>>>>>>>>>>
>>>>>>>>>> On Apr 14, 2014, at 6:52 AM, TAY wee-beng <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> My code hangs, and I added mpi_barrier and print statements to
>>>>>>>>>>> catch the bug. I found that it hangs after printing "7". Is it
>>>>>>>>>>> because I'm doing something wrong? I need to access the u, v, w
>>>>>>>>>>> arrays, so I use DMDAVecGetArrayF90. After access, I use
>>>>>>>>>>> DMDAVecRestoreArrayF90.
>>>>>>>>>>>
>>>>>>>>>>>    call DMDAVecGetArrayF90(da_u,u_local,u_array,ierr)
>>>>>>>>>>>    call MPI_Barrier(MPI_COMM_WORLD,ierr); if (myid==0) print *,"3"
>>>>>>>>>>>    call DMDAVecGetArrayF90(da_v,v_local,v_array,ierr)
>>>>>>>>>>>    call MPI_Barrier(MPI_COMM_WORLD,ierr); if (myid==0) print *,"4"
>>>>>>>>>>>    call DMDAVecGetArrayF90(da_w,w_local,w_array,ierr)
>>>>>>>>>>>    call MPI_Barrier(MPI_COMM_WORLD,ierr); if (myid==0) print *,"5"
>>>>>>>>>>>    call I_IIB_uv_initial_1st_dm(I_cell_no_u1,I_cell_no_v1,I_cell_no_w1,I_cell_u1,I_cell_v1,I_cell_w1,u_array,v_array,w_array)
>>>>>>>>>>>    call MPI_Barrier(MPI_COMM_WORLD,ierr); if (myid==0) print *,"6"
>>>>>>>>>>>    call DMDAVecRestoreArrayF90(da_w,w_local,w_array,ierr)   !must be in reverse order
>>>>>>>>>>>    call MPI_Barrier(MPI_COMM_WORLD,ierr); if (myid==0) print *,"7"
>>>>>>>>>>>    call DMDAVecRestoreArrayF90(da_v,v_local,v_array,ierr)
>>>>>>>>>>>    call MPI_Barrier(MPI_COMM_WORLD,ierr); if (myid==0) print *,"8"
>>>>>>>>>>>    call DMDAVecRestoreArrayF90(da_u,u_local,u_array,ierr)
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Thank you.
>>>>>>>>>>>
>>>>>>>>>>> Yours sincerely,
>>>>>>>>>>>
>>>>>>>>>>> TAY wee-beng
>>>>>>>
>>>>>>> <code.txt>
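For reference, the access pattern Barry describes above (the array obtained as an
F90 pointer, loop bounds taken from DMDAGetCorners()/DMDAGetGhostCorners()) looks
roughly like the untested sketch below. It reuses the variable names from this
thread (da_u, u_local, u_array) and assumes the usual PETSc Fortran includes or
modules are already set up as in the rest of the code.

    ! Untested sketch; names da_u, u_local, u_array follow the thread.
    PetscScalar, pointer :: u_array(:,:,:)
    PetscInt             :: xs, ys, zs, xm, ym, zm, i, j, k
    PetscErrorCode       :: ierr

    call DMDAVecGetArrayF90(da_u, u_local, u_array, ierr)

    ! Owned index range of this process; for a local (ghosted) vector such as
    ! u_local, the valid indices of u_array extend over DMDAGetGhostCorners().
    call DMDAGetCorners(da_u, xs, ys, zs, xm, ym, zm, ierr)

    do k = zs, zs + zm - 1
       do j = ys, ys + ym - 1
          do i = xs, xs + xm - 1
             u_array(i, j, k) = 0.0d0   ! placeholder for the actual update;
                                        ! note the indices do not start at 1
          end do
       end do
    end do

    call DMDAVecRestoreArrayF90(da_u, u_local, u_array, ierr)

The same DMDAGetCorners() bounds apply to a global vector; to touch ghost points
of a local vector, use the range returned by DMDAGetGhostCorners() instead.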

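If the arrays are modified inside a separate subroutine, as in uvw_array_change()
above, one option (a sketch only, not verified against the crash reported here)
is to give the dummy arguments the pointer attribute, matching the
"PetscScalar, pointer :: d1(:,:,:)" declaration in the DMDAVecGetArrayF903
interface quoted at the top of the thread. Plain assumed-shape dummies such as
"real(8), intent(inout) :: u(:,:,:)" are legal Fortran but remap the lower bounds
to 1 inside the subroutine, whereas pointer dummies keep the bounds that
DMDAVecGetArrayF90 set up. The module name below is made up; the module exists
only to provide the explicit interface that pointer dummies require, and real(8)
stands in for PetscScalar on a standard real, double-precision PETSc build.

    module uvw_change_mod              ! hypothetical module name
      implicit none
    contains
      subroutine uvw_array_change(u, v, w)
        ! Pointer dummies, as in the DMDAVecGetArrayF903 interface, so the
        ! DMDA index ranges (which do not start at 1) carry over unchanged.
        real(8), pointer :: u(:,:,:), v(:,:,:), w(:,:,:)
        integer          :: i, j, k

        ! Assumes u, v, w share the same index ranges, as stated in the thread.
        do k = lbound(u,3), ubound(u,3)
           do j = lbound(u,2), ubound(u,2)
              do i = lbound(u,1), ubound(u,1)
                 u(i,j,k) = 0.0d0      ! placeholder updates; replace with the
                 v(i,j,k) = 0.0d0      ! real modifications of u, v, w
                 w(i,j,k) = 0.0d0
              end do
           end do
        end do
      end subroutine uvw_array_change
    end module uvw_change_mod

The caller adds "use uvw_change_mod" and keeps the existing call
"call uvw_array_change(u_array, v_array, w_array)" unchanged.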