On Apr 19, 2014, at 12:11 AM, TAY wee-beng <[email protected]> wrote:

> On 19/4/2014 12:10 PM, Barry Smith wrote:
>> On Apr 18, 2014, at 9:57 PM, TAY wee-beng <[email protected]> wrote:
>> 
>>> On 19/4/2014 3:53 AM, Barry Smith wrote:
>>>>   Hmm,
>>>> 
>>>>       Interface DMDAVecGetArrayF90
>>>>         Subroutine DMDAVecGetArrayF903(da1, v,d1,ierr)
>>>>           USE_DM_HIDE
>>>>           DM_HIDE da1
>>>>           VEC_HIDE v
>>>>           PetscScalar,pointer :: d1(:,:,:)
>>>>           PetscErrorCode ierr
>>>>         End Subroutine
>>>> 
>>>>    So d1 is an F90 POINTER, but your subroutine seems to be treating it
>>>> as a “plain old Fortran array”?
>>>> real(8), intent(inout) :: u(:,:,:),v(:,:,:),w(:,:,:)
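>>>> 
>>>>    So in the calling code the array argument should be declared as an F90
>>>> pointer, e.g. (a minimal sketch, using the variable names from your code):
>>>> 
>>>>       PetscScalar, pointer :: u_array(:,:,:)
>>>> 
>>>>       call DMDAVecGetArrayF90(da_u,u_local,u_array,ierr)
>>>> 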
> Hi,
> 
> So d1 is a pointer, and it makes a difference if I declare it as a "plain old 
> Fortran array"? I declared it as a Fortran array and it works without any 
> problem as long as I only call DMDAVecGetArrayF90 and DMDAVecRestoreArrayF90 
> with "u".
> 
> But if I call DMDAVecGetArrayF90 and DMDAVecRestoreArrayF90 with "u", "v" and 
> "w", errors start to happen. I wonder why...
> 
> Also, suppose I call:
> 
> call DMDAVecGetArrayF90(da_u,u_local,u_array,ierr)
> 
>    call DMDAVecGetArrayF90(da_v,v_local,v_array,ierr)
> 
>    call DMDAVecGetArrayF90(da_w,w_local,w_array,ierr)
> 
> u_array ....
> 
> v_array .... etc
> 
> Now, to restore the arrays, does the order in which they are restored matter?

   No, it should not matter. If it does matter, that is a sign that memory has 
been written to incorrectly earlier in the code.


> As in w, then v and u?
> 
> call DMDAVecRestoreArrayF90(da_w,w_local,w_array,ierr)
> call DMDAVecRestoreArrayF90(da_v,v_local,v_array,ierr)
> call DMDAVecRestoreArrayF90(da_u,u_local,u_array,ierr)
> 
> thanks
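
   To illustrate that the restore order should not matter, here is a minimal 
sketch of the pattern (assuming da_u/da_v/da_w and the local vectors are set up 
as in your code); the restores are deliberately not in the reverse order of the 
gets:

      PetscScalar, pointer :: u_array(:,:,:), v_array(:,:,:), w_array(:,:,:)

      call DMDAVecGetArrayF90(da_u,u_local,u_array,ierr)
      call DMDAVecGetArrayF90(da_v,v_local,v_array,ierr)
      call DMDAVecGetArrayF90(da_w,w_local,w_array,ierr)

      ! ... use u_array, v_array, w_array only within the bounds given by
      ! DMDAGetGhostCorners() for these local vectors ...

      call DMDAVecRestoreArrayF90(da_u,u_local,u_array,ierr)
      call DMDAVecRestoreArrayF90(da_v,v_local,v_array,ierr)
      call DMDAVecRestoreArrayF90(da_w,w_local,w_array,ierr)

   If this pattern crashes only when the assignments are moved into a 
subroutine, that again points to memory being written outside the valid bounds, 
not to the restore order.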
>>>> 
>>>>    Note also that the beginning and end indices of u, v, w are different 
>>>> for each process; see for example 
>>>> http://www.mcs.anl.gov/petsc/petsc-3.4/src/dm/examples/tutorials/ex11f90.F 
>>>> (and they do not start at 1). That example shows how to get the loop bounds.
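>>>> 
>>>>    For example, a minimal sketch of getting the bounds on each process 
>>>> might look like the following (for a local/ghosted vector you would call 
>>>> DMDAGetGhostCorners() instead):
>>>> 
>>>>       PetscInt i,j,k,xs,ys,zs,xm,ym,zm
>>>> 
>>>>       call DMDAGetCorners(da_u,xs,ys,zs,xm,ym,zm,ierr)
>>>>       do k=zs,zs+zm-1
>>>>         do j=ys,ys+ym-1
>>>>           do i=xs,xs+xm-1
>>>>             u_array(i,j,k) = 0.0
>>>>           end do
>>>>         end do
>>>>       end do
>>>> 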
>>> Hi,
>>> 
>>> In my case, I fixed u, v, w such that their indices are the same. I also 
>>> checked using DMDAGetCorners and DMDAGetGhostCorners. Now the problem lies 
>>> in my subroutine treating them as “plain old Fortran arrays”.
>>> 
>>> If I declare them as pointers, do their indices follow the C 0-start 
>>> convention?
>>    Not really. The point is that on each process you need to access them 
>> using the indices indicated by DMDAGetCorners() for global vectors and 
>> DMDAGetGhostCorners() for local vectors. So really C or Fortran doesn’t 
>> make any difference.
>> 
>> 
>>> So my problem now is that in my old MPI code, u(i,j,k) follows the 
>>> Fortran 1-start convention. Is there some way to handle this so that I do 
>>> not have to change my u(i,j,k) to u(i-1,j-1,k-1)?
>>   If your code wishes to access them with indices offset by one from the 
>> values returned by DMDAGetCorners() for global vectors and 
>> DMDAGetGhostCorners() for local vectors, then you need to manually subtract 
>> off the 1.
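>> 
>>   A minimal sketch of what that might look like, assuming (as in your 
>> single-process test) that the corner indices start at 0, that your loops run 
>> from 1, and that xm,ym,zm come from DMDAGetCorners():
>> 
>>       do k = 1, zm
>>         do j = 1, ym
>>           do i = 1, xm
>>             u_array(i-1,j-1,k-1) = 0.0
>>           end do
>>         end do
>>       end do
>> 
>>   In general the offsets are whatever DMDAGetCorners()/DMDAGetGhostCorners() 
>> return on that process, not necessarily 1.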
>> 
>>   Barry
>> 
>>> Thanks.
>>>>   Barry
>>>> 
>>>> On Apr 18, 2014, at 10:58 AM, TAY wee-beng <[email protected]> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> I tried to pinpoint the problem. I reduced my job size so that I can 
>>>>> run on 1 processor. I tried using valgrind, but perhaps because I'm using 
>>>>> the optimized version, it didn't catch the error beyond saying 
>>>>> "Segmentation fault (core dumped)".
>>>>> 
>>>>> However, by re-writing my code, I found out a few things:
>>>>> 
>>>>> 1. if I write my code this way:
>>>>> 
>>>>> call DMDAVecGetArrayF90(da_u,u_local,u_array,ierr)
>>>>> 
>>>>> call DMDAVecGetArrayF90(da_v,v_local,v_array,ierr)
>>>>> 
>>>>> call DMDAVecGetArrayF90(da_w,w_local,w_array,ierr)
>>>>> 
>>>>> u_array = ....
>>>>> 
>>>>> v_array = ....
>>>>> 
>>>>> w_array = ....
>>>>> 
>>>>> call DMDAVecRestoreArrayF90(da_w,w_local,w_array,ierr)
>>>>> 
>>>>> call DMDAVecRestoreArrayF90(da_v,v_local,v_array,ierr)
>>>>> 
>>>>> call DMDAVecRestoreArrayF90(da_u,u_local,u_array,ierr)
>>>>> 
>>>>> The code runs fine.
>>>>> 
>>>>> 2. if I write my code this way:
>>>>> 
>>>>> call DMDAVecGetArrayF90(da_u,u_local,u_array,ierr)
>>>>> 
>>>>> call DMDAVecGetArrayF90(da_v,v_local,v_array,ierr)
>>>>> 
>>>>> call DMDAVecGetArrayF90(da_w,w_local,w_array,ierr)
>>>>> 
>>>>> call uvw_array_change(u_array,v_array,w_array) -> this subroutine does 
>>>>> the same modification as the above.
>>>>> 
>>>>> call DMDAVecRestoreArrayF90(da_w,w_local,w_array,ierr)
>>>>> 
>>>>> call DMDAVecRestoreArrayF90(da_v,v_local,v_array,ierr)
>>>>> 
>>>>> call DMDAVecRestoreArrayF90(da_u,u_local,u_array,ierr) -> error
>>>>> 
>>>>> where the subroutine is:
>>>>> 
>>>>> subroutine uvw_array_change(u,v,w)
>>>>> 
>>>>> real(8), intent(inout) :: u(:,:,:),v(:,:,:),w(:,:,:)
>>>>> 
>>>>> u ...
>>>>> v...
>>>>> w ...
>>>>> 
>>>>> end subroutine uvw_array_change
>>>>> 
>>>>> The above will give an error at:
>>>>> 
>>>>> call DMDAVecRestoreArrayF90(da_u,u_local,u_array,ierr)
>>>>> 
>>>>> 3. Same as above, except I change the order of the last 3 lines to:
>>>>> 
>>>>> call DMDAVecRestoreArrayF90(da_u,u_local,u_array,ierr)
>>>>> 
>>>>> call DMDAVecRestoreArrayF90(da_v,v_local,v_array,ierr)
>>>>> 
>>>>> call DMDAVecRestoreArrayF90(da_w,w_local,w_array,ierr)
>>>>> 
>>>>> So they are now in reverse order. Now it works.
>>>>> 
>>>>> 4. Same as 2 or 3, except the subroutine is changed to:
>>>>> 
>>>>> subroutine uvw_array_change(u,v,w)
>>>>> 
>>>>> real(8), intent(inout) :: u(start_indices(1):end_indices(1),start_indices(2):end_indices(2),start_indices(3):end_indices(3))
>>>>> 
>>>>> real(8), intent(inout) :: v(start_indices(1):end_indices(1),start_indices(2):end_indices(2),start_indices(3):end_indices(3))
>>>>> 
>>>>> real(8), intent(inout) :: w(start_indices(1):end_indices(1),start_indices(2):end_indices(2),start_indices(3):end_indices(3))
>>>>> 
>>>>> u ...
>>>>> v...
>>>>> w ...
>>>>> 
>>>>> end subroutine uvw_array_change
>>>>> 
>>>>> The start_indices and end_indices simply shift the 0-based indices of the 
>>>>> C convention to the 1-based indices of the Fortran convention. This is 
>>>>> necessary in my case because most of my code starts array counting at 1, 
>>>>> hence the "trick".
>>>>> 
>>>>> However, now no matter which order the DMDAVecRestoreArrayF90 calls are 
>>>>> in (as in 2 or 3), an error occurs at "call 
>>>>> DMDAVecRestoreArrayF90(da_v,v_local,v_array,ierr)".
>>>>> 
>>>>> So did I violate something and cause memory corruption with the trick 
>>>>> above? I can't think of any way other than the "trick" to keep using the 
>>>>> 1-based index convention.
>>>>> 
>>>>> Thank you.
>>>>> 
>>>>> Yours sincerely,
>>>>> 
>>>>> TAY wee-beng
>>>>> 
>>>>> On 15/4/2014 8:00 PM, Barry Smith wrote:
>>>>>>   Try running under valgrind 
>>>>>> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>>>>> 
>>>>>> 
>>>>>> On Apr 14, 2014, at 9:47 PM, TAY wee-beng <[email protected]> wrote:
>>>>>> 
>>>>>>> Hi Barry,
>>>>>>> 
>>>>>>> As I mentioned earlier, the code works fine in PETSc debug mode but 
>>>>>>> fails in non-debug mode.
>>>>>>> 
>>>>>>> I have attached my code.
>>>>>>> 
>>>>>>> Thank you
>>>>>>> 
>>>>>>> Yours sincerely,
>>>>>>> 
>>>>>>> TAY wee-beng
>>>>>>> 
>>>>>>> On 15/4/2014 2:26 AM, Barry Smith wrote:
>>>>>>>>   Please send the code that creates da_w and the declarations of 
>>>>>>>> w_array
>>>>>>>> 
>>>>>>>>   Barry
>>>>>>>> 
>>>>>>>> On Apr 14, 2014, at 9:40 AM, TAY wee-beng
>>>>>>>> <[email protected]>
>>>>>>>>  wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> Hi Barry,
>>>>>>>>> 
>>>>>>>>> I'm not too sure how to do it. I'm running MPI. So I run:
>>>>>>>>> 
>>>>>>>>>  mpirun -n 4 ./a.out -start_in_debugger
>>>>>>>>> 
>>>>>>>>> I got the message below. Before the gdb windows appear (through X11), 
>>>>>>>>> the program aborts.
>>>>>>>>> 
>>>>>>>>> Also, I tried running on another cluster and it worked. I also tried 
>>>>>>>>> the current cluster in debug mode and it worked too.
>>>>>>>>> 
>>>>>>>>> mpirun -n 4 ./a.out -start_in_debugger
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> An MPI process has executed an operation involving a call to the
>>>>>>>>> "fork()" system call to create a child process.  Open MPI is currently
>>>>>>>>> operating in a condition that could result in memory corruption or
>>>>>>>>> other system errors; your MPI job may hang, crash, or produce silent
>>>>>>>>> data corruption.  The use of fork() (or system() or other calls that
>>>>>>>>> create child processes) is strongly discouraged.
>>>>>>>>> 
>>>>>>>>> The process that invoked fork was:
>>>>>>>>> 
>>>>>>>>>   Local host:          n12-76 (PID 20235)
>>>>>>>>>   MPI_COMM_WORLD rank: 2
>>>>>>>>> 
>>>>>>>>> If you are *absolutely sure* that your application will successfully
>>>>>>>>> and correctly survive a call to fork(), you may disable this warning
>>>>>>>>> by setting the mpi_warn_on_fork MCA parameter to 0.
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> [2]PETSC ERROR: PETSC: Attaching gdb to ./a.out of pid 20235 on 
>>>>>>>>> display localhost:50.0 on machine n12-76
>>>>>>>>> [0]PETSC ERROR: PETSC: Attaching gdb to ./a.out of pid 20233 on 
>>>>>>>>> display localhost:50.0 on machine n12-76
>>>>>>>>> [1]PETSC ERROR: PETSC: Attaching gdb to ./a.out of pid 20234 on 
>>>>>>>>> display localhost:50.0 on machine n12-76
>>>>>>>>> [3]PETSC ERROR: PETSC: Attaching gdb to ./a.out of pid 20236 on 
>>>>>>>>> display localhost:50.0 on machine n12-76
>>>>>>>>> [n12-76:20232] 3 more processes have sent help message 
>>>>>>>>> help-mpi-runtime.txt / mpi_init:warn-fork
>>>>>>>>> [n12-76:20232] Set MCA parameter "orte_base_help_aggregate" to 0 to 
>>>>>>>>> see all help / error messages
>>>>>>>>> 
>>>>>>>>> ....
>>>>>>>>> 
>>>>>>>>>  1
>>>>>>>>> [1]PETSC ERROR: 
>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>> [1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, 
>>>>>>>>> probably memory access out of range
>>>>>>>>> [1]PETSC ERROR: Try option -start_in_debugger or 
>>>>>>>>> -on_error_attach_debugger
>>>>>>>>> [1]PETSC ERROR: or see
>>>>>>>>> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind[1]PETSC 
>>>>>>>>> ERROR: or try http://valgrind.org
>>>>>>>>>  on GNU/linux and Apple Mac OS X to find memory corruption errors
>>>>>>>>> [1]PETSC ERROR: configure using --with-debugging=yes, recompile, 
>>>>>>>>> link, and run
>>>>>>>>> [1]PETSC ERROR: to get more information on the crash.
>>>>>>>>> [1]PETSC ERROR: User provided function() line 0 in unknown directory 
>>>>>>>>> unknown file (null)
>>>>>>>>> [3]PETSC ERROR: 
>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>> [3]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, 
>>>>>>>>> probably memory access out of range
>>>>>>>>> [3]PETSC ERROR: Try option -start_in_debugger or 
>>>>>>>>> -on_error_attach_debugger
>>>>>>>>> [3]PETSC ERROR: or see
>>>>>>>>> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind[3]PETSC 
>>>>>>>>> ERROR: or try http://valgrind.org
>>>>>>>>>  on GNU/linux and Apple Mac OS X to find memory corruption errors
>>>>>>>>> [3]PETSC ERROR: configure using --with-debugging=yes, recompile, 
>>>>>>>>> link, and run
>>>>>>>>> [3]PETSC ERROR: to get more information on the crash.
>>>>>>>>> [3]PETSC ERROR: User provided function() line 0 in unknown directory 
>>>>>>>>> unknown file (null)
>>>>>>>>> 
>>>>>>>>> ...
>>>>>>>>> Thank you.
>>>>>>>>> 
>>>>>>>>> Yours sincerely,
>>>>>>>>> 
>>>>>>>>> TAY wee-beng
>>>>>>>>> 
>>>>>>>>> On 14/4/2014 9:05 PM, Barry Smith wrote:
>>>>>>>>> 
>>>>>>>>>>   Because IO doesn’t always get flushed immediately, it may not be 
>>>>>>>>>> hanging at this point. It is better to use the option 
>>>>>>>>>> -start_in_debugger, then type cont in each debugger window; when you 
>>>>>>>>>> think it is “hanging”, do a control-C in each debugger window and 
>>>>>>>>>> type where to see where each process is. You can also look around in 
>>>>>>>>>> the debugger at variables to see why it is “hanging” at that point.
>>>>>>>>>> 
>>>>>>>>>>    Barry
>>>>>>>>>> 
>>>>>>>>>>   These routines don’t have any parallel communication in them, so 
>>>>>>>>>> they are unlikely to hang.
>>>>>>>>>> 
>>>>>>>>>> On Apr 14, 2014, at 6:52 AM, TAY wee-beng
>>>>>>>>>> 
>>>>>>>>>> <[email protected]>
>>>>>>>>>> 
>>>>>>>>>>  wrote:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> Hi,
>>>>>>>>>>> 
>>>>>>>>>>> My code hangs, so I added mpi_barrier and print statements to catch 
>>>>>>>>>>> the bug. I found that it hangs after printing "7". Is it because I'm 
>>>>>>>>>>> doing something wrong? I need to access the u,v,w arrays, so I use 
>>>>>>>>>>> DMDAVecGetArrayF90. After access, I use DMDAVecRestoreArrayF90.
>>>>>>>>>>> 
>>>>>>>>>>>         call DMDAVecGetArrayF90(da_u,u_local,u_array,ierr)
>>>>>>>>>>>         call MPI_Barrier(MPI_COMM_WORLD,ierr);  if (myid==0) print *,"3"
>>>>>>>>>>>         call DMDAVecGetArrayF90(da_v,v_local,v_array,ierr)
>>>>>>>>>>>         call MPI_Barrier(MPI_COMM_WORLD,ierr);  if (myid==0) print *,"4"
>>>>>>>>>>>         call DMDAVecGetArrayF90(da_w,w_local,w_array,ierr)
>>>>>>>>>>>         call MPI_Barrier(MPI_COMM_WORLD,ierr);  if (myid==0) print *,"5"
>>>>>>>>>>>         call I_IIB_uv_initial_1st_dm(I_cell_no_u1,I_cell_no_v1,I_cell_no_w1,I_cell_u1,I_cell_v1,I_cell_w1,u_array,v_array,w_array)
>>>>>>>>>>>         call MPI_Barrier(MPI_COMM_WORLD,ierr);  if (myid==0) print *,"6"
>>>>>>>>>>>         call DMDAVecRestoreArrayF90(da_w,w_local,w_array,ierr)  !must be in reverse order
>>>>>>>>>>>         call MPI_Barrier(MPI_COMM_WORLD,ierr);  if (myid==0) print *,"7"
>>>>>>>>>>>         call DMDAVecRestoreArrayF90(da_v,v_local,v_array,ierr)
>>>>>>>>>>>         call MPI_Barrier(MPI_COMM_WORLD,ierr);  if (myid==0) print *,"8"
>>>>>>>>>>>         call DMDAVecRestoreArrayF90(da_u,u_local,u_array,ierr)
>>>>>>>>>>> -- 
>>>>>>>>>>> Thank you.
>>>>>>>>>>> 
>>>>>>>>>>> Yours sincerely,
>>>>>>>>>>> 
>>>>>>>>>>> TAY wee-beng
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>> <code.txt>
