Hi Gilles,

I believe I have found the problem. Initially I thought it might be an MPI
issue, since the failure occurred inside an MPI function. However, I am now
sure that the problem is an overflow of 4-byte signed integers.

I am dealing with computational domains that have a little more than a
billion cells (1024^3 cells), which is still within the range of a 4-byte
signed integer. The place where I run into the problem is below (I have
shortened the code):

 ! Fileviews
integer :: fileview_hexa
integer :: fileview_hexa_conn

integer, dimension(:), allocatable :: blocklength
integer, dimension(:), allocatable :: map
integer(KIND=8) :: size
integer(KIND=MPI_OFFSET_KIND) :: disp   ! MPI_OFFSET_KIND seems to be 8 bytes
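
As a quick sanity check on that comment, something like this untested
throwaway program (assuming the mpi module is available) should report the
integer kinds the MPI library was built with:

program check_kinds
  use mpi
  implicit none
  integer(kind=MPI_OFFSET_KIND)  :: off
  integer(kind=MPI_ADDRESS_KIND) :: addr
  ! MPI_OFFSET_KIND and MPI_ADDRESS_KIND are compile-time constants from the
  ! mpi module, so no MPI_INIT is needed just to inspect them.
  print *, 'MPI_OFFSET_KIND integer :', storage_size(off)/8,  'bytes'
  print *, 'MPI_ADDRESS_KIND integer:', storage_size(addr)/8, 'bytes'
  print *, 'default integer         :', storage_size(0)/8,    'bytes'
end program check_kinds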

allocate(map(ncells_hexa_),blocklength(ncells_hexa_))
map = hexa_-1

hexa_ is a 4-byte integer array that labels the local hexa elements on a
given rank; in my current code its maximum possible value is 1024^3 and its
minimum is 1.

blocklength = 1
call MPI_TYPE_INDEXED(ncells_hexa_,blocklength,map,MPI_REAL_SP,fileview_hexa,ierr)

MPI_REAL_SP is used for the 4-byte scalar data that will be written to the
file (e.g. the temperature scalar stored at a given hexa cell).

call MPI_TYPE_COMMIT(fileview_hexa,ierr)
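
To be explicit about how I am reading MPI_TYPE_INDEXED (this relates to
question 4 below), my understanding is that each displacement is counted in
units of the oldtype, and each blocklength entry is the number of consecutive
oldtype elements in that block. A toy, untested sketch with made-up numbers
(toy_type and toy_size are throwaway names):

! Three blocks of one element each, at element displacements 0, 5 and 9,
! so the committed type should describe 3 elements (12 bytes if MPI_REAL_SP
! is 4 bytes).
integer :: toy_type, toy_size
call MPI_TYPE_INDEXED(3,(/1,1,1/),(/0,5,9/),MPI_REAL_SP,toy_type,ierr)
call MPI_TYPE_COMMIT(toy_type,ierr)
call MPI_TYPE_SIZE(toy_type,toy_size,ierr)   ! expect 12
call MPI_TYPE_FREE(toy_type,ierr)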

map = map * 8

Here is where the problem arises. map is multiplied by 8 because the hexa-cell
node connectivity (8 node indices per cell) needs to be written. The node
numbers written to the file are 4 bytes, and the maximum node number still
fits within a 4-byte signed integer. But the displacement values now go up to
roughly 8*1024^3 = 8589934592, which exceeds the 4-byte signed integer limit
of 2147483647, so map needs to be integer(kind=8).

blocklength = 8
call MPI_TYPE_INDEXED(ncells_hexa_,blocklength,map,MPI_INTEGER,fileview_hexa_conn,ierr)

which, written out, would look like

MPI_TYPE_INDEXED( 1024^3, blocklength=(/8, 8, 8, ..., 8/), map=(/0, 8, 16, 24, ..., 8589934592/), MPI_INTEGER, fileview_hexa_conn, ierr)

Would this be a correct way to create the new datatype fileview_hexa_conn? In
that call blocklength would be a 4-byte integer array and map would be an
8-byte integer array. To be clear, the code currently has both map and
blocklength as 4-byte integer arrays.

call MPI_TYPE_COMMIT(fileview_hexa_conn,ierr)
deallocate(map,blocklength)
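
In case it helps frame questions 1-3 below: one possible alternative (please
correct me if this is misguided), instead of passing an 8-byte map to
MPI_TYPE_INDEXED, would be MPI_TYPE_CREATE_HINDEXED, which takes byte
displacements of kind MPI_ADDRESS_KIND, so the large displacements never have
to fit into a default integer. An untested sketch that would replace the map*8
construction above (blen, bytedisp, i and intsize are placeholder names; the
rest comes from the code above):

integer :: i, intsize
integer, dimension(:), allocatable :: blen
integer(kind=MPI_ADDRESS_KIND), dimension(:), allocatable :: bytedisp

call MPI_TYPE_SIZE(MPI_INTEGER,intsize,ierr)   ! 4 bytes here, I believe
allocate(blen(ncells_hexa_),bytedisp(ncells_hexa_))
blen = 8
do i = 1,ncells_hexa_
   ! byte offset of cell i in the connectivity record:
   ! 8 node indices per hexa cell, intsize bytes per index
   bytedisp(i) = int(hexa_(i)-1,MPI_ADDRESS_KIND)*8_MPI_ADDRESS_KIND*intsize
end do
call MPI_TYPE_CREATE_HINDEXED(ncells_hexa_,blen,bytedisp, &
     MPI_INTEGER,fileview_hexa_conn,ierr)
call MPI_TYPE_COMMIT(fileview_hexa_conn,ierr)
deallocate(blen,bytedisp)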

....

disp = disp+84
call MPI_FILE_SET_VIEW(iunit,disp,MPI_INTEGER,fileview_hexa,"native",MPI_INFO_NULL,ierr)
call MPI_FILE_WRITE_ALL(iunit,hexa_,ncells_hexa_,MPI_INTEGER,status,ierr)

I believe this could be wrong as well, since fileview_hexa (built from
MPI_REAL_SP) is used here to write the 4-byte integer hexa labeling. But, as
you said, MPI_REAL_SP and MPI_INTEGER are both 4 bytes, so it may be fine; it
has not caused any problems so far.

disp = disp+4*ncells_hexa
call MPI_FILE_SET_VIEW(iunit,disp,MPI_INTEGER,fileview_hexa_conn,"native",MPI_INFO_NULL,ierr)
size = 8*ncells_hexa_
call MPI_FILE_WRITE_ALL(iunit,conn_hexa,size,MPI_INTEGER,status,ierr)
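
One more thought on the size = 8*ncells_hexa_ count above: since the count
argument of MPI_FILE_WRITE_ALL is a default integer in the Fortran bindings I
am using, another option might be to write ncells_hexa_ elements of a small
contiguous type covering the 8 node indices of one cell, so that the count
itself stays well within 4 bytes. Untested sketch (cellconn_type is a
placeholder name):

integer :: cellconn_type

! one hexa cell's connectivity = 8 consecutive 4-byte integers
call MPI_TYPE_CONTIGUOUS(8,MPI_INTEGER,cellconn_type,ierr)
call MPI_TYPE_COMMIT(cellconn_type,ierr)
call MPI_FILE_WRITE_ALL(iunit,conn_hexa,ncells_hexa_,cellconn_type,status,ierr)
call MPI_TYPE_FREE(cellconn_type,ierr)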


Hopefully that is enough information about the issue. Now for my questions:

   1. Does this implementation look correct?
   2. What kind should fileview_hexa and fileview_hexa_conn be?
   3. Is it okay that map and blocklength are different integer kinds?
   4. What exactly does the blocklength parameter specify? I played with it
   some, and changing the blocklength did not seem to change anything.

Thanks for the help.

-Dominic Kedelty

On Wed, Mar 16, 2016 at 12:02 AM, Gilles Gouaillardet <gil...@rist.or.jp>
wrote:

> Dominic,
>
> at first, you might try to add
> call MPI_Barrier(comm,ierr)
> between
>
>   if (file_is_there .and. irank.eq.iroot) call MPI_FILE_DELETE(file,MPI_INFO_NULL,ierr)
>
> and
>
>   call MPI_FILE_OPEN(comm,file,IOR(MPI_MODE_WRONLY,MPI_MODE_CREATE),MPI_INFO_NULL,iunit,ierr)
>
> /* there might be a race condition, i am not sure about that */
>
>
> fwiw, the
>
> STOP A configuration file is required
>
> error message is not coming from OpenMPI.
> it might be indirectly triggered by an ompio bug/limitation, or even a bug
> in your application.
> did you get your application to work with another flavor of OpenMPI ?
> e.g. are you reporting an OpenMPI bug ?
> or are you asking some help with your application (the bug could either be
> in your code or in OpenMPI, and you do not know for sure)
>
> i am a bit surprised you are using the same fileview_node type with both
> MPI_INTEGER and MPI_REAL_SP, but since they should be the same size, that
> might not be an issue.
>
> the subroutine depends on too many external parameters
> (nnodes_, fileview_node, ncells_hexa, ncells_hexa_, unstr2str, ...)
> so writing a simple reproducer might not be trivial.
>
> i recommend you first write a self contained program that can be evidenced
> to reproduce the issue,
> and then i will investigate that. for that, you might want to dump the
> array sizes and the description of fileview_node in your application, and
> then hard code them into your self contained program.
> also how many nodes/tasks are you running and what filesystem are you
> running on ?
>
> Cheers,
>
> Gilles
>
>
> On 3/16/2016 3:05 PM, Dominic Kedelty wrote:
>
> Gilles,
>
> I do not have the latest mpich available. I tested using openmpi version
> 1.8.7 as well as mvapich2 version 1.9. Both produced similar errors. I
> tried the mca flag that you had provided and it is telling me that a
> configuration file is needed.
>
> all processes return:
>
> STOP A configuration file is required
>
> I am attaching the subroutine of the code that I believe is where the
> problem is occurring.
>
>
>
> On Mon, Mar 14, 2016 at 6:25 PM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
>> Dominic,
>>
>> this is a ROMIO error message, and ROMIO is from MPICH project.
>> at first, I recommend you try the same test with the latest mpich, in
>> order to check
>> whether the bug is indeed from romio, and has been fixed in the latest
>> release.
>> (ompi is a few versions behind the latest romio)
>>
>> would you be able to post a trimmed version of your application that
>> evidences the test ?
>> that will be helpful to understand what is going on.
>>
>> you might also want to give a try to
>> mpirun --mca io ompio ...
>> and see whether this helps.
>> that being said, I think ompio is not considered as production ready on
>> the v1.10 series of ompi
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> On Tuesday, March 15, 2016, Dominic Kedelty <dkede...@asu.edu> wrote:
>>
>>> I am getting the following error using openmpi and I am wondering if
>>> anyone would have a clue as to why it is happening. It is an error coming
>>> from openmpi.
>>>
>>> Error in ADIOI_Calc_aggregator(): rank_index(40) >= fd->hints->cb_nodes
>>> (40) fd_size=213909504 off=8617247540
>>> Error in ADIOI_Calc_aggregator(): rank_index(40) >= fd->hints->cb_nodes
>>> (40) fd_size=213909504 off=8617247540
>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 157
>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 477
>>>
>>> Any help would be appreciated. Thanks.
>>>
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2016/03/18697.php
>>
>
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2016/03/18700.php
>
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/03/18701.php
>
