Hi Gilles,

I believe I have found the problem. Initially I thought it might be an MPI issue, since the failure occurs inside an MPI function. However, I am now sure that the problem is an overflow of 4-byte signed integers.
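For reference, a quick check of the limits involved (assuming the default INTEGER is 4 bytes, as it is here; the 8*1024^3 displacement is explained below):

  program int_limits
    implicit none
    integer         :: i4   ! default 4-byte integer
    integer(kind=8) :: i8
    print *, 'largest 4-byte signed integer:', huge(i4)          ! 2147483647
    print *, 'largest 8-byte signed integer:', huge(i8)          ! 9223372036854775807
    print *, 'largest displacement needed:  ', 8_8 * 1024_8**3   ! 8589934592
  end program int_limits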
I am dealing with computational domains that have a little more than a billion cells (1024^3 cells), which is still within the range of a 4-byte signed integer. The area where I am running into the problem is below; I have shortened the code:

  ! Fileviews
  integer :: fileview_hexa
  integer :: fileview_hexa_conn
  integer, dimension(:), allocatable :: blocklength
  integer, dimension(:), allocatable :: map
  integer(KIND=8) :: size
  integer(KIND=MPI_OFFSET_KIND) :: disp   ! MPI_OFFSET_KIND seems to be 8 bytes

  allocate(map(ncells_hexa_),blocklength(ncells_hexa_))
  map = hexa_-1

hexa_ is a 4-byte integer array that labels the local hexa elements on a given rank. In my current code its maximum value is 1024^3 and its minimum is 1.

  blocklength = 1
  call MPI_TYPE_INDEXED(ncells_hexa_,blocklength,map,MPI_REAL_SP,fileview_hexa,ierr)

MPI_REAL_SP is used for the 4-byte scalar data that will be written to the file (e.g. a temperature scalar stored at a given hexa cell).

  call MPI_TYPE_COMMIT(fileview_hexa,ierr)

  map = map * 8

Here is where the problems arise. map is multiplied by 8 because the hexa cell node connectivity (8 node indices per cell) needs to be written. The node numbers written to the file only need to be 4 bytes, since the maximum node number still fits in a 4-byte signed integer. But because I have to map displacements up to 8*1024^3, map needs to be integer(kind=8).

  blocklength = 8
  call MPI_TYPE_INDEXED(ncells_hexa_,blocklength,map,MPI_INTEGER,fileview_hexa_conn,ierr)

So the call I would like to make is effectively

  call MPI_TYPE_INDEXED(1024^3, blocklength=(/8, 8, 8, ..., 8/), map=(/0, 8, 16, 24, ..., 8589934592/), &
                        MPI_INTEGER, fileview_hexa_conn, ierr)

Would this be a correct way to declare the new datatype fileview_hexa_conn? In this call, blocklength would be a 4-byte integer array and map would be an 8-byte integer array. To be clear, the code currently has both map and blocklength as 4-byte integer arrays.

  call MPI_TYPE_COMMIT(fileview_hexa_conn,ierr)
  deallocate(map,blocklength)
  ....
  disp = disp+84
  call MPI_FILE_SET_VIEW(iunit,disp,MPI_INTEGER,fileview_hexa,"native",MPI_INFO_NULL,ierr)
  call MPI_FILE_WRITE_ALL(iunit,hexa_,ncells_hexa_,MPI_INTEGER,status,ierr)

I believe this could be wrong as well, because fileview_hexa (built on MPI_REAL_SP) is being used to write the 4-byte integer hexa labels, but as you said, MPI_REAL_SP and MPI_INTEGER should both be 4 bytes, so it may be fine. It has not given any problems so far.

  disp = disp+4*ncells_hexa
  call MPI_FILE_SET_VIEW(iunit,disp,MPI_INTEGER,fileview_hexa_conn,"native",MPI_INFO_NULL,ierr)
  size = 8*ncells_hexa_
  call MPI_FILE_WRITE_ALL(iunit,conn_hexa,size,MPI_INTEGER,status,ierr)

Hopefully that is enough information about the issue.
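One variant that might be relevant here (sketched below, untested): MPI_TYPE_CREATE_HINDEXED takes its displacement array as INTEGER(KIND=MPI_ADDRESS_KIND), i.e. 8 bytes here, with the displacements expressed in bytes rather than in multiples of the old type, so the large offsets never have to pass through a default 4-byte integer. In the sketch, disp_bytes and nbytes_int are names I made up for illustration; ncells_hexa_, hexa_, and the assumption that MPI_INTEGER is 4 bytes are taken from the code above.

  ! Sketch only: describe the connectivity fileview with byte displacements
  integer :: i, ierr, fileview_hexa_conn
  integer, dimension(:), allocatable :: blocklength
  integer(kind=MPI_ADDRESS_KIND), dimension(:), allocatable :: disp_bytes
  integer(kind=MPI_ADDRESS_KIND), parameter :: nbytes_int = 4   ! assumes a 4-byte MPI_INTEGER

  allocate(blocklength(ncells_hexa_), disp_bytes(ncells_hexa_))
  blocklength = 8                            ! 8 node indices per hexa cell
  do i = 1, ncells_hexa_
     ! byte offset of this cell's connectivity block, computed in 8-byte arithmetic
     disp_bytes(i) = (int(hexa_(i), MPI_ADDRESS_KIND) - 1) * 8 * nbytes_int
  end do

  call MPI_TYPE_CREATE_HINDEXED(ncells_hexa_, blocklength, disp_bytes, &
                                MPI_INTEGER, fileview_hexa_conn, ierr)
  call MPI_TYPE_COMMIT(fileview_hexa_conn, ierr)
  deallocate(blocklength, disp_bytes)

I have not checked how this interacts with MPI_FILE_SET_VIEW and the rest of the write path, so it is only meant to illustrate the 8-byte-displacement alternative, not to claim it is the fix.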
Now my questions:

1. Does the implementation above look correct?
2. What kind should fileview_hexa and fileview_hexa_conn be?
3. Is it okay that map and blocklength are different integer kinds?
4. What does the blocklength parameter specify exactly? I played with it a little, and changing the blocklength did not seem to change anything.

Thanks for the help.

-Dominic Kedelty

On Wed, Mar 16, 2016 at 12:02 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

> Dominic,
>
> at first, you might try to add
>
>   call MPI_Barrier(comm,ierr)
>
> between
>
>   if (file_is_there .and. irank.eq.iroot) call MPI_FILE_DELETE(file,MPI_INFO_NULL,ierr)
>
> and
>
>   call MPI_FILE_OPEN(comm,file,IOR(MPI_MODE_WRONLY,MPI_MODE_CREATE),MPI_INFO_NULL,iunit,ierr)
>
> /* there might be a race condition, i am not sure about that */
>
> fwiw, the
>
>   STOP A configuration file is required
>
> error message is not coming from OpenMPI. it might be indirectly triggered by an
> ompio bug/limitation, or even a bug in your application.
> did you get your application to work with another flavor of OpenMPI?
> e.g. are you reporting an OpenMPI bug? or are you asking for help with your
> application (the bug could be either in your code or in OpenMPI, and you do not
> know for sure)?
>
> i am a bit surprised you are using the same fileview_node type with both MPI_INTEGER
> and MPI_REAL_SP, but since they should be the same size, that might not be an issue.
>
> the subroutine depends on too many external parameters (nnodes_, fileview_node,
> ncells_hexa, ncells_hexa_, unstr2str, ...), so writing a simple reproducer might
> not be trivial.
>
> i recommend you first write a self-contained program that can be shown to reproduce
> the issue, and then i will investigate it. for that, you might want to dump the array
> sizes and the description of fileview_node in your application, and then hard-code
> them into your self-contained program.
> also, how many nodes/tasks are you running, and what filesystem are you running on?
>
> Cheers,
>
> Gilles
>
> On 3/16/2016 3:05 PM, Dominic Kedelty wrote:
>
> Gilles,
>
> I do not have the latest mpich available. I tested using openmpi version 1.8.7 as
> well as mvapich2 version 1.9; both produced similar errors. I tried the mca flag
> that you had provided, and it is telling me that a configuration file is needed.
>
> all processes return:
>
>   STOP A configuration file is required
>
> I am attaching the subroutine of the code where I believe the problem is occurring.
>
> On Mon, Mar 14, 2016 at 6:25 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>
>> Dominic,
>>
>> this is a ROMIO error message, and ROMIO is from the MPICH project.
>> at first, I recommend you try the same test with the latest mpich, in order to
>> check whether the bug is indeed from romio, and has been fixed in the latest
>> release. (ompi is a few versions behind the latest romio)
>>
>> would you be able to post a trimmed version of your application that evidences
>> the issue? that will be helpful to understand what is going on.
>>
>> you might also want to give a try to
>>
>>   mpirun --mca io ompio ...
>>
>> and see whether this helps. that being said, I think ompio is not considered
>> production ready on the v1.10 series of ompi.
>>
>> Cheers,
>>
>> Gilles
>>
>> On Tuesday, March 15, 2016, Dominic Kedelty <dkede...@asu.edu> wrote:
>>
>>> I am getting the following error using openmpi, and I am wondering if anyone
>>> would have a clue as to why it is happening. It is an error coming from openmpi.
>>>
>>> Error in ADIOI_Calc_aggregator(): rank_index(40) >= fd->hints->cb_nodes (40) fd_size=213909504 off=8617247540
>>> Error in ADIOI_Calc_aggregator(): rank_index(40) >= fd->hints->cb_nodes (40) fd_size=213909504 off=8617247540
>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 157
>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 477
>>>
>>> Any help would be appreciated. Thanks.