On Mon, Nov 24, 2014 at 1:10 PM, Håkon Strandenes <[email protected]> wrote:
> Hi,
>
> I have some problems with PETSc and HDF5 VecLoad/VecView. The VecLoad
> problems can rest for now, but the VecView problems are more serious.
>
> In short: I have a 3D DMDA and some vectors that I want to save to an
> HDF5 file. This works perfectly on my workstation, but not on the compute
> cluster I have access to. I have attached a typical error message.
>
> I have also attached a piece of code that can trigger the error. The code
> is merely a 2D->3D rewrite of DMDA ex 10
> (http://www.mcs.anl.gov/petsc/petsc-current/src/dm/examples/tutorials/ex10.c.html),
> nothing else is done.
>
> The program typically works on a small number of processes. I have
> successfully executed the attached program on up to 32 processes. That
> works. Always. I have never had a single success when trying to run on 64
> processes. Always the same error.
>
> The computer I am struggling with is an SGI machine with SLES 11sp1 and
> Intel CPUs, hence I have used Intel's compilers. I have tried the 2013,
> 2014 and 2015 versions of the compilers, so that's probably not the cause.
> I have also tried GCC 4.9.1, just to be safe; same error there. The same
> compiler is used for both HDF5 and PETSc. The same error message occurs
> for both debug and release builds. I have tried HDF5 versions 1.8.11 and
> 1.8.13. I have tried PETSc version 3.4.1 and the latest from Git. The MPI
> implementation on the machine is SGI's MPT, and I have tried both 2.06 and
> 2.10. Always the same error. Other MPI implementations are unfortunately
> not available.
>
> What really drives me mad is that this works like a charm on my
> workstation with Linux Mint... I have successfully executed the attached
> example on 254 processes (my machine breaks down if I try anything more
> than that).
>
> Do any of you have any tips on how to attack this problem and find out
> what's wrong?

This does sound like a pain to track down. It seems to be complaining about
an MPI datatype:

  #005: H5Dmpio.c line 998 in H5D__link_chunk_collective_io(): MPI_Type_struct failed
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
  #006: H5Dmpio.c line 998 in H5D__link_chunk_collective_io(): Invalid datatype argument
    major: Internal error (too specific to document in detail)
    minor: MPI Error String

In this call, we pass in 'scalartype', which is H5T_NATIVE_DOUBLE (unless
you configured for single precision). This was used successfully to create
the dataspace, so it is unlikely to be the problem. I am guessing that HDF5
creates internal MPI datatypes to use in the MPI-IO routines (maybe using
MPI_Type_struct). I believe we have seen type creation routines fail in
some MPI implementations if you try to create too many of them.

Right now, this looks a lot like a bug in MPT, although it might be an HDF5
bug with forgetting to release MPI types that it does not need.

  Thanks,

     Matt

> Regards,
> Håkon Strandenes

--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
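
[Editorial note: for readers without the attachment, a minimal sketch of the
kind of reproducer described above -- a 3D DMDA whose global vector is
written with VecView through an HDF5 viewer. This is written against the
PETSc 3.5/master API (PETSc 3.4 uses DMDA_BOUNDARY_NONE instead of
DM_BOUNDARY_NONE); the grid sizes, vector name, and file name are
placeholders, and the actual attached example may differ in detail.]

#include <petscdmda.h>
#include <petscviewerhdf5.h>

int main(int argc, char **argv)
{
  DM             da;
  Vec            x;
  PetscViewer    viewer;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);CHKERRQ(ierr);

  /* 3D DMDA with 1 dof per grid point; sizes are placeholders */
  ierr = DMDACreate3d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                      DM_BOUNDARY_NONE, DMDA_STENCIL_STAR, 64, 64, 64,
                      PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE,
                      1, 1, NULL, NULL, NULL, &da);CHKERRQ(ierr);
  ierr = DMCreateGlobalVector(da, &x);CHKERRQ(ierr);
  ierr = PetscObjectSetName((PetscObject)x, "testvec");CHKERRQ(ierr);
  ierr = VecSet(x, 1.0);CHKERRQ(ierr);

  /* Collective write of the vector to an HDF5 file; this is the VecView
     call that reportedly fails on 64 processes */
  ierr = PetscViewerHDF5Open(PETSC_COMM_WORLD, "test.h5", FILE_MODE_WRITE,
                             &viewer);CHKERRQ(ierr);
  ierr = VecView(x, viewer);CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);

  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = DMDestroy(&da);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}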
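
[Editorial note: a small standalone sketch, not part of the original thread,
of one way to probe the hypothesis that the MPI implementation fails once
too many derived datatypes have been created. It repeatedly builds and
commits struct datatypes without freeing them and reports the first failure;
the loop count of 100000 and the struct layout are arbitrary assumptions.
MPI_Type_create_struct is the non-deprecated equivalent of the
MPI_Type_struct named in the HDF5 error trace.]

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  /* Return error codes instead of aborting, so the failure is observable */
  MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
  MPI_Comm_set_errhandler(MPI_COMM_SELF, MPI_ERRORS_RETURN);

  int          blocklens[2] = {1, 1};
  MPI_Aint     displs[2]    = {0, sizeof(double)};
  MPI_Datatype types[2]     = {MPI_DOUBLE, MPI_INT};

  for (int i = 0; i < 100000; i++) {
    MPI_Datatype newtype;
    int err = MPI_Type_create_struct(2, blocklens, displs, types, &newtype);
    if (err == MPI_SUCCESS) err = MPI_Type_commit(&newtype);
    if (err != MPI_SUCCESS) {
      char msg[MPI_MAX_ERROR_STRING]; int len;
      MPI_Error_string(err, msg, &len);
      printf("Type creation failed after %d datatypes: %s\n", i, msg);
      break;
    }
    /* Intentionally no MPI_Type_free(&newtype): we are looking for a
       limit on the number of live derived datatypes */
  }

  MPI_Finalize();
  return 0;
}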
