https://bitbucket.org/petsc/petsc/pull-requests/1551/chunksize-could-overflow-and-become/diff

With this fix I can run with your vector size on 1 process. With 2 processes I 
get

$ petscmpiexec -n 2 ./ex1 
Assertion failed in file adio/common/ad_write_coll.c at line 904: 
(curr_to_proc[p] + len - done_to_proc[p]) == (unsigned) (curr_to_proc[p] + len 
- done_to_proc[p])
0   libpmpi.0.dylib                     0x0000000111241f3e backtrace_libc + 62
1   libpmpi.0.dylib                     0x0000000111241ef5 MPL_backtrace_show + 
21
2   libpmpi.0.dylib                     0x000000011119f85a MPIR_Assert_fail + 90
3   libpmpi.0.dylib                     0x00000001111a15f3 MPIR_Ext_assert_fail 
+ 35
4   libmpi.0.dylib                      0x0000000110eee16e 
ADIOI_Fill_send_buffer + 1134
5   libmpi.0.dylib                      0x0000000110eefe74 
ADIOI_W_Exchange_data + 2980
6   libmpi.0.dylib                      0x0000000110eed7ad ADIOI_Exch_and_write 
+ 3197
7   libmpi.0.dylib                      0x0000000110eec854 
ADIOI_GEN_WriteStridedColl + 2004
8   libpmpi.0.dylib                     0x000000011128ad4b MPIOI_File_write_all 
+ 1179
9   libmpi.0.dylib                      0x0000000110ec382b 
MPI_File_write_at_all + 91
10  libhdf5.10.dylib                    0x00000001108b982a H5FD_mpio_write + 
1466
11  libhdf5.10.dylib                    0x00000001108b127a H5FD_write + 634
12  li

This looks like an int overflow in MPI-IO. (It is scary to see ints rather 
than 64 bit integers in the ADIO code, but I guess somehow it works; maybe 
this is a strange corner case. I don't know whether the problem is in HDF5 
or in MPI-IO.) 
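
For illustration only, the failed assertion above has the shape
x == (unsigned) x, which holds only when x fits in an unsigned 32 bit
value. The sketch below emulates that check in Python. It assumes the
left-hand expression is a 64 bit integer in the C code, and the name
fits_unsigned32 is made up for this example; it is not part of ROMIO,
HDF5, or PETSc.

```python
def fits_unsigned32(x):
    """Emulate the C test `x == (unsigned) x` for a 64 bit integer x.

    The cast keeps only the low 32 bits, so the comparison is true
    exactly when x lies in the range 0 .. 2**32 - 1.
    """
    return x == (x & 0xFFFFFFFF)


# Hypothetical byte counts, chosen only to show both outcomes.
print(fits_unsigned32(2**32 - 1))  # largest value that still passes
print(fits_unsigned32(2**32))      # one more no longer passes
print(fits_unsigned32(-1))         # negative values do not pass either
```

Any length that reaches 4 GiB (2**32 bytes) or goes negative would trip
a check of this kind.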

On 4 and 8 processes it runs. 

Note that you are playing with a very dangerous size. 32768 * 32768 * 2 is 
2^31, one more than the largest 32 bit int, so it comes out as a negative 
number in int. So this is essentially the largest problem you can run before 
switching to 64 bit indices for PETSc. 
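
The arithmetic above can be checked with a short sketch. Python integers
do not overflow, so the 32 bit wraparound is emulated by hand; the helper
wrap_int32 is a made-up name for this example, and it assumes the usual
two's complement behaviour (in C itself, signed overflow is formally
undefined).

```python
INT32_MAX = 2**31 - 1  # largest value a 32 bit signed int can hold


def wrap_int32(value):
    """Emulate two's complement wraparound of a 32 bit signed int."""
    return (value + 2**31) % 2**32 - 2**31


n = 32768 * 32768  # global vector length from the thread

print(n, n <= INT32_MAX)          # 1073741824 True
print(2 * n, 2 * n <= INT32_MAX)  # 2147483648 False
print(wrap_int32(2 * n))          # -2147483648
```

So the vector length itself still fits in a 32 bit int, but doubling it
does not.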

  Barry



> On Apr 16, 2019, at 9:32 AM, Sajid Ali via petsc-users 
> <petsc-users@mcs.anl.gov> wrote:
> 
> Hi PETSc developers,
> 
> I’m trying to write a large vector created with VecCreateMPI (size 
> 32768x32768) concurrently from 4 nodes (+32 tasks per node, total 128 
> mpi-ranks) and I see the following (indicative) error : [Full error log is 
> here : https://file.io/CdjUfe] 
> 
> HDF5-DIAG: Error detected in HDF5 (1.10.5) MPI-process 52:
>   #000: H5D.c line 145 in H5Dcreate2(): unable to create dataset
>     major: Dataset
>     minor: Unable to initialize object
>   #001: H5Dint.c line 329 in H5D__create_named(): unable to create and link 
> to dataset
>     major: Dataset
>     minor: Unable to initialize object
>   #002: H5L.c line 1557 in H5L_link_object(): unable to create new link to 
> object
>     major: Links
>     minor: Unable to initialize object
>   #003: H5L.c line 1798 in H5L__create_real(): can't insert link
>     major: Links
>     minor: Unable to insert object
>   #004: H5Gtraverse.c line 851 in H5G_traverse(): internal path traversal 
> failed
>     major: Symbol table
> HDF5-DIAG: Error detected in HDF5 (1.10.5) MPI-process 59:                    
>           
>   #000: H5D.c line 145 in H5Dcreate2(): unable to create dataset              
>           
>     major: Dataset                                                            
>           
>     minor: Unable to initialize object                                        
>           
>   #001: H5Dint.c line 329 in H5D__create_named(): unable to create and link 
> to dataset  
>     major: Dataset                                                            
>           
>     minor: Unable to initialize object                                        
>           
>   #002: H5L.c line 1557 in H5L_link_object(): unable to create new link to 
> object       
>     major: Links                                                              
>           
>     minor: Unable to initialize object                                        
>           
>   #003: H5L.c line 1798 in H5L__create_real(): can't insert link              
>           
>     major: Links                                                              
>           
>     minor: Unable to insert object                                            
>           
>   #004: H5Gtraverse.c line 851 in H5G_traverse(): internal path traversal 
> failed        
>     major: Symbol table                                                       
>           
>     minor: Object not found                                                   
>           
>   #005: H5Gtraverse.c line 627 in H5G__traverse_real(): traversal operator 
> failed       
>     major: Symbol table                                                       
>           
>     minor: Callback failed                                                    
>           
>   #006: H5L.c line 1604 in H5L__link_cb(): unable to create object            
>           
>     major: Links                                                              
>           
>     minor: Unable to initialize object                                        
>           
>   #007: H5Oint.c line 2453 in H5O_obj_create(): unable to open object         
>           
>     major: Object header                                                      
>           
>     minor: Can't open object                                                  
>           
>   #008: H5Doh.c line 300 in H5O__dset_create(): unable to create dataset      
>           
>     minor: Object not found                                                   
>           
>   #005: H5Gtraverse.c line 627 in H5G__traverse_real(): traversal operator 
> failed       
>     major: Symbol table                                                       
>           
>     minor: Callback failed                                                    
>           
>   #006: H5L.c line 1604 in H5L__link_cb(): unable to create object            
>           
>     major: Links                                                              
>           
>     minor: Unable to initialize object                                        
>           
>   #007: H5Oint.c line 2453 in H5O_obj_create(): unable to open object         
>           
>     major: Object header                                                      
>           
>     minor: Can't open object                                                  
>           
>   #008: H5Doh.c line 300 in H5O__dset_create(): unable to create dataset      
>           
>     major: Dataset                                                            
>           
>     minor: Unable to initialize object                                        
>           
>   #009: H5Dint.c line 1274 in H5D__create(): unable to construct layout 
> information
>     major: Dataset
>     minor: Unable to initialize object
>   #010: H5Dchunk.c line 872 in H5D__chunk_construct(): unable to set chunk 
> sizes
>     major: Dataset
>     minor: Bad value
>   #011: H5Dchunk.c line 831 in H5D__chunk_set_sizes(): chunk size must be < 
> 4GB
>     major: Dataset
>     minor: Unable to initialize object
>     major: Dataset
>     minor: Unable to initialize object
>   #009: H5Dint.c line 1274 in H5D__create(): unable to construct layout 
> information
>     major: Dataset
>     minor: Unable to initialize object
>   #010: H5Dchunk.c line 872 in H5D__chunk_construct(): unable to set chunk 
> sizes
>     major: Dataset
>     minor: Bad value
>   #011: H5Dchunk.c line 831 in H5D__chunk_set_sizes(): chunk size must be < 
> 4GB
>     major: Dataset
>     minor: Unable to initialize object
> .......
> 
> I spoke to Barry last evening who said that this is a known error that was 
> fixed for DMDA vecs but is broken for non-dmda vecs.
> 
> Could this be fixed ?
> 
> 
> Thank You, 
> Sajid Ali
> Applied Physics
> Northwestern University
