On 15/07/21 17:44, Matteo Semplice wrote:
Hi.
When I write (with the HDF5 viewer) a vector associated with a DMDA with 1 dof,
the output is independent of the number of CPUs used.
However, for a DMDA with dof=2, the output is correct when I
run on 1 or 2 CPUs, but is scrambled when I run with 4 CPUs. Judging
from the ranges of the data, each field gets written to the correct
part of the file, and it's the data within each field that is scrambled. Here's my MWE:
#include <petscversion.h>
#include <petscdmda.h>
#include <petscviewer.h>
#include <petscsys.h>
#include <petscviewerhdf5.h>

static char help[] = "Writes the two fields of a dof=2 DMDA Vec to HDF5.\n";

int main(int argc, char **argv) {
  PetscErrorCode ierr;
  ierr = PetscInitialize(&argc,&argv,(char*)0,help); CHKERRQ(ierr);
  PetscInt Nx=11;
  PetscInt Ny=11;
  PetscScalar dx = 1.0 / (Nx-1);
  PetscScalar dy = 1.0 / (Ny-1);
  DM dmda;
  ierr = DMDACreate2d(PETSC_COMM_WORLD,
                      DM_BOUNDARY_NONE,DM_BOUNDARY_NONE,
                      DMDA_STENCIL_STAR,
                      Nx,Ny,                     //global dim
                      PETSC_DECIDE,PETSC_DECIDE, //n proc on each dim
                      2,1,                       //dof, stencil width
                      NULL,NULL,                 //n nodes per direction on each cpu
                      &dmda); CHKERRQ(ierr);
  ierr = DMSetFromOptions(dmda); CHKERRQ(ierr);
  ierr = DMSetUp(dmda); CHKERRQ(ierr);
  ierr = DMDASetUniformCoordinates(dmda, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0); CHKERRQ(ierr);
  ierr = DMDASetFieldName(dmda,0,"s"); CHKERRQ(ierr);
  ierr = DMDASetFieldName(dmda,1,"c"); CHKERRQ(ierr);
  DMDALocalInfo daInfo;
  ierr = DMDAGetLocalInfo(dmda,&daInfo); CHKERRQ(ierr);
  IS *is;
  DM *daField;
  ierr = DMCreateFieldDecomposition(dmda,NULL,NULL,&is,&daField); CHKERRQ(ierr);
  Vec U0;
  ierr = DMCreateGlobalVector(dmda,&U0); CHKERRQ(ierr);
  //Initial data
  typedef struct{ PetscScalar s,c;} data_type;
  data_type **u;
  ierr = DMDAVecGetArray(dmda,U0,&u); CHKERRQ(ierr);
  for (PetscInt j=daInfo.ys; j<daInfo.ys+daInfo.ym; j++){
    PetscScalar y = j*dy;
    for (PetscInt i=daInfo.xs; i<daInfo.xs+daInfo.xm; i++){
      PetscScalar x = i*dx;
      u[j][i].s = x+2.*y;
      u[j][i].c = 10. + 2.*x*x+y*y;
    }
  }
  ierr = DMDAVecRestoreArray(dmda,U0,&u); CHKERRQ(ierr);
  PetscViewer viewer;
  ierr = PetscViewerHDF5Open(PETSC_COMM_WORLD,"solutionSC.hdf5",FILE_MODE_WRITE,&viewer); CHKERRQ(ierr);
  //Write each field to its own HDF5 dataset via a subvector
  Vec uField;
  ierr = VecGetSubVector(U0,is[0],&uField); CHKERRQ(ierr);
  ierr = PetscObjectSetName((PetscObject) uField, "S"); CHKERRQ(ierr);
  ierr = VecView(uField,viewer); CHKERRQ(ierr);
  ierr = VecRestoreSubVector(U0,is[0],&uField); CHKERRQ(ierr);
  ierr = VecGetSubVector(U0,is[1],&uField); CHKERRQ(ierr);
  ierr = PetscObjectSetName((PetscObject) uField, "C"); CHKERRQ(ierr);
  ierr = VecView(uField,viewer); CHKERRQ(ierr);
  ierr = VecRestoreSubVector(U0,is[1],&uField); CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&viewer); CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}
and here is my xdmf file:
<?xml version="1.0" ?>
<Xdmf xmlns:xi="http://www.w3.org/2001/XInclude" Version="2.0">
  <Domain>
    <Grid GridType="Collection" CollectionType="Temporal">
      <Time TimeType="List">
        <DataItem Dimensions="1">1.0</DataItem>
      </Time>
      <Grid GridType="Uniform" Name="domain">
        <Topology TopologyType="2DCoRectMesh" Dimensions="11 11"/>
        <Geometry GeometryType="ORIGIN_DXDY">
          <DataItem Format="XML" NumberType="Float" Dimensions="2">0.0 0.0</DataItem>
          <DataItem Format="XML" NumberType="Float" Dimensions="2">0.1 0.1</DataItem>
        </Geometry>
        <Attribute Name="S" Center="Node" AttributeType="Scalar">
          <DataItem Format="HDF" Precision="8" Dimensions="11 11">solutionSC.hdf5:/S</DataItem>
        </Attribute>
        <Attribute Name="C" Center="Node" AttributeType="Scalar">
          <DataItem Format="HDF" Precision="8" Dimensions="11 11">solutionSC.hdf5:/C</DataItem>
        </Attribute>
      </Grid>
    </Grid>
  </Domain>
</Xdmf>
Steps to reproduce: run the code and open the xdmf with ParaView. If the
code was run with 1, 2 or 3 CPUs, the data are correct (except that the
xy plane has become the yz plane), but with 4 CPUs the data are
scrambled.
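In case it helps narrow things down, one further control I could add to the MWE is to write the whole two-dof vector as a single dataset, without going through the subvectors, and then inspect that dataset directly. An untested sketch, to be placed just before the PetscViewerDestroy call and reusing the names from the MWE (the dataset name "U0" is arbitrary):

  //Control: write the full dof=2 vector in one go, bypassing VecGetSubVector
  ierr = PetscObjectSetName((PetscObject) U0, "U0"); CHKERRQ(ierr);
  ierr = VecView(U0,viewer); CHKERRQ(ierr);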
Does anyone have any insight?
(I am using PETSc Release Version 3.14.2, but I can compile a newer
one if you think it's important.)
Hi,
I have a small update on this issue.
First, it is still here with version 3.15.2.
Secondly, I have run the code under valgrind and
- for 1 or 2 processes, I get no errors;
- for 4 processes, 3 out of the 4 processes trigger the following:
==25921== Conditional jump or move depends on uninitialised value(s)
==25921== at 0xB3D6259: ??? (in
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_fcoll_two_phase.so)
==25921== by 0xB3D85C8: mca_fcoll_two_phase_file_write_all (in
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_fcoll_two_phase.so)
==25921== by 0xAAEB29B: mca_common_ompio_file_write_at_all (in
/usr/lib/x86_64-linux-gnu/openmpi/lib/libmca_common_ompio.so.41.9.0)
==25921== by 0xB316605: mca_io_ompio_file_write_at_all (in
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_io_ompio.so)
==25921== by 0x73C7FE7: PMPI_File_write_at_all (in
/usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so.40.10.3)
==25921== by 0x69E8700: H5FD__mpio_write (H5FDmpio.c:1466)
==25921== by 0x670D6EB: H5FD_write (H5FDint.c:248)
==25921== by 0x66DA0D3: H5F__accum_write (H5Faccum.c:826)
==25921== by 0x684F091: H5PB_write (H5PB.c:1031)
==25921== by 0x66E8055: H5F_shared_block_write (H5Fio.c:205)
==25921== by 0x6674538: H5D__chunk_collective_fill (H5Dchunk.c:5064)
==25921== by 0x6674538: H5D__chunk_allocate (H5Dchunk.c:4736)
==25921== by 0x668C839: H5D__init_storage (H5Dint.c:2473)
==25921== Uninitialised value was created by a heap allocation
==25921== at 0x483577F: malloc (vg_replace_malloc.c:299)
==25921== by 0xB3D6155: ??? (in
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_fcoll_two_phase.so)
==25921== by 0xB3D85C8: mca_fcoll_two_phase_file_write_all (in
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_fcoll_two_phase.so)
==25921== by 0xAAEB29B: mca_common_ompio_file_write_at_all (in
/usr/lib/x86_64-linux-gnu/openmpi/lib/libmca_common_ompio.so.41.9.0)
==25921== by 0xB316605: mca_io_ompio_file_write_at_all (in
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_io_ompio.so)
==25921== by 0x73C7FE7: PMPI_File_write_at_all (in
/usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so.40.10.3)
==25921== by 0x69E8700: H5FD__mpio_write (H5FDmpio.c:1466)
==25921== by 0x670D6EB: H5FD_write (H5FDint.c:248)
==25921== by 0x66DA0D3: H5F__accum_write (H5Faccum.c:826)
==25921== by 0x684F091: H5PB_write (H5PB.c:1031)
==25921== by 0x66E8055: H5F_shared_block_write (H5Fio.c:205)
==25921== by 0x6674538: H5D__chunk_collective_fill (H5Dchunk.c:5064)
==25921== by 0x6674538: H5D__chunk_allocate (H5Dchunk.c:4736)
Does anyone have any hint on what might be causing this?
Is this the "buggy MPI-IO" that Matt was mentioning in
https://lists.mcs.anl.gov/pipermail/petsc-users/2021-July/044138.html?
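If collective MPI-IO is the suspect, would switching the HDF5 viewer to independent transfers be a meaningful test? I was thinking of something like the following right after PetscViewerHDF5Open in the MWE (untested sketch; I am assuming PetscViewerHDF5SetCollective() is available in the version I am running):

  //Assumption: PetscViewerHDF5SetCollective() exists in this PETSc version;
  //disable collective MPI-IO transfers to see whether the scrambling goes away
  ierr = PetscViewerHDF5SetCollective(viewer, PETSC_FALSE); CHKERRQ(ierr);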
I am using the release branch (commit c548142fde) and I have configured
with --download-hdf5; configure finds the installed OpenMPI 3.1.3 from
Debian buster. The relevant lines from configure.log are:
MPI:
  Version:      3
  Mpiexec:      mpiexec --oversubscribe
  OMPI_VERSION: 3.1.3
hdf5:
  Version:  1.12.0
  Includes: -I/home/matteo/software/petsc/opt/include
  Library:  -Wl,-rpath,/home/matteo/software/petsc/opt/lib -L/home/matteo/software/petsc/opt/lib -lhdf5hl_fortran -lhdf5_fortran -lhdf5_hl -lhdf5
Best
Matteo