Il 15/07/21 17:44, Matteo Semplice ha scritto:
Hi.

When I write (HDF5 viewer) a vector associated to a DMDA with 1 dof, the output is independent of the number of cpus used.

However, for a DMDA with dof=2, the output seems to be correct when I run on 1 or 2 cpus, but is scrambled when I run with 4 cpus. Judging from the ranges of the data, each field gets written to the correct part, and its the data witin the field that is scrambled. Here's my MWE:

#include <petscversion.h>
#include <petscdmda.h>
#include <petscviewer.h>
#include <petscsys.h>
#include <petscviewerhdf5.h>

int main(int argc, char **argv) {

  PetscErrorCode ierr;
  ierr = PetscInitialize(&argc,&argv,(char*)0,help); CHKERRQ(ierr);
  PetscInt Nx=11;
  PetscInt Ny=11;
  PetscScalar dx = 1.0 / (Nx-1);
  PetscScalar dy = 1.0 / (Ny-1);
  DM dmda;
  ierr = DMDACreate2d(PETSC_COMM_WORLD,
                      DM_BOUNDARY_NONE,DM_BOUNDARY_NONE,
                      DMDA_STENCIL_STAR,
                      Nx,Ny, //global dim
                      PETSC_DECIDE,PETSC_DECIDE, //n proc on each dim
                      2,1, //dof, stencil width
                      NULL, NULL, //n nodes per direction on each cpu
                      &dmda);      CHKERRQ(ierr);
  ierr = DMSetFromOptions(dmda); CHKERRQ(ierr);
  ierr = DMSetUp(dmda); CHKERRQ(ierr); CHKERRQ(ierr);
  ierr = DMDASetUniformCoordinates(dmda, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0); CHKERRQ(ierr);
  ierr = DMDASetFieldName(dmda,0,"s"); CHKERRQ(ierr);
  ierr = DMDASetFieldName(dmda,1,"c"); CHKERRQ(ierr);
  DMDALocalInfo daInfo;
  ierr = DMDAGetLocalInfo(dmda,&daInfo); CHKERRQ(ierr);
  IS *is;
  DM *daField;
  ierr = DMCreateFieldDecomposition(dmda,NULL, NULL, &is, &daField); CHKERRQ(ierr);
  Vec U0;
  ierr = DMCreateGlobalVector(dmda,&U0); CHKERRQ(ierr);

  //Initial data
  typedef struct{ PetscScalar s,c;} data_type;
  data_type **u;
  ierr = DMDAVecGetArray(dmda,U0,&u); CHKERRQ(ierr);
  for (PetscInt j=daInfo.ys; j<daInfo.ys+daInfo.ym; j++){
    PetscScalar y = j*dy;
    for (PetscInt i=daInfo.xs; i<daInfo.xs+daInfo.xm; i++){
      PetscScalar x = i*dx;
      u[j][i].s = x+2.*y;
      u[j][i].c = 10. + 2.*x*x+y*y;
    }
  }
  ierr = DMDAVecRestoreArray(dmda,U0,&u); CHKERRQ(ierr);

  PetscViewer viewer;
  ierr = PetscViewerHDF5Open(PETSC_COMM_WORLD,"solutionSC.hdf5",FILE_MODE_WRITE,&viewer); CHKERRQ(ierr);
  Vec uField;
  ierr = VecGetSubVector(U0,is[0],&uField); CHKERRQ(ierr);
  PetscObjectSetName((PetscObject) uField, "S");
  ierr = VecView(uField,viewer); CHKERRQ(ierr);
  ierr = VecRestoreSubVector(U0,is[0],&uField); CHKERRQ(ierr);
  ierr = VecGetSubVector(U0,is[1],&uField); CHKERRQ(ierr);
  PetscObjectSetName((PetscObject) uField, "C");
  ierr = VecView(uField,viewer); CHKERRQ(ierr);
  ierr = VecRestoreSubVector(U0,is[1],&uField); CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&viewer); CHKERRQ(ierr);

  ierr = PetscFinalize();
  return ierr;
}

and my xdmf file

<?xml version="1.0" ?>
<Xdmf xmlns:xi="https://eur01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.w3.org%2F2001%2FXInclude&amp;data=04%7C01%7Cmatteo.semplice%40uninsubria.it%7C5207ada4b58b4312880108d947a769e8%7C9252ed8bdffc401c86ca6237da9991fa%7C0%7C0%7C637619607056182657%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=ksp%2B0oGApgERb%2FIokBrF4q0HzxDoqstxUxl%2FB%2B4Fn3U%3D&amp;reserved=0"; Version="2.0">
  <Domain>
    <Grid GridType="Collection" CollectionType="Temporal">
      <Time TimeType="List">
        <DataItem Dimensions="1">1.0</DataItem>
      </Time>
      <Grid GridType="Uniform" Name="domain">
        <Topology TopologyType="2DCoRectMesh" Dimensions="11 11"/>
        <Geometry GeometryType="ORIGIN_DXDY">
          <DataItem Format="XML" NumberType="Float" Dimensions="2">0.0 0.0</DataItem>           <DataItem Format="XML" NumberType="Float" Dimensions="2">0.1 0.1</DataItem>
        </Geometry>
        <Attribute Name="S" Center="Node" AttributeType="Scalar">
          <DataItem Format="HDF" Precision="8" Dimensions="11 11">solutionSC.hdf5:/S</DataItem>
        </Attribute>
        <Attribute Name="C" Center="Node" AttributeType="Scalar">
          <DataItem Format="HDF" Precision="8" Dimensions="11 11">solutionSC.hdf5:/C</DataItem>
        </Attribute>
      </Grid>
    </Grid>
  </Domain>
</Xdmf>

Steps to reprduce: run code and open the xdmf with paraview. If the code was run with 1,2 or 3 cpus, the data are correct (except the plane xy has become the plane yz), but with 4 cpus the data are scrambled.

Does anyone have any insight?

(I am using Petsc Release Version 3.14.2, but I can compile a newer one if you think it's important.)

Hi,

    I have a small update on this issue.

First, it is still here with version 3.15.2.

Secondly, I have run the code under valgrind and

- for 1 or 2 processes, I get no errors

- for 4 processes, 3 out of 4, trigger the following

==25921== Conditional jump or move depends on uninitialised value(s)
==25921==    at 0xB3D6259: ??? (in /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_fcoll_two_phase.so) ==25921==    by 0xB3D85C8: mca_fcoll_two_phase_file_write_all (in /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_fcoll_two_phase.so) ==25921==    by 0xAAEB29B: mca_common_ompio_file_write_at_all (in /usr/lib/x86_64-linux-gnu/openmpi/lib/libmca_common_ompio.so.41.9.0) ==25921==    by 0xB316605: mca_io_ompio_file_write_at_all (in /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_io_ompio.so) ==25921==    by 0x73C7FE7: PMPI_File_write_at_all (in /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so.40.10.3)
==25921==    by 0x69E8700: H5FD__mpio_write (H5FDmpio.c:1466)
==25921==    by 0x670D6EB: H5FD_write (H5FDint.c:248)
==25921==    by 0x66DA0D3: H5F__accum_write (H5Faccum.c:826)
==25921==    by 0x684F091: H5PB_write (H5PB.c:1031)
==25921==    by 0x66E8055: H5F_shared_block_write (H5Fio.c:205)
==25921==    by 0x6674538: H5D__chunk_collective_fill (H5Dchunk.c:5064)
==25921==    by 0x6674538: H5D__chunk_allocate (H5Dchunk.c:4736)
==25921==    by 0x668C839: H5D__init_storage (H5Dint.c:2473)
==25921==  Uninitialised value was created by a heap allocation
==25921==    at 0x483577F: malloc (vg_replace_malloc.c:299)
==25921==    by 0xB3D6155: ??? (in /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_fcoll_two_phase.so) ==25921==    by 0xB3D85C8: mca_fcoll_two_phase_file_write_all (in /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_fcoll_two_phase.so) ==25921==    by 0xAAEB29B: mca_common_ompio_file_write_at_all (in /usr/lib/x86_64-linux-gnu/openmpi/lib/libmca_common_ompio.so.41.9.0) ==25921==    by 0xB316605: mca_io_ompio_file_write_at_all (in /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_io_ompio.so) ==25921==    by 0x73C7FE7: PMPI_File_write_at_all (in /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so.40.10.3)
==25921==    by 0x69E8700: H5FD__mpio_write (H5FDmpio.c:1466)
==25921==    by 0x670D6EB: H5FD_write (H5FDint.c:248)
==25921==    by 0x66DA0D3: H5F__accum_write (H5Faccum.c:826)
==25921==    by 0x684F091: H5PB_write (H5PB.c:1031)
==25921==    by 0x66E8055: H5F_shared_block_write (H5Fio.c:205)
==25921==    by 0x6674538: H5D__chunk_collective_fill (H5Dchunk.c:5064)
==25921==    by 0x6674538: H5D__chunk_allocate (H5Dchunk.c:4736)

Does anyone have any hint on what might be causing this?

Is this the "buggy MPI-IO" that Matt was mentioning in https://lists.mcs.anl.gov/pipermail/petsc-users/2021-July/044138.html?

I am using the release branch (commit c548142fde) and I have configured with --download-hdf5; configure finds the installed openmpi 3.1.3 from Debian buster. The relevant lines from configure.log are

MPI:
  Version:  3
  Mpiexec: mpiexec --oversubscribe
  OMPI_VERSION: 3.1.3
hdf5:
  Version:  1.12.0
  Includes: -I/home/matteo/software/petsc/opt/include
  Library:  -Wl,-rpath,/home/matteo/software/petsc/opt/lib -L/home/matteo/software/petsc/opt/lib -lhdf5hl_fortran -lhdf5_fortran -lhdf5_hl -lhdf5

Best

    Matteo

Reply via email to