Because you couldn't reproduce the same errors,
I've done some more extensive testing... (Doubt had crept in, and I was
in need of some reassurance ;-)

Test systems:

1) my laptop: Athlon XP 2200+ (32-bit), Gentoo, Linux version 2.6.16
([EMAIL PROTECTED]) (gcc version 3.4.5 (Gentoo 3.4.5-r1, ssp-3.4.5-1.0,
pie-8.7.9)) #1 PREEMPT Sat Apr 1 14:33:56 CEST 2006,
gcc:
Reading specs from /usr/lib/gcc/i686-pc-linux-gnu/3.4.6/specs
Configured with: /usr/tmp/portage/gcc-3.4.6-r1/work/gcc-3.4.6/configure
--prefix=/usr --bindir=/usr/i686-pc-linux-gnu/gcc-bin/3.4.6
--includedir=/usr/lib/gcc/i686-pc-linux-gnu/3.4.6/include
--datadir=/usr/share/gcc-data/i686-pc-linux-gnu/3.4.6
--mandir=/usr/share/gcc-data/i686-pc-linux-gnu/3.4.6/man
--infodir=/usr/share/gcc-data/i686-pc-linux-gnu/3.4.6/info
--with-gxx-include-dir=/usr/lib/gcc/i686-pc-linux-gnu/3.4.6/include/g++-v3
--host=i686-pc-linux-gnu --build=i686-pc-linux-gnu --disable-altivec
--enable-nls --without-included-gettext --with-system-zlib
--disable-checking --disable-werror --disable-libunwind-exceptions
--disable-multilib --disable-libgcj --enable-languages=c,c++,f77
--enable-shared --enable-threads=posix --enable-__cxa_atexit
--enable-clocale=gnu
Thread model: posix
gcc version 3.4.6 (Gentoo 3.4.6-r1, ssp-3.4.5-1.0, pie-8.7.9)

2) desktop system: P4 3 GHz, 2 GB memory (highmem on), Linux version
2.6.16.1 ([EMAIL PROTECTED]) (gcc version 3.3.6 (Gentoo 3.3.6, ssp-3.3.6-1.0,
pie-8.7.8)) #1 SMP PREEMPT Mon Apr 3 17:35:58 CEST 2006,
gcc:
Reading specs from /usr/lib/gcc-lib/i686-pc-linux-gnu/3.3.6/specs
Reading specs from
/usr/lib/gcc-lib/i686-pc-linux-gnu/3.3.6/hardenednossp.specs
Configured with: /home/tmp/portage/gcc-3.3.6/work/gcc-3.3.6/configure
--prefix=/usr --bindir=/usr/i686-pc-linux-gnu/gcc-bin/3.3.6
--includedir=/usr/lib/gcc-lib/i686-pc-linux-gnu/3.3.6/include
--datadir=/usr/share/gcc-data/i686-pc-linux-gnu/3.3.6
--mandir=/usr/share/gcc-data/i686-pc-linux-gnu/3.3.6/man
--infodir=/usr/share/gcc-data/i686-pc-linux-gnu/3.3.6/info
--with-gxx-include-dir=/usr/lib/gcc-lib/i686-pc-linux-gnu/3.3.6/include/g++-v3
--host=i686-pc-linux-gnu --build=i686-pc-linux-gnu --disable-altivec
--enable-nls --without-included-gettext --with-system-zlib
--disable-checking --disable-werror --disable-libunwind-exceptions
--disable-multilib --disable-libgcj --enable-languages=c,c++,f77
--enable-shared --enable-threads=posix --enable-__cxa_atexit
--enable-clocale=gnu
Thread model: posix
gcc version 3.3.6 (Gentoo 3.3.6, ssp-3.3.6-1.0, pie-8.7.8)

3) cluster: dual Opteron (64-bit mode), 4 GB memory, Linux version 2.6.14.4
([EMAIL PROTECTED]) (gcc version 3.3.5 (Debian 1:3.3.5-13)) #1 SMP Tue Dec 20
16:58:56 CET 2005,
gcc:
Reading specs from
/apps/prod/gcc/3.4.4/bin/../lib/gcc/x86_64-unknown-linux-gnu/3.4.4/specs
Configured with: ../../gcc-3.4.4/configure --prefix=/apps/prod/gcc/3.4.4/
Thread model: posix
gcc version 3.4.4

This is a modified Debian system; the kernel has Lustre patches.


To rule out coincidence, I made an automated install & test
script (attached below). The same script was run on all systems, with
some very minor modifications such as additional paths.

On all systems, the PVFS2 server was running locally
(1 metadata server, 1 data server); the database was erased and
recreated (-f) for every test. I've also verified that I had no old
libraries or other installs of hdf5/mpich/... lying around.
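
For reference, a minimal sketch of the kind of check I mean (paths are
examples; the prefix matches the attached script):

        which mpicc mpiexec pvfs2-ping     # all should resolve under $HOME/tmpinst
        ldd $HOME/tmpinst/bin/pvfs2-ping   # should not pull in libs from old installs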

The bad news: I've been able to reproduce the errors I got.
The slightly less bad news: I've also been able to reproduce the error
you got. I'm not really sure whether the two are related.

All tests used mpich2-1.0.3, pvfs2 1.4.0 and hdf5 1.6.5.

================= SYSTEM 1 ==================

* Unable to reproduce the corruption errors, but I did find another problem:

Testing  -- collective irregular contiguous read (ccontr)
Testing  -- collective irregular contiguous read (ccontr)
rank 2 in job 24  ltslaptop_38255   caused collective abort of all ranks
  exit status of rank 2: killed by signal 8

(Just for the record: the following error can occur if you lack enough
free disk space.)
Testing  -- extendible dataset collective write (ecdsetw)
Testing  -- extendible dataset collective write (ecdsetw)
Proc 2: *** PHDF5 ERROR ***
        Assertion (H5Dwrite succeeded) failed at line 1876 in
../../testpar/t_dset.c
aborting MPI process
rank 0 in job 9  ltslaptop_38255   caused collective abort of all ranks
  exit status of rank 0: killed by signal 14
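
(A quick look at the free space under HDF5_PARAPREFIX beforehand rules
that case out, e.g. "df -h /usr/tmp"; the path is just an example.)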

The interesting thing is that this is not related to pvfs2: the same
thing happens with the ROMIO ufs AD driver, so it looks like a ROMIO /
MPICH2 problem. I doubt it is an error in HDF5, because when testing
with Open MPI on non-pvfs2 filesystems, all tests pass without problems.

So, on my laptop:
pvfs2: fail; crash
ufs: fail; crash

===================== SYSTEM 2 ====================

========> Test on PVFS2:
Testing  -- extendible dataset collective read (ecdsetr)
Testing  -- extendible dataset collective read (ecdsetr)
Dataset Verify failed at [0][0](row 0, col 800): expect 801, got 0
Dataset Verify failed at [0][1](row 0, col 801): expect 802, got 0
Dataset Verify failed at [0][2](row 0, col 802): expect 803, got 0
Dataset Verify failed at [0][3](row 0, col 803): expect 804, got 0
Dataset Verify failed at [0][4](row 0, col 804): expect 805, got 0
Dataset Verify failed at [0][5](row 0, col 805): expect 806, got 0
Dataset Verify failed at [0][6](row 0, col 806): expect 807, got 0
Dataset Verify failed at [0][7](row 0, col 807): expect 808, got 0
Dataset Verify failed at [0][8](row 0, col 808): expect 809, got 0
Dataset Verify failed at [0][9](row 0, col 809): expect 810, got 0
[more errors ...]
40 errors found in dataset_vrfy
Proc 2: *** PHDF5 ERROR ***
        Assertion (dataset2 read verified correct) failed at line 2060
in ../../testpar/t_dset.c
aborting MPI process
Dataset Verify failed at [0][0](row 0, col 400): expect 401, got 0
Dataset Verify failed at [0][1](row 0, col 401): expect 402, got 0
Dataset Verify failed at [0][2](row 0, col 402): expect 403, got 0
Dataset Verify failed at [0][3](row 0, col 403): expect 404, got 0
Dataset Verify failed at [0][4](row 0, col 404): expect 405, got 0
Dataset Verify failed at [0][5](row 0, col 405): expect 406, got 0
Dataset Verify failed at [0][6](row 0, col 406): expect 407, got 0
Dataset Verify failed at [0][7](row 0, col 407): expect 408, got 0
Dataset Verify failed at [0][8](row 0, col 408): expect 409, got 0
Dataset Verify failed at [0][9](row 0, col 409): expect 410, got 0
[more errors ...]
80 errors found in dataset_vrfy
Proc 1: *** PHDF5 ERROR ***
        Assertion (dataset2 read verified correct) failed at line 2060
in ../../testpar/t_dset.c
aborting MPI process

========> Test on UFS:
Testing  -- collective irregular contiguous read (ccontr)
Testing  -- collective irregular contiguous read (ccontr)
rank 1 in job 10  mhd3_57187   caused collective abort of all ranks
  exit status of rank 1: killed by signal 8
Command exited with non-zero status 136
0.10user 0.02system 0:06.47elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1771minor)pagefaults 0swaps
make[1]: *** [_check-p] Error 1
make[1]: Leaving directory
`/home/lts/iotest/install/hdf5-1.6.5/build/testpar'
make: *** [check] Error 2

=> Crash (NOT an assertion failure!)
Again, this doesn't happen with Open MPI.

Conclusion:
PVFS2: corruption
UFS: same crash

============================= SYSTEM 3 ================================

===> Test on PVFS2
Testing  -- extendible dataset collective read (ecdsetr)
Testing  -- extendible dataset collective read (ecdsetr)
Testing  -- extendible dataset collective read (ecdsetr)
Dataset Verify failed at [0][0](row 0, col 800): expect 801, got 0
Dataset Verify failed at [0][1](row 0, col 801): expect 802, got 0
Dataset Verify failed at [0][2](row 0, col 802): expect 803, got 0
Dataset Verify failed at [0][3](row 0, col 803): expect 804, got 0
Dataset Verify failed at [0][4](row 0, col 804): expect 805, got 0
Dataset Verify failed at [0][5](row 0, col 805): expect 806, got 0
Dataset Verify failed at [0][6](row 0, col 806): expect 807, got 0
Dataset Verify failed at [0][7](row 0, col 807): expect 808, got 0
Dataset Verify failed at [0][8](row 0, col 808): expect 809, got 0
Dataset Verify failed at [0][9](row 0, col 809): expect 810, got 0
[more errors ...]
40 errors found in dataset_vrfy
Proc 2: *** PHDF5 ERROR ***
        Assertion (dataset2 read verified correct) failed at line 2060
in ../../testpar/t_dset.c
aborting MPI process
Testing  -- extendible dataset independent write #2 (eidsetw2)
Dataset Verify failed at [0][0](row 0, col 400): expect 401, got 0
Dataset Verify failed at [0][1](row 0, col 401): expect 402, got 0
Dataset Verify failed at [0][2](row 0, col 402): expect 403, got 0
Dataset Verify failed at [0][3](row 0, col 403): expect 404, got 0
Dataset Verify failed at [0][4](row 0, col 404): expect 405, got 0
Dataset Verify failed at [0][5](row 0, col 405): expect 406, got 0
Dataset Verify failed at [0][6](row 0, col 406): expect 407, got 0
Dataset Verify failed at [0][7](row 0, col 407): expect 408, got 0
Dataset Verify failed at [0][8](row 0, col 408): expect 409, got 0
Dataset Verify failed at [0][9](row 0, col 409): expect 410, got 0
[more errors ...]
80 errors found in dataset_vrfy
Proc 1: *** PHDF5 ERROR ***
        Assertion (dataset2 read verified correct) failed at line 2060
in ../../testpar/t_dset.c
aborting MPI process

==> Exactly the same problem as on system 2

====> Test on NFS
Testing  -- collective irregular contiguous read (ccontr)
Testing  -- collective irregular contiguous read (ccontr)
Testing  -- collective irregular contiguous read (ccontr)
rank 1 in job 6  lo-03-01_33192   caused collective abort of all ranks
  exit status of rank 1: killed by signal 8
Command exited with non-zero status 136

==> Same error as on system 2

I did find something pointing to the cause while testing on Lustre:

when using HDF5_PARAPREFIX=/l/users/dries
Testing  -- collective irregular contiguous read (ccontr)
Testing  -- collective irregular contiguous read (ccontr)
rank 1 in job 16  lo-03-01_33192   caused collective abort of all ranks
  exit status of rank 1: killed by signal 9
Command exited with non-zero status 137

when using HDF5_PARAPREFIX=nfs://l/users/dries
Testing  -- extendible dataset independent write #2 (eidsetw2)
Testing  -- compressed dataset collective read (cmpdsetr)
Testing  -- compressed dataset collective read (cmpdsetr)
Proc 0: *** PHDF5 ERROR ***
        Assertion (H5Fcreate succeeded) failed at line 2134 in
../../testpar/t_dset.c
aborting MPI process

when using HDF5_PARAPREFIX=ufs://l/users/dries
Testing  -- compressed dataset collective read (cmpdsetr)
Testing  -- compressed dataset collective read (cmpdsetr)
Proc 0: *** PHDF5 ERROR ***
        Assertion (H5Fcreate succeeded) failed at line 2134 in
../../testpar/t_dset.c
aborting MPI process

I went back and tested the same thing on my laptop:

HDF5_PARAPREFIX=/usr/tmp
Testing  -- collective irregular contiguous read (ccontr)
rank 2 in job 22  ltslaptop_38255   caused collective abort of all ranks
  exit status of rank 2: killed by signal 8
Command exited with non-zero status 9
0.18user 0.02system 0:10.86elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1688minor)pagefaults 0swaps

HDF5_PARAPREFIX=ufs://usr/tmp
Testing  -- extendible dataset independent write #2 (eidsetw2)
Testing  -- extendible dataset independent write #2 (eidsetw2)
Testing  -- compressed dataset collective read (cmpdsetr)
Proc 0: *** PHDF5 ERROR ***
        Assertion (H5Fcreate succeeded) failed at line 2134 in
../../testpar/t_dset.c
aborting MPI process
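
For the record, the prefix runs above were all of the same form (a
sketch, reusing the testpar build directory from the attached script):

        cd $INSTALLTREE/hdf5-1.6.5/build/testpar
        for prefix in /usr/tmp ufs://usr/tmp
        do
                export HDF5_PARAPREFIX=$prefix
                make check | tee "check-$(echo $prefix | tr ':/' '__').txt"
        done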


I hope you can make some more sense of this.
In the meantime, I'll try to test another system tomorrow.

On system 3, I have also done tests with Open MPI:
* Open MPI (svn trunk) with HDF5_PARAPREFIX=$PWD (or unset) on
NFS/Lustre: everything OK :-)
* HDF5_PARAPREFIX=ufs://$PWD : H5Fcreate error.
* HDF5_PARAPREFIX=pvfs2://... : data corruption errors

  Greetings,
  Dries

#!/bin/bash
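# attachment: the automated install & test script mentioned above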

INSTDIR=$HOME/tmpinst
# must match the StorageSpace setting in pvfs2-server.conf-localhost
DBSTORAGE=/usr/tmp/pvfs2-storage-space
INSTALLTREE=$PWD/install

if [ -d $INSTDIR ] 
then
        echo Already installed. Remove old install "($INSTDIR)" first
        exit 1
fi

if [ -d $INSTALLTREE ]
then
        echo Found old build tree at $INSTALLTREE. Remove first
        exit 1
fi

# make install dir
mkdir $INSTALLTREE 

# pvfs2
tar -C $INSTALLTREE -xvzf pvfs2-1.4.0.tar.gz
pushd $INSTALLTREE/pvfs2-1.4.0
mkdir build
cd build
../configure --prefix=$INSTDIR || exit 1
make || exit 1
make install || exit 1
popd

# set path
export PATH=$INSTDIR/sbin:$INSTDIR/bin:$PATH

# Start the pvfs2 server: wipe the old storage space, recreate it (-f), then run
export PVFS2TAB_FILE=$PWD/pvfs2tab
killall -KILL pvfs2-server
rm -rf $DBSTORAGE
$INSTDIR/sbin/pvfs2-server -f pvfs2-fs.conf pvfs2-server.conf-localhost
$INSTDIR/sbin/pvfs2-server pvfs2-fs.conf pvfs2-server.conf-localhost

# try ping
pvfs2-ping -m /tmp/pvfs2  || exit 1

# mpich2 install (ROMIO with ufs, nfs and pvfs2 support)
tar -C $INSTALLTREE -xvzf mpich2-1.0.3.tar.gz
pushd $INSTALLTREE/mpich2-1.0.3
mkdir build
cd build
../configure --prefix=$INSTDIR --disable-f77 --disable-fortran --disable-f90\
        --enable-romio --with-file-system=ufs+nfs+pvfs2 \
        CFLAGS="-I$INSTDIR/include" \
        LDFLAGS="-L$INSTDIR/lib -Wl,-rpath -Wl,$INSTDIR/lib" \
        LIBS="-lpvfs2 -lpthread" || exit 1

make || exit 1
make install || exit 1
popd


# boot mpich2's process manager (the daemon process is mpd, not mpdboot)
killall -KILL mpd
mpdboot


# hdf5
tar -C $INSTALLTREE -xvzf hdf5-1.6.5.tar.gz
pushd $INSTALLTREE/hdf5-1.6.5
mkdir build
cd build
../configure --prefix=$INSTDIR --enable-parallel CC=mpicc || exit 1
make || exit 1
cd testpar
# run the parallel test suite once on the default local filesystem...
unset HDF5_PARAPREFIX
make check | tee check-normal.txt
# ...and once on the PVFS2 mount
export HDF5_PARAPREFIX=pvfs2://tmp/pvfs2
make check | tee check-pvfs2.txt

popd
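
# attachment: pvfs2-fs.conf, the filesystem config passed to pvfs2-server above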
<Defaults>
        UnexpectedRequests 50
        LogFile /tmp/pvfs2-server.log
        EventLogging none
        LogStamp usec
        BMIModules bmi_tcp
        FlowModules flowproto_multiqueue
        PerfUpdateInterval 1000
        ServerJobBMITimeoutSecs 30
        ServerJobFlowTimeoutSecs 30
        ClientJobBMITimeoutSecs 300
        ClientJobFlowTimeoutSecs 300
        ClientRetryLimit 5
        ClientRetryDelayMilliSecs 2000
</Defaults>

<Aliases>
        Alias localhost tcp://localhost:3334
</Aliases>

<Filesystem>
        Name pvfs2-fs
        ID 64575174
        RootHandle 1048576
        <MetaHandleRanges>
                Range localhost 4-2147483650
        </MetaHandleRanges>
        <DataHandleRanges>
                Range localhost 2147483651-4294967297
        </DataHandleRanges>
        <StorageHints>
                TroveSyncMeta yes
                TroveSyncData no
                AttrCacheKeywords datafile_handles,metafile_dist
                AttrCacheKeywords dir_ent,symlink_target
                AttrCacheSize 4093
                AttrCacheMaxNumElems 32768
        </StorageHints>
</Filesystem>
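
# attachment: pvfs2tab, the mount table pointed to by PVFS2TAB_FILE in the script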
tcp://localhost:3334/pvfs2-fs /tmp/pvfs2 pvfs2 defaults
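
# attachment: pvfs2-server.conf-localhost, the per-server config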
StorageSpace /usr/tmp/pvfs2-storage-space
HostID "tcp://localhost:3334"