Hi Julian,

I have a few ideas for you to try to help narrow down these bugs.

I'm not sure how well the small-io stuff will work with non-contig. It was never rigorously tested. Can you recompile with -DPVFS2_SMALL_IO_OFF and run your tests again?
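For reference, one way to get that define into the build is through CPPFLAGS at configure time. This is just an illustration; adjust for however you configure your tree:

```sh
# Illustrative only -- pass the define through the standard autoconf hook
./configure CPPFLAGS="-DPVFS2_SMALL_IO_OFF"
make clean && make
```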

I've attached a patch that fixes the last valgrind error in your list (in PINT_distribute). Can you try it and let me know if that fixes it?

Thanks,

-sam

Attachment: memset-fdata-smallio.patch
Description: Binary data


On Mar 8, 2007, at 10:50 AM, Julian Martin Kunkel wrote:

Hi,

We are seeing rather strange, incorrect behavior with PVFS2 when using a file
view with MPI-IO at different access levels :)

mpiexec -np 2 ./MPI-IO -i 4 -f pvfs2://pvfs2/test -s 10 level0
0000000 0000 0000 0000 0000 0000 0101 0101 0101
0000010 0101 0101 0101 0101 0101 0101 0101 0101
*
0000030 0101 0000 0000 0000 0000 0000 0000 0000
0000040 0000 0000 0000
0000046

mpiexec -np 2 ./MPI-IO -i 4 -f pvfs2://pvfs2/test -s 10 level2

0000000 0000 0000 0000 0000 0000 0101 0101 0101
0000010 0101 0101 0000 0000 0000 0000 0000 0101
0000020 0101 0101 0101 0101 0000 0000 0000 0000
0000030 0000 0101 0101 0101 0101 0101 0101 0101
0000040 0101 0101 0101
0000046

In addition, with this level the number of bytes transferred between
client and servers does not match the amount of data it should be...

With level3 (non-contig, coll) and level1 (coll, contig) it looks correct,
like:
0000000 0000 0000 0000 0000 0000 0101 0101 0101
0000010 0101 0101 0000 0000 0000 0000 0000 0101
0000020 0101 0101 0101 0101 0000 0000 0000 0000
0000030 0000 0101 0101 0101 0101 0101 0000 0000
0000040 0000 0000 0000 0101 0101 0101 0101 0101
0000050

The minimum setup where this error occurred was 3 data servers. However, with
4 data servers, for example, the bug may sometimes disappear. Using 5
data servers and a bigger file (500K) (mpiexec -np 4 ./MPI-IO -i 10 -f
pvfs2://pvfs2/test -s 50K level2) shows that the content of the file
differs between runs. The md5sum might be, for example:
c809928d82ca72e00469283f2450c5f0
7d215f060b113f81c2210ac6e8e4c6d9
b4ca34c8a8a7b06a9b6d29e4b78964c3

Software: PVFS2 03/08/07 CVS and the new tiled-types-for-mkuhn.diff patch with
the current mpich2-1.0.5-p3...

I did some runs for the levels with valgrind; this showed (among other
reported issues) the following in level0 and level2:
==18294== Invalid read of size 4
==18294== at 0x80EF461: ADIOI_PVFS2_WriteStrided (ad_pvfs2_write.c:392)
==18294==    by 0x80AA299: MPIOI_File_write (write.c:156)
==18294==    by 0x80A9C80: PMPI_File_write (write.c:52)
==18294==    by 0x8056706: ??? (log_mpi_io.c:871)
==18294==    by 0x804ACDA: Test_level0 (MPI-IO.c:75)
==18294==    by 0x804B699: main (MPI-IO.c:309)
==18294== Address 0x4771460 is 0 bytes after a block of size 8 alloc'd
==18294==    at 0x401B867: malloc (vg_replace_malloc.c:149)
==18294==    by 0x80B505C: ADIOI_Malloc_fn (malloc.c:50)
==18294==    by 0x80B4D66: ADIOI_Optimize_flattened (flatten.c:759)
==18294==    by 0x80B3036: ADIOI_Flatten_datatype (flatten.c:79)
==18294==    by 0x80BF8C8: ADIO_Set_view (ad_set_view.c:52)
==18294==    by 0x80AA85A: PMPI_File_set_view (set_view.c:138)
==18294==    by 0x8055CDE: MPI_File_set_view (log_mpi_io.c:611)
==18294==    by 0x804AC80: Test_level0 (MPI-IO.c:70)
==18294==    by 0x804B699: main (MPI-IO.c:309)

Similar for reads in ReadStrided...
These issues are not reported for the other levels and look rather suspicious
to me...

The following issue is common for all levels:
==18315== Conditional jump or move depends on uninitialised value(s)
==18315==    at 0x8121869: PINT_distribute (pint-request.c:740)
==18315==    by 0x811FB0B: PINT_process_request (pint-request.c:322)
==18315==    by 0x8139641: small_io_completion_fn (sys-small-io.sm:257)
==18315==    by 0x8180DD9: msgpairarray_completion_fn (msgpairarray.sm:547)
==18315==    by 0x812A648: PINT_state_machine_next (state-machine-fns.h:158)
==18315==    by 0x8129D3D: PINT_client_state_machine_test (client-state-machine.c:559)
==18315==    by 0x812A1C3: PINT_client_wait_internal (client-state-machine.c:733)
==18315==    by 0x812A3C5: PVFS_sys_wait (client-state-machine.c:861)
==18315==    by 0x813300A: PVFS_sys_io (sys-io.sm:351)
==18315==    by 0x80ECCCD: ADIOI_PVFS2_ReadStrided (ad_pvfs2_read.c:500)
==18315==    by 0x80A9571: MPIOI_File_read (read.c:151)
==18315==    by 0x80A8F58: PMPI_File_read (read.c:52)

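For anyone wanting to reproduce traces like these, each MPI rank can be wrapped in valgrind under mpiexec. The exact valgrind options below are illustrative, not what was necessarily used here:

```sh
# Illustrative invocation -- wrap each rank in valgrind memcheck
mpiexec -np 2 valgrind --track-origins=yes --num-callers=12 \
    ./MPI-IO -i 4 -f pvfs2://pvfs2/test -s 10 level2
```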

Thanks,
Julian
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers

