Yes, obviously something is broken, and I'm not sure why switching networks would break it. I'm the expert on the distribution code - which is the PINT_process_request thing. I would suggest starting in io_find_target_datafiles, where it presumably sets up the result structure, and seeing if you can figure out why it is setting that parameter incorrectly. My intuition is that it shouldn't depend on which network is used, but I'm not familiar with that code.

Walt

Scott Atchley wrote:
Walt,

It does not crash, but it clearly should not have 0 or negative values for bytes. Kyle's original post reported that this succeeds when he uses TCP but fails when he uses MX. He thought (and I agree) that changing the BMI method should not, in general, change the result.

I have seen applications break when moving from TCP to MX when the developer overlooks possible race conditions (e.g. assume that all nodes progress in lock-step when, in fact, some may progress faster than others).

I do not know where the values are set (BMI, PVFS2 or MPICH2). I plan to do a little more debugging to try to narrow it down.

Scott

On Aug 6, 2007, at 1:20 PM, walt wrote:

What do you mean when you say "fails"? What you have shown here SHOULD produce an error - it should not crash. The bytemax should not be less than bytes, and in any case should not be negative. It seems that the caller has for some reason passed an improperly set up result structure.

I haven't checked the BMI code, but this appears to be a module that is trying to decide which servers have part of the data for this request. For this we usually set the bytemax to 1 (which says if there is at least one byte on this server, stop and let us know). Maybe we should add an error check for a negative bytemax, but at least in this case it should have called gossip_error.
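
The kind of error check mentioned above could look roughly like the sketch below. It is only an illustration: struct sketch_result is a hypothetical stand-in whose field names follow the gdb dump later in this thread (the field types are assumptions), sketch_check_result is an invented name, and fprintf stands in for gossip_error. It is not the actual PINT_process_request code.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the result structure; field names follow the
 * gdb dump in this thread, the types are assumptions. */
struct sketch_result {
    int64_t *offset_array;
    int64_t *size_array;
    int32_t segmax;   /* maximum number of segments to return */
    int32_t segs;     /* segments returned so far */
    int64_t bytemax;  /* maximum number of bytes to process */
    int64_t bytes;    /* bytes processed so far */
};

/* Returns 0 if the limits are usable, -1 otherwise.
 * fprintf stands in for an error-logging call such as gossip_error. */
int sketch_check_result(const struct sketch_result *r)
{
    if (r == NULL) {
        fprintf(stderr, "result structure is NULL\n");
        return -1;
    }
    /* Zero or negative limits are both rejected. */
    if (r->segmax <= 0 || r->bytemax <= 0) {
        fprintf(stderr, "bad limits: segmax=%d bytemax=%lld\n",
                (int)r->segmax, (long long)r->bytemax);
        return -1;
    }
    /* Counters must be non-negative and must not exceed their limits. */
    if (r->segs < 0 || r->bytes < 0 ||
        r->segs > r->segmax || r->bytes > r->bytemax) {
        fprintf(stderr, "bad counters: segs=%d bytes=%lld\n",
                (int)r->segs, (long long)r->bytes);
        return -1;
    }
    return 0;
}
```

With the values from the dump (segmax = 1, segs = 0, bytemax = -1291, bytes = 0) this check rejects the structure because bytemax is negative, whereas a test that only looks for a zero bytemax would let it through.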

Walt

Scott Atchley wrote:
Hi Sam,
Kyle sent me the code and I compiled it this morning.
First, I was using mpich2-mx compiled with PVFS2 support. It failed with the error that MX was already initialized. Both mpich2-mx and bmi_mx are calling mx_init(). I changed bmi_mx to ignore MX_ALREADY_INITIALIZED. Second, I do not see any errors returned in bmi_mx. It fails in PINT_process_request (see call trace below). The request has segs = 0, bytemax = -1291, and bytes = 0. It could well be that these values are incorrect due to a bug in bmi_mx that is not flagging an error, but I have no idea.
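
For illustration, tolerating an "already initialized" return when two layers initialize the same library might be sketched as below. Everything in it is a stand-in: the enum values, sketch_init_shared_library, and the fake init functions are hypothetical and are not the MX API or the actual bmi_mx change.

```c
#include <stdio.h>

/* Hypothetical stand-ins for a library's return codes (not the MX API). */
enum sketch_mx_return {
    SKETCH_MX_SUCCESS = 0,
    SKETCH_MX_ALREADY_INITIALIZED = 1,
    SKETCH_MX_OTHER_ERROR = 2
};

typedef enum sketch_mx_return (*sketch_init_fn)(void);

/* Treat "already initialized" as success so that a second caller in the
 * same process (for example an MPI library and a BMI method) does not fail.
 * Returns 0 on success, -1 on any other error. */
int sketch_init_shared_library(sketch_init_fn init)
{
    enum sketch_mx_return rc = init();
    if (rc == SKETCH_MX_SUCCESS || rc == SKETCH_MX_ALREADY_INITIALIZED) {
        return 0;
    }
    fprintf(stderr, "library init failed with code %d\n", (int)rc);
    return -1;
}

/* Fake init: succeeds once, then reports "already initialized". */
static int sketch_initialized = 0;

enum sketch_mx_return sketch_fake_init(void)
{
    if (sketch_initialized) {
        return SKETCH_MX_ALREADY_INITIALIZED;
    }
    sketch_initialized = 1;
    return SKETCH_MX_SUCCESS;
}

/* Fake init that always fails, to exercise the error path. */
enum sketch_mx_return sketch_failing_init(void)
{
    return SKETCH_MX_OTHER_ERROR;
}
```

One thing a real change would still have to settle is which layer owns the matching finalize call, since only one of the two initializers actually did the work.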
Can you take a look at this?
Thanks,
Scott
0:  (gdb) b PINT_process_request
0: Breakpoint 2 at 0x4701c8: file src/io/description/pint-request.c, line 72.
0:  (gdb) run -fname pvfs2://mnt/pvfs2/atchley/blah -fsize 1 -timing
0:  Continuing.
0:  ========= Parameter space dump =========
0:  filename: pvfs2://mnt/pvfs2/atchley/blah  ionodes
0:  file size (MB): 1 buffer size 0
0:  vector length: 10 element count: 1 vector count: 0
0:  striping factor: 0 striping size: -1 collective buffer size: 0
0:  loops: 1 displacement 0
0:  ========= Dump done            =========
0:  #* no verification possible!
0: calling noncontigmem_noncontigfile(pvfs2://mnt/pvfs2/atchley/blah, 0x0x2aaaaaaab010, 1048560)
0:
0: # testing noncontiguous in memory, noncontiguous in file using independent I/O
0:  # vector count = 26214 - access count = 26214
0:  calling MPI_File_open(pvfs2://mnt/pvfs2/atchley/blah)
0:  calling MPI_File_set_view()
0:  calling MPI_File_seek()
0:  calling MPI_File_write()
0:  [New Thread 1082132816 (LWP 29290)]
0:  [New Thread 1090525520 (LWP 29291)]
0:
0:  Breakpoint 2, PINT_process_request (req=0x6aea50, mem=0x6aeb00,
0:      rfdata=0x7fffd112b880, result=0x7fffd112b850, mode=2)
0:      at src/io/description/pint-request.c:72
0:  72          void *temp_space = NULL; /* temp copy of req state for size call */
0:  (gdb)
0:  (gdb) bt
0:  #0  PINT_process_request (req=0x6aea50, mem=0x6aeb00, rfdata=0x7fffd112b880,
0:      result=0x7fffd112b850, mode=2) at src/io/description/pint-request.c:72
0:  #1  0x00000000004844e0 in io_find_target_datafiles (mem_req=0x6ad160,
0:      file_req=0x6ae960, file_req_offset=0, dist_p=0x6ae9c0, fs_id=1825963815,
0:      io_type=PVFS_IO_WRITE, input_handle_array=0x6b9510, input_handle_count=4,
0:      handle_index_array=0x6b9240, handle_index_out_count=0x7fffd112b944,
0:      sio_handle_index_array=0x6aea30, sio_handle_index_count=0x7fffd112b940)
0:      at src/client/sysint/sys-io.sm:2320
0:  #2  0x0000000000480010 in io_datafile_setup_msgpairs (sm_p=0x6ba4a0,
0:      js_p=0x7fffd112b9f0) at src/client/sysint/sys-io.sm:489
0:  #3  0x0000000000476a66 in PINT_state_machine_next (s=0x6ba4a0,
0:      r=0x7fffd112b9f0) at ./src/common/misc/state-machine-fns.h:158
0: #4 0x0000000000476645 in PINT_client_state_machine_post (sm_p=0x6ba4a0,
0:      pvfs_sys_op=6, op_id=0x7fffd112bb30, user_ptr=0x0)
0:      at src/client/sysint/client-state-machine.c:312
0:  #5  0x000000000047f9fc in PVFS_isys_io (ref=
0:      {handle = 1048563, fs_id = 1825963815, __pad1 = 0}, file_req=0x6ae960,
0:      file_req_offset=0, buffer=0x0, mem_req=0x6ad160, credentials=0x6b8ea0,
0:      resp_p=0x7fffd112bba0, io_type=PVFS_IO_WRITE, op_id=0x7fffd112bb30,
0:      user_ptr=0x0) at src/client/sysint/sys-io.sm:328
0:  #6  0x000000000047facf in PVFS_sys_io (ref=
0:      {handle = 1048563, fs_id = 1825963815, __pad1 = 0}, file_req=0x6ae960,
0:      file_req_offset=0, buffer=0x0, mem_req=0x6ad160, credentials=0x6b8ea0,
0:      resp_p=0x7fffd112bba0, io_type=PVFS_IO_WRITE)
0:      at src/client/sysint/sys-io.sm:351
0:  #7  0x0000000000458cb2 in ADIOI_PVFS2_WriteStrided (fd=0x6b8d00,
0:      buf=0x2aaaaaaab010, count=26214, datatype=-1946157050, file_ptr_type=101,
0:      offset=0, status=0x7fffd112be30, error_code=0x7fffd112bd70)
0:      at /nfs/home/atchley/projects/mpich2/mpich2-snap-200706132016/src/mpi/romio/adio/ad_pvfs2/ad_pvfs2_write.c:1001
0:  #8  0x000000000041afcb in MPIOI_File_write (mpi_fh=0x6b8d00, offset=0,
0:      file_ptr_type=101, buf=0x2aaaaaaab010, count=26214, datatype=-1946157050,
0:      myname=0x63ac74 "MPI_FILE_WRITE", status=0x7fffd112be30)
0:      at /nfs/home/atchley/projects/mpich2/mpich2-snap-200706132016/src/mpi/romio/mpi-io/write.c:156
0:  #9  0x000000000041aafd in PMPI_File_write (mpi_fh=0x6b8d00,
0:      buf=0x2aaaaaaab010, count=26214, datatype=-1946157050,
0:      status=0x7fffd112be30)
0:      at /nfs/home/atchley/projects/mpich2/mpich2-snap-200706132016/src/mpi/romio/mpi-io/write.c:52
0:  #10 0x000000000040461e in noncontigmem_noncontigfile (
0:      filename=0x668110 "pvfs2://mnt/pvfs2/atchley/blah", buf=0x2aaaaaaab010,
0:      bufsize=1048560, dtype=-1946157050, offset=0, displs=0, finfo=-1677721600,
0:      veclen=10, elmtcount=1, veccount=26214) at noncontig.c:185
0:  #11 0x000000000040738d in main (argc=1, argv=0x7fffd112c608)
0:      at noncontig.c:1020
0:  (gdb) s
0: 74 PVFS_offset contig_offset = 0; /* temp for offset of a contig region */
0:  (gdb)
0:  78          if (!PINT_IS_MEMREQ(mode))
0:  (gdb)
0:  79          gossip_debug(GOSSIP_REQUEST_DEBUG,
0:  (gdb)
0: 81 gossip_debug(GOSSIP_REQUEST_DEBUG,"PINT_process_request\n");
0:  (gdb)
0:  83          if (!req)
0:  (gdb)
0:  88          if (!result || !result->segmax || !result->bytemax)
0:  (gdb) p *result
0: $1 = {offset_array = 0x7fffd112b8a8, size_array = 0x7fffd112b8a0, segmax = 1,
0:    segs = 0, bytemax = -1291, bytes = 0}
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers