Marty,
If my understand is right, when multiple clients issue non-collective I/O and if their data buffer is a vector of small non-overlapping file regions, instead of performing 'n' seeks + read/write ROMIO uses data sieving algorithm. For data-sieving write, first the extent of request is read into big buffer and respective write vectors memcpy'd into big buffer and then single BIG write is performed. Prior to performing data-sieving write, ROMIO locks the portion of the file pertaining to data-sieving buff-size, does seek + write, and then unlocks the file-range. This ensures the file integrity. ROMIO relies on ADIO-FS specific locking (in this case Lustre). So if the underlying file-system does not support fcntl() lock, then you see errors when the extent of the non-collective writes from multiple clients overlap. The easy solution, would be to replace non-collective MPI-IO calls with collective I/O MPI-IO calls. The two phase collective I/O algorithm should ensure file integrity and does not rely on file-locking since each process writes to a big non-overlapping region during the second phase. Or if you have to use non-collective I/O, may be implement ad_lustre fcntl exclusive lock using i) fcntl(EXCL_LOCK) --> open(lock_file, O_CREATE | O_EXCL) + close fcntl(UNLOCK) --> unlink(lock_file) ii) fcntl(EXCL_LOCK) --> MPI_Win_Lock() fcntl(EXCL_LOCK) --> MPI_Win_Unlock() Ofcourse you need to create a one-sided shared buffer in rank 0 when the file is opened MPI_File_Open + buffer destroyed during MPI_File_close() HTH, -Kums ________________________________ From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Marty Barnaby Sent: Thursday, May 08, 2008 12:35 PM Cc: [EMAIL PROTECTED] Subject: Re: [Lustre-discuss] mpi-io support To return to this discussion, in recent testing, I have found that writing to a Lustre FS via a higher level library, like PNetCDF, fails because the default for value for romio_ds_write is not disable. 
This is set in the mpich code, in the file /src/mpi/romio/adio/common/ad_hints.c. I believe it has something to do with locking issues. I'm not sure how best to handle this; I'd prefer that the data-sieving default be "disable", though I don't know all the implications there. Maybe ad_lustre_open would be the place where the _ds_ hints are set to disable.

Marty Barnaby

Weikuan Yu wrote:

Andreas Dilger wrote:

On Mar 11, 2008 16:10 -0600, Marty Barnaby wrote:

I'm not actually sure which ROMIO abstract device the multiple CFS deployments I utilize were defined with. Probably just UFS, or maybe NFS. Did you have a recommended option yourself?

The UFS driver is the one used for Lustre if no other one exists.

Besides the fact that most of the adio drivers created over the years are completely obsolete and could be cleaned out of ROMIO, what will the new one for Lustre offer? Particularly with respect to controls, via the lfs utility, that I can already get?

There is improved collective I/O that aligns the I/O on Lustre stripe boundaries. Also, the hints given to the MPI-IO layer (before open, not after) result in Lustre picking a better stripe count/size.

In addition, the one integrated into MPICH2-1.0.7 contains direct I/O support. Lockless I/O support was purged due to my lack of confidence in low-level file system support, but it can be revived when possible.

--
Weikuan Yu <+> 1-865-574-7990
http://ft.ornl.gov/~wyu/
_______________________________________________ Lustre-discuss mailing list [email protected] http://lists.lustre.org/mailman/listinfo/lustre-discuss
