On Jan 3, 2011, at 1:49 PM, Rob Ross wrote:

> Hi Julian,
>
> Probably I'm being slow, just coming back from the holidays, but I think
> that the issue is that your data is noncontiguous in memory? Current ROMIO
> doesn't do buffering into a contiguous region prior to writing to PVFS
> (i.e., data sieving on writes is disabled). Looking at the PVFS2 ADIO
> implementation, it appears that by default we would instead create an
> hindexed PVFS type and let PVFS do the work (RobL can verify).
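As a rough illustration of the "buffering into a contiguous region" that Rob
mentions, an application can do the equivalent itself: pack the strided
diagonal into a contiguous buffer with MPI_Pack and then issue a single
contiguous write. This is only an untested sketch; the dimensions and file
name are placeholders, the datatype mirrors the MPI_Type_vector in Julian's
snippets below, and it assumes a homogeneous system so the packed bytes are
just the raw doubles.

/* Untested sketch: pack the strided diagonal into a contiguous buffer,
 * then write it in one piece.  Sizes and the file name are placeholders;
 * error checking is omitted. */
#include <mpi.h>
#include <stdlib.h>

int main (int argc, char **argv)
{
    int N = 100, myrows = 100;                 /* placeholder dimensions */
    double *matrix;
    MPI_Datatype diag_type;
    MPI_File fh;
    MPI_Status status;
    void *packbuf;
    int pack_size, position = 0;

    MPI_Init (&argc, &argv);
    matrix = calloc ((size_t) myrows * (N + 2), sizeof (double));

    /* same vector type as in Julian's snippet: one double every N + 2 */
    MPI_Type_vector (myrows, 1, N + 2, MPI_DOUBLE, &diag_type);
    MPI_Type_commit (&diag_type);

    MPI_File_open (MPI_COMM_SELF, "diag-test.out",
                   MPI_MODE_WRONLY | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);

    /* gather the diagonal elements into one contiguous buffer ...        */
    MPI_Pack_size (1, diag_type, MPI_COMM_SELF, &pack_size);
    packbuf = malloc (pack_size);
    MPI_Pack (matrix, 1, diag_type, packbuf, pack_size, &position,
              MPI_COMM_SELF);

    /* ... so the write itself is contiguous in memory and in the file,
     * instead of handing a strided memory type to MPI_File_write_at     */
    MPI_File_write_at (fh, 0, packbuf, position, MPI_BYTE, &status);

    free (packbuf);
    free (matrix);
    MPI_Type_free (&diag_type);
    MPI_File_close (&fh);
    MPI_Finalize ();
    return 0;
}

The extra memory copy buys one contiguous request per write call instead of
the stream of 512-byte requests seen in the Trove log below.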
If it were generating a hindexed type in PVFS (PVFS_Request_hindexed()), I
would not expect a bunch of small-io operations from client to server (125 of
them, 512 bytes each) as Julian described, but rather one normal I/O
operation with the hindexed PVFS type sent along. Is it possible that ROMIO
in your instance is using POSIX read/write calls rather than PVFS calls
directly? If the PVFS filesystem is mounted, you may need to specify the
filename with pvfs2:/path/to/file.
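For example (the path below is just a placeholder), an explicit prefix
selects ROMIO's PVFS2 driver for that file rather than relying on detection
of the mounted filesystem:

/* placeholder path: the "pvfs2:" prefix makes ROMIO use its PVFS2 (ADIO)
   driver for this file instead of going through the mounted POSIX path */
MPI_File_open (MPI_COMM_WORLD, "pvfs2:/mnt/pvfs2/vis.out",
               MPI_MODE_WRONLY | MPI_MODE_CREATE, MPI_INFO_NULL,
               &fd_visualization);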
-sam.

>
> This is sort of too bad, because in the "contiguous in file" case data
> sieving would be just fine. Opportunity lost.
>
> Do we agree on what is happening here?
>
> Great to hear that OrangeFS is comparing well.
>
> Regards,
>
> Rob
>
> On Dec 29, 2010, at 6:23 PM, Julian Kunkel wrote:
>
>> Dear Rob & others,
>> regarding derived datatypes & PVFS2, I observed the following with
>> MPICH2 1.3.2 and either PVFS 2.8.1 or orangefs-2.8.3-20101113.
>>
>> I use a derived memory datatype to write (append) the diagonal of a
>> matrix to a file in MPI. The data itself is written in a contiguous
>> manner (without applying any file view). Therefore, I would expect that
>> within MPI the data is written to the file contiguously; however, what I
>> observe is that many small writes (via small-io.sm) are issued instead.
>> The volume of the data (i.e. the matrix diagonal) is 64072 bytes and it
>> starts in the file at offset 41; each write call generates 125 small-io
>> operations of 512 bytes each and one of 72 bytes.
>>
>> In Trove (alt-aio) I can observe the sequence of writes including
>> offsets and sizes as follows:
>> <e t="3" time="11.629352" size="4" offset="41"/><un t="3"
>> time="11.629354"/><rel t="4" time="11.630205" p="0:9"/><s
>> name="alt-io-write" t="4" time="11.630209"/><e t="4" time="11.630223"
>> size="512" offset="45"/><un t="4" time="11.630225"/><rel t="5"
>> time="11.631027" p="0:10"/><s name="alt-io-write" t="5"
>> time="11.631030"/><e t="5" time="11.631045" size="512"
>> offset="557"/><un t="5" time="11.631047"/><rel t="6" time="11.631765"
>> p="0:11"/><s name="alt-io-write" t="6" time="11.631769"/><e t="6"
>> time="11.631784" size="512" offset="1069"/><un t="6"
>> time="11.631786"/><rel t="7" time="11.632460" p="0:12"/><s
>> name="alt-io-write" t="7" time="11.632464"/><e t="7" time="11.632483"
>> size="512" offset="1581"/>
>> ....
>> <e t="129" time="11.695048" size="72" offset="64045"/>
>>
>> The offsets increase linearly, so I could imagine that something in
>> ROMIO splits the I/O up because it assumes the data on disk is
>> non-contiguous.
>>
>> Here are some code snippets which produced this issue:
>>
>> Initialization of the file:
>> MPI_File_open (MPI_COMM_WORLD, name, MPI_MODE_WRONLY | MPI_MODE_CREATE,
>>                MPI_INFO_NULL, &fd_visualization);
>>
>> /* construct datatype for parts of a matrix diagonal */
>> MPI_Type_vector (myrows,          /* int count             */
>>                  1,               /* int blocklen          */
>>                  N + 2,           /* int stride            */
>>                  MPI_DOUBLE,      /* MPI_Datatype old_type */
>>                  &vis_datatype);  /* MPI_Datatype *newtype */
>> MPI_Type_commit (&vis_datatype);
>>
>> Per iteration, rank 0 writes the iteration number separately
>> (I know it's suboptimal):
>> ret = MPI_File_write_at (fd_visualization,    /* MPI_File fh           */
>>           (MPI_Offset) (start_row + vis_iter * (N + 1)) * sizeof (double)
>>               + (vis_iter - 1) * sizeof (int) + offset,
>>                                               /* MPI_Offset offset     */
>>           &stat_iteration,                    /* void *buf             */
>>           1,                                  /* int count             */
>>           MPI_INT,                            /* MPI_Datatype datatype */
>>           &status);                           /* MPI_Status *status    */
>>
>> This generates the small writes:
>> ret = MPI_File_write_at (fd_visualization,    /* MPI_File fh           */
>>           (MPI_Offset) (start_row + vis_iter * (N + 1)) * sizeof (double)
>>               + vis_iter * sizeof (int) + offset,
>>                                               /* MPI_Offset offset     */
>>           v,                                  /* void *buf             */
>>           1,                                  /* int count             */
>>           vis_datatype,                       /* MPI_Datatype datatype */
>>           &status);                           /* MPI_Status *status    */
>>
>> I attached a screenshot which shows the MPI activity and the server
>> activity in our tracing environment; there one can see that the
>> operations are processed sequentially on the server (one small request
>> is processed after another). Before I dig deeper into this issue, maybe
>> you already have an idea what is going on.
>>
>> By the way, I did some basic tests with the old instrumented version of
>> PVFS 2.8.1 vs. the instrumented OrangeFS, and I'm happy with the I/O
>> performance improvements; on our Xeon Westmere cluster the performance
>> is also more predictable.
>>
>> Thanks,
>> Julian
>> <orangefs1-small.png>

_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
