Re: [Pvfs2-developers] OrangeFS

Rob Ross Tue, 04 Jan 2011 15:13:08 -0800

Ah, good point re: hindexed. -- Rob

On Jan 4, 2011, at 5:06 PM, Sam Lang wrote:

On Jan 3, 2011, at 1:49 PM, Rob Ross wrote:
Hi Julian,
Probably I'm being slow, just coming back from the holidays, but Ithink that the issue is that your data is noncontiguous in memory?Current ROMIO doesn't do buffering into a contiguous region priorto writing to PVFS (i.e., data sieving on writes is disabled).Looking at the PVFS2 ADIO implementation, it appears that bydefault we would instead create an hindexed PVFS type and let PVFSdo the work (RobL can verify).
If it were generating a hindexed type in PVFS(PVFS_Request_hindexed()), I would not expect a bunch of small-iooperations (125-bytes each) from client to server as Juliandescribed, but rather a normal IO operation with the hindexed PVFStype sent along.
Is it possible that ROMIO in your instance is using POSIX read/writecalls rather than PVFS calls directly? If the PVFS filesystem ismounted, you may need to specify the filename with pvfs2:/path/to/file.
-sam.
This is sort of too bad, because in the "contiguous in file" casedata sieving would be just fine. Opportunity lost.
Do we agree on what is happening here?

Great to hear that OrangeFS is comparing well.

Regards,

Rob

On Dec 29, 2010, at 6:23 PM, Julian Kunkel wrote:
Dear Rob & others,
regarding derived datatypes & PVFS2 I observed the following withMPICH2 1.3.2
and either PVFS 2.8.1 or orangefs-2.8.3-20101113.

I use a derived memory datatype to write (append) the diagonal of a
matrix to a file in MPI.
The data itself is written in a contiguous manor (without applyingany
file view).
Therefore, I would expect that within MPI the data is written in a
contiguous way to the file,
however what I observe is that many small writes by using the
small-io.sm are used.
The volume of the data (e.g. the matrix diagonal) is 64072 byte and
starts in the file at offset 41, each I/O generates 125 small-iowith
a size of 512 Bytes and one with 72 byte.

In Trove (alt-aio) I can observe the sequence of writes including
offset and sizes as follows:
<e t="3" time="11.629352" size="4" offset="41"/><un t="3"
time="11.629354"/><rel t="4" time="11.630205" p="0:9"/><s
name="alt-io-write" t="4" time="11.630209"/><e t="4"time="11.630223"
size="512" offset="45"/><un t="4" time="11.630225"/><rel t="5"
time="11.631027" p="0:10"/><s name="alt-io-write" t="5"
time="11.631030"/><e t="5" time="11.631045" size="512"
offset="557"/><un t="5" time="11.631047"/><rel t="6"time="11.631765"
p="0:11"/><s name="alt-io-write" t="6" time="11.631769"/><e t="6"
time="11.631784" size="512" offset="1069"/><un t="6"
time="11.631786"/><rel t="7" time="11.632460" p="0:12"/><s
name="alt-io-write" t="7" time="11.632464"/><e t="7"time="11.632483"
size="512" offset="1581"/>
....
<e t="129" time="11.695048" size="72" offset="64045"/>
The offsets increase linearly, therefore I could imagine somethingin
ROMIO might split the I/O up, because it might guess data on disk is
non-contiguous.
Here are some code snippets attached which produced this issue:

Initalization of the file:
MPI_File_open (MPI_COMM_WORLD, name, MPI_MODE_WRONLY |
MPI_MODE_CREATE, MPI_INFO_NULL,                  &fd_visualization);

/* construct datatype for parts of a Matrix diagonal */
MPI_Type_vector (myrows,      /*int count */
                1,           /*int blocklen */
                N + 2,       /*int stride */
                MPI_DOUBLE,  /*MPI_Datatype old_type */
                &vis_datatype);      /*MPI_Datatype *newtype */
MPI_Type_commit (&vis_datatype);

Per iteration:
rank0 writes iteration number extra - (I know its suboptimal):
ret = MPI_File_write_at (fd_visualization, /*MPI_Filefh, */
                            (MPI_Offset) (start_row + vis_iter * (N
+ 1)) * sizeof (double) + (vis_iter - 1) * sizeof (int) + offset,
/*MPI_Offset offset,  */
                            &stat_iteration, /*void *buf, */
                            1,       /*int count,  */
                            MPI_INT, /*MPI_Datatype datatype,  */
                            &status  /* MPI_Status *status */

This generates the small writes:
ret = MPI_File_write_at (
                         fd_visualization,   /*MPI_File fh,  */
                         (MPI_Offset) (start_row + vis_iter * (N +
1)) * sizeof (double) + vis_iter * sizeof (int) + offset, /*MPI_Offset
offset,  */
                         v,  /*void *buf, */
                         1,  /*int count,  */
vis_datatype, /*MPI_Datatypedatatype, */
                         &status );    /* MPI_Status *status */

I attached a screenshot which shows the MPI activity and the server
activity of our tracing environment, here one can see that the
operations are processed sequentially on the server (one smallrequest
is processed after another).
Before I might dig deeper into this issue, maybe you have an idea
about this issue.

By the way, I did some basic tests with the old instrumented version
of PVFS 2.8.1 vs. the instrumented OrangeFS and I'm happy about the
I/O performance improvements, on our Xeon Westmere cluster the
performance is also more predictable.

Thanks,
Julian
<orangefs1-small.png>_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers


_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers

Re: [Pvfs2-developers] OrangeFS

Reply via email to