Dear Rob & others,

regarding derived datatypes & PVFS2, I observed the following with MPICH2 1.3.2 and either PVFS 2.8.1 or orangefs-2.8.3-20101113.
I use a derived memory datatype to write (append) the diagonal of a matrix to a file with MPI. The data itself is written in a contiguous manner (without applying any file view). Therefore, I would expect MPI to write the data to the file in one contiguous chunk; what I observe instead is that many small writes via small-io.sm are issued.
The volume of the data (the matrix diagonal) is 64072 bytes and starts in the file at offset 41; each I/O generates 125 small-io requests of 512 bytes and one of 72 bytes.
In Trove (alt-aio) I can observe the sequence of writes, including offsets and sizes, as follows:
<e t="3" time="11.629352" size="4" offset="41"/><un t="3" time="11.629354"/>
<rel t="4" time="11.630205" p="0:9"/><s name="alt-io-write" t="4" time="11.630209"/><e t="4" time="11.630223" size="512" offset="45"/><un t="4" time="11.630225"/>
<rel t="5" time="11.631027" p="0:10"/><s name="alt-io-write" t="5" time="11.631030"/><e t="5" time="11.631045" size="512" offset="557"/><un t="5" time="11.631047"/>
<rel t="6" time="11.631765" p="0:11"/><s name="alt-io-write" t="6" time="11.631769"/><e t="6" time="11.631784" size="512" offset="1069"/><un t="6" time="11.631786"/>
<rel t="7" time="11.632460" p="0:12"/><s name="alt-io-write" t="7" time="11.632464"/><e t="7" time="11.632483" size="512" offset="1581"/>
....
<e t="129" time="11.695048" size="72" offset="64045"/>
The offsets increase linearly; therefore, I could imagine that something in ROMIO splits the I/O up because it guesses that the data on disk is non-contiguous.
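If ROMIO's handling of strided requests is the culprit, one quick experiment might be to open the file with ROMIO's generic data-sieving hint for writes disabled. The hint name "romio_ds_write" is a standard ROMIO hint; whether the PVFS2 ADIO driver honors it here is an assumption on my part, so this is only a sketch for testing:

```c
MPI_Info info;
MPI_Info_create (&info);
/* Standard ROMIO hint: disable data sieving for writes, so strided
 * requests are not broken into small aligned pieces. */
MPI_Info_set (info, "romio_ds_write", "disable");
MPI_File_open (MPI_COMM_WORLD, name, MPI_MODE_WRONLY | MPI_MODE_CREATE,
               info, &fd_visualization);
MPI_Info_free (&info);
```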
Here are some code snippets which produce this issue.

Initialization of the file:

MPI_File_open (MPI_COMM_WORLD, name, MPI_MODE_WRONLY | MPI_MODE_CREATE,
               MPI_INFO_NULL, &fd_visualization);
/* construct datatype for parts of a Matrix diagonal */
MPI_Type_vector (myrows, /*int count */
1, /*int blocklen */
N + 2, /*int stride */
MPI_DOUBLE, /*MPI_Datatype old_type */
&vis_datatype); /*MPI_Datatype *newtype */
MPI_Type_commit (&vis_datatype);
Per iteration, rank 0 writes the iteration number separately (I know it's suboptimal):
ret = MPI_File_write_at (fd_visualization,        /* MPI_File fh */
                         (MPI_Offset) (start_row + vis_iter * (N + 1)) * sizeof (double)
                           + (vis_iter - 1) * sizeof (int) + offset, /* MPI_Offset offset */
                         &stat_iteration,         /* void *buf */
                         1,                       /* int count */
                         MPI_INT,                 /* MPI_Datatype datatype */
                         &status);                /* MPI_Status *status */
This generates the small writes:
ret = MPI_File_write_at (fd_visualization,        /* MPI_File fh */
                         (MPI_Offset) (start_row + vis_iter * (N + 1)) * sizeof (double)
                           + vis_iter * sizeof (int) + offset, /* MPI_Offset offset */
                         v,                       /* void *buf */
                         1,                       /* int count */
                         vis_datatype,            /* MPI_Datatype datatype */
                         &status);                /* MPI_Status *status */
I attached a screenshot which shows the MPI activity and the server activity in our tracing environment; one can see that the operations are processed sequentially on the server (one small request is processed after another).

Before I dig deeper into this issue, maybe you already have an idea about it.
By the way, I did some basic tests with the old instrumented version of PVFS 2.8.1 vs. the instrumented OrangeFS, and I'm happy about the I/O performance improvements; on our Xeon Westmere cluster the performance is also more predictable.
Thanks,
Julian
<orangefs1-small.png>

_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers