On Jan 3, 2011, at 1:49 PM, Rob Ross wrote:

> Hi Julian,
>
> Probably I'm being slow, just coming back from the holidays, but I think
> that the issue is that your data is noncontiguous in memory? Current ROMIO
> doesn't do buffering into a contiguous region prior to writing to PVFS
> (i.e., data sieving on writes is disabled). Looking at the PVFS2 ADIO
> implementation, it appears that by default we would instead create an
> hindexed PVFS type and let PVFS do the work (RobL can verify).
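As a rough illustration of the "buffering into a contiguous region" that Rob
mentions, an application can do the equivalent itself: pack the strided
diagonal into a contiguous buffer with MPI_Pack and then issue a single
contiguous write. This is only an untested sketch; the dimensions and file
name are placeholders, the datatype mirrors the MPI_Type_vector in Julian's
snippets below, and it assumes a homogeneous system so the packed bytes are
just the raw doubles.

/* Untested sketch: pack the strided diagonal into a contiguous buffer,
 * then write it in one piece.  Sizes and the file name are placeholders;
 * error checking is omitted. */
#include <mpi.h>
#include <stdlib.h>

int main (int argc, char **argv)
{
    int N = 100, myrows = 100;                 /* placeholder dimensions */
    double *matrix;
    MPI_Datatype diag_type;
    MPI_File fh;
    MPI_Status status;
    void *packbuf;
    int pack_size, position = 0;

    MPI_Init (&argc, &argv);
    matrix = calloc ((size_t) myrows * (N + 2), sizeof (double));

    /* same vector type as in Julian's snippet: one double every N + 2 */
    MPI_Type_vector (myrows, 1, N + 2, MPI_DOUBLE, &diag_type);
    MPI_Type_commit (&diag_type);

    MPI_File_open (MPI_COMM_SELF, "diag-test.out",
                   MPI_MODE_WRONLY | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);

    /* gather the diagonal elements into one contiguous buffer ...        */
    MPI_Pack_size (1, diag_type, MPI_COMM_SELF, &pack_size);
    packbuf = malloc (pack_size);
    MPI_Pack (matrix, 1, diag_type, packbuf, pack_size, &position,
              MPI_COMM_SELF);

    /* ... so the write itself is contiguous in memory and in the file,
     * instead of handing a strided memory type to MPI_File_write_at     */
    MPI_File_write_at (fh, 0, packbuf, position, MPI_BYTE, &status);

    free (packbuf);
    free (matrix);
    MPI_Type_free (&diag_type);
    MPI_File_close (&fh);
    MPI_Finalize ();
    return 0;
}

The extra memory copy buys one contiguous request per write call instead of
the stream of 512-byte requests seen in the Trove log below.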
If it were generating a hindexed type in PVFS (PVFS_Request_hindexed()), I
would not expect a bunch of small-io operations from client to server (125 of
them, 512 bytes each) as Julian described, but rather one normal I/O
operation with the hindexed PVFS type sent along. Is it possible that ROMIO
in your instance is using POSIX read/write calls rather than PVFS calls
directly? If the PVFS filesystem is mounted, you may need to specify the
filename with pvfs2:/path/to/file.
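For example (the path below is just a placeholder), an explicit prefix
selects ROMIO's PVFS2 driver for that file rather than relying on detection
of the mounted filesystem:

/* placeholder path: the "pvfs2:" prefix makes ROMIO use its PVFS2 (ADIO)
   driver for this file instead of going through the mounted POSIX path */
MPI_File_open (MPI_COMM_WORLD, "pvfs2:/mnt/pvfs2/vis.out",
               MPI_MODE_WRONLY | MPI_MODE_CREATE, MPI_INFO_NULL,
               &fd_visualization);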
-sam.

>
> This is sort of too bad, because in the "contiguous in file" case data
> sieving would be just fine. Opportunity lost.
>
> Do we agree on what is happening here?
>
> Great to hear that OrangeFS is comparing well.
>
> Regards,
>
> Rob
>
> On Dec 29, 2010, at 6:23 PM, Julian Kunkel wrote:
>
>> Dear Rob & others,
>> regarding derived datatypes & PVFS2, I observed the following with
>> MPICH2 1.3.2 and either PVFS 2.8.1 or orangefs-2.8.3-20101113.
>>
>> I use a derived memory datatype to write (append) the diagonal of a
>> matrix to a file in MPI. The data itself is written in a contiguous
>> manner (without applying any file view). Therefore, I would expect that
>> within MPI the data is written to the file contiguously; however, what I
>> observe is that many small writes (via small-io.sm) are issued instead.
>> The volume of the data (i.e. the matrix diagonal) is 64072 bytes and it
>> starts in the file at offset 41; each write call generates 125 small-io
>> operations of 512 bytes each and one of 72 bytes.
>>
>> In Trove (alt-aio) I can observe the sequence of writes including
>> offsets and sizes as follows:
>> <e t="3" time="11.629352" size="4" offset="41"/><un t="3"
>> time="11.629354"/><rel t="4" time="11.630205" p="0:9"/><s
>> name="alt-io-write" t="4" time="11.630209"/><e t="4" time="11.630223"
>> size="512" offset="45"/><un t="4" time="11.630225"/><rel t="5"
>> time="11.631027" p="0:10"/><s name="alt-io-write" t="5"
>> time="11.631030"/><e t="5" time="11.631045" size="512"
>> offset="557"/><un t="5" time="11.631047"/><rel t="6" time="11.631765"
>> p="0:11"/><s name="alt-io-write" t="6" time="11.631769"/><e t="6"
>> time="11.631784" size="512" offset="1069"/><un t="6"
>> time="11.631786"/><rel t="7" time="11.632460" p="0:12"/><s
>> name="alt-io-write" t="7" time="11.632464"/><e t="7" time="11.632483"
>> size="512" offset="1581"/>
>> ....
>> <e t="129" time="11.695048" size="72" offset="64045"/>
>>
>> The offsets increase linearly, so I could imagine that something in
>> ROMIO splits the I/O up because it assumes the data on disk is
>> non-contiguous.
>>
>> Here are some code snippets which produced this issue:
>>
>> Initialization of the file:
>> MPI_File_open (MPI_COMM_WORLD, name, MPI_MODE_WRONLY | MPI_MODE_CREATE,
>>                MPI_INFO_NULL, &fd_visualization);
>>
>> /* construct datatype for parts of a matrix diagonal */
>> MPI_Type_vector (myrows,          /* int count             */
>>                  1,               /* int blocklen          */
>>                  N + 2,           /* int stride            */
>>                  MPI_DOUBLE,      /* MPI_Datatype old_type */
>>                  &vis_datatype);  /* MPI_Datatype *newtype */
>> MPI_Type_commit (&vis_datatype);
>>
>> Per iteration, rank 0 writes the iteration number separately
>> (I know it's suboptimal):
>> ret = MPI_File_write_at (fd_visualization,    /* MPI_File fh           */
>>           (MPI_Offset) (start_row + vis_iter * (N + 1)) * sizeof (double)
>>               + (vis_iter - 1) * sizeof (int) + offset,
>>                                               /* MPI_Offset offset     */
>>           &stat_iteration,                    /* void *buf             */
>>           1,                                  /* int count             */
>>           MPI_INT,                            /* MPI_Datatype datatype */
>>           &status);                           /* MPI_Status *status    */
>>
>> This generates the small writes:
>> ret = MPI_File_write_at (fd_visualization,    /* MPI_File fh           */
>>           (MPI_Offset) (start_row + vis_iter * (N + 1)) * sizeof (double)
>>               + vis_iter * sizeof (int) + offset,
>>                                               /* MPI_Offset offset     */
>>           v,                                  /* void *buf             */
>>           1,                                  /* int count             */
>>           vis_datatype,                       /* MPI_Datatype datatype */
>>           &status);                           /* MPI_Status *status    */
>>
>> I attached a screenshot which shows the MPI activity and the server
>> activity in our tracing environment; there one can see that the
>> operations are processed sequentially on the server (one small request
>> is processed after another). Before I dig deeper into this issue, maybe
>> you already have an idea what is going on.
>>
>> By the way, I did some basic tests with the old instrumented version of
>> PVFS 2.8.1 vs. the instrumented OrangeFS, and I'm happy with the I/O
>> performance improvements; on our Xeon Westmere cluster the performance
>> is also more predictable.
>>
>> Thanks,
>> Julian
>> <orangefs1-small.png>

_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
