Hi Julian,

Thanks for sending out your thoughts and ideas. Even though we've talked about most of this offline, I'm just going to summarize what I said in case others want to comment.

-sam

On Jun 10, 2006, at 1:57 PM, Julian Martin Kunkel wrote:

Hi,
I looked a bit around the implementation of the data sync mode. Currently PINT_flow_setinfo is called, which sets the sync mode for each write operation of a flow. That means if 100 MByte are transferred in 256 KByte blocks, a sync happens for every block, which ends up in quite a lot of syncs.
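
To put a number on "quite a lot": 100 MByte in 256 KByte blocks is 400 sync calls for a single flow. A standalone sketch of the pattern (plain POSIX, nothing PVFS-specific):

/* Illustration of the cost Julian describes: syncing after every
 * 256 KB block of a 100 MB transfer issues 400 fdatasync() calls,
 * versus a single call if we sync only at the end. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLOCK (256 * 1024)
#define TOTAL (100 * 1024 * 1024)

int main(void)
{
    static char buf[BLOCK];
    int fd = open("/tmp/sync-test", O_CREAT | O_WRONLY | O_TRUNC, 0600);
    int syncs = 0;
    size_t done;

    if (fd < 0)
        return 1;
    memset(buf, 0xaa, sizeof(buf));
    for (done = 0; done < TOTAL; done += BLOCK) {
        if (write(fd, buf, BLOCK) != BLOCK)
            return 1;
        fdatasync(fd);  /* per-write sync: TOTAL/BLOCK = 400 calls */
        syncs++;
    }
    /* an end-of-transfer policy would instead do one fdatasync() here */
    printf("issued %d fdatasync() calls\n", syncs);
    close(fd);
    return 0;
}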

Maybe it would be nice if the client could specify in the IO request (PVFS_servreq_io) whether the data should be synced, instead of setting it per filesystem. Maybe the kernel interface could take advantage of this to save sync operations, or it could be useful elsewhere? Of course, this value could be filled by default from the filesystem's TroveSyncData option.
In MPI there is an explicit sync via MPI_File_sync; maybe we could rely on this for MPI apps?
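
For reference, the explicit-sync path from an MPI app uses only standard MPI-IO calls (the file name here is made up):

/* Explicit sync from an MPI application: write, then MPI_File_sync()
 * forces the data to storage. With a per-request sync flag, an MPI-IO
 * implementation could write with sync disabled and map MPI_File_sync
 * onto a single synced request. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    char buf[4096] = { 0 };

    MPI_Init(&argc, &argv);
    MPI_File_open(MPI_COMM_WORLD, "pvfs2:/mnt/pvfs2/testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write(fh, buf, sizeof(buf), MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_sync(fh);                  /* data is durable after this */
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}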

This also requires an additional flag to be added to the parameters of PVFS_sys_io. The flag would specify whether to sync or not (and could be extended for other uses). This saves a roundtrip between client and server, because the flag can be sent along with the IO request (as Julian proposes) instead of requiring a separate flush operation.
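
Something along these lines, maybe; all of the names below are hypothetical, just to make the idea concrete:

/* HYPOTHETICAL sketch of the flag -- none of these names exist yet.
 * The client passes it with the IO request, so the server learns at
 * request time whether this op must be synced; no separate flush. */
enum PVFS_io_flags {
    PVFS_IO_FLAG_DEFAULT = 0,        /* follow the fs TroveSyncData setting */
    PVFS_IO_FLAG_SYNC    = (1 << 0), /* sync this request's data */
    PVFS_IO_FLAG_NOSYNC  = (1 << 1), /* explicitly skip the sync */
};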

When I was looking at the performance of small-io, the overall cost of doing an extra roundtrip was negligible once the IO request sizes got larger (~32K, IIRC), so the benefit here may not be that great, and may not make modifying the system interface worthwhile.

At the same time, for the use case where clients want to specify a data sync on a per-IO-request basis, letting the server know at the beginning of an IO operation that it needs to be synced may help improve the sync coalescing behavior, because it gives the server more time to determine whether multiple IO ops can be synced together.


Independent of these questions, Rob mentioned that the sync policy should maybe be changed, too: for example, to sync the data only once at the end of the flow, and to coalesce data syncs the way metadata syncs are coalesced.

This is a good idea. In fact, it sounds like we can just change the 'TroveSyncData on' semantics to sync at the end of the entire IO op instead of for each trove write call that the flow makes. In other words, we don't need to provide the user with a config option to sync for each trove write.
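
Roughly this shape, I think (the names are hypothetical; the real hook would go wherever the flow signals completion):

#include <unistd.h>

/* HYPOTHETICAL types/names, just to show the shape of the change. */
struct flow_state {
    int    bstream_fd;        /* fd of the bstream backing this flow */
    size_t total_len;         /* bytes the whole flow will move */
    size_t bytes_done;        /* bytes written so far */
    size_t last_write_len;
    int    sync_on_complete;  /* was: sync on every trove write */
};

/* With 'TroveSyncData on', sync once when the whole flow finishes
 * rather than after every trove write it issues. */
static void flow_write_complete(struct flow_state *flow)
{
    flow->bytes_done += flow->last_write_len;
    if (flow->bytes_done >= flow->total_len && flow->sync_on_complete)
        fdatasync(flow->bstream_fd);  /* one sync for the entire IO op */
}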

I think maybe the coalescing of operations should be handled by the trove module, because it knows which coalescing method is best for the implementation. Or should this be handled by an upper layer (e.g. job)?


I would put it in the dbpf layer. The queuing of operations is handled there (both metadata and io), so you can do your policy stuff most easily from there. The trove layer just acts as a wrapper for the underlying implementation, and the job layer is used by the server thread for testing completion. Since the request scheduler allows write ops on the same handle to be scheduled immediately, you should be able to manage everything in dbpf.
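
A rough sketch of what dbpf-side coalescing could look like (the structures are made up; the real dbpf queues differ):

#include <unistd.h>

/* HYPOTHETICAL dbpf-side coalescing sketch. When a synced write
 * completes, scan the operation queue for other completed writes
 * waiting on a sync of the same fd, and retire them all with a
 * single fdatasync(). */
struct dbpf_op {
    struct dbpf_op *next;
    int fd;            /* bstream fd this op wrote to */
    int needs_sync;    /* client asked for (or fs defaults to) sync */
    int done;          /* write finished, sync still pending */
};

static void dbpf_sync_coalesce(struct dbpf_op *queue, struct dbpf_op *op)
{
    struct dbpf_op *p;

    fdatasync(op->fd);                 /* one sync ... */
    for (p = queue; p; p = p->next)    /* ... covers every queued op */
        if (p->done && p->needs_sync && p->fd == op->fd)
            p->needs_sync = 0;         /* retire without another sync */
}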

If an I/O scheduler is added to the Trove layer, maybe small write requests could be combined like in ROMIO. Also, the policy might depend on the server's I/O load and pending I/O jobs.
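
For the combining part, the core operation is just merging adjacent requests once they're sorted by offset; a toy version, not ROMIO's actual code:

#include <sys/types.h>

/* Toy request-combining sketch: given requests sorted by offset,
 * merge contiguous or overlapping ones into larger requests in
 * place; returns the new request count. */
struct io_req { off_t off; size_t len; };

static int combine(struct io_req *r, int n)
{
    int i, out = 0;
    for (i = 1; i < n; i++) {
        if (r[i].off <= r[out].off + (off_t)r[out].len) {
            /* contiguous/overlapping: extend the current request */
            off_t end = r[i].off + (off_t)r[i].len;
            if (end > r[out].off + (off_t)r[out].len)
                r[out].len = (size_t)(end - r[out].off);
        } else {
            r[++out] = r[i];           /* gap: start a new request */
        }
    }
    return n ? out + 1 : 0;
}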

The problem with doing this on the server is that it's hard to know in advance that many small IO operations are being done together, unless they're all sitting in the queue waiting to be serviced. I like the pvfs2 stance of encouraging client-side data sieving, since in many cases clients aren't acting independently (if that is the pvfs2 stance; perhaps I'm projecting :-)). In our discussion yesterday, RobL pointed out that the disk scheduler should be doing some amount of read-ahead, so assuming that the disk operations are the expensive part, doing many lio_listio calls instead of coalescing them into one call may not actually matter.
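
For reference, the batched-submission pattern with POSIX AIO (link with -lrt):

#include <aio.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define NREQ 8

/* Submit several small writes in one lio_listio() call instead of
 * NREQ separate aio_write() calls; LIO_WAIT blocks until all of them
 * finish. Whether this beats separate calls depends on the disk
 * scheduler, per RobL's point above. */
int main(void)
{
    static char bufs[NREQ][4096];
    struct aiocb cbs[NREQ];
    struct aiocb *list[NREQ];
    int fd = open("/tmp/aio-test", O_CREAT | O_WRONLY | O_TRUNC, 0600);
    int i;

    if (fd < 0)
        return 1;
    memset(cbs, 0, sizeof(cbs));
    for (i = 0; i < NREQ; i++) {
        cbs[i].aio_fildes     = fd;
        cbs[i].aio_buf        = bufs[i];
        cbs[i].aio_nbytes     = sizeof(bufs[i]);
        cbs[i].aio_offset     = (off_t)i * sizeof(bufs[i]);
        cbs[i].aio_lio_opcode = LIO_WRITE;
        list[i] = &cbs[i];
    }
    if (lio_listio(LIO_WAIT, list, NREQ, NULL) != 0)
        return 1;
    close(fd);
    return 0;
}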


I will take care of the modifications and evaluate possible policies if nobody else is currently working on these issues.

Thanks,
Julian
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers

