Hi Julian,
Thanks for sending out your thoughts and ideas. Even though we've
talked about most of this offline, I'm just going to summarize what I
said in case others want to comment.
-sam
On Jun 10, 2006, at 1:57 PM, Julian Martin Kunkel wrote:
Hi,
I looked around the implementation of the data sync mode a bit.
Currently PINT_flow_setinfo is called, which sets the sync mode for
each write operation of a flow. That means if 100 MByte are
transferred in 256 KByte blocks, a sync happens for every block,
which ends up in quite a lot of syncs.
Maybe it would be nice if the client could specify in the IO request
(PVFS_servreq_io) whether the data should be synced, instead of
setting it per filesystem. Maybe the kernel interface could take
advantage of this to save sync operations, or it could be useful
elsewhere? Of course, this value could default to the filesystem's
TroveSyncData option.
In MPI there is an explicit sync via MPI_File_sync; maybe we could
rely on this for MPI apps?
This also requires an additional flag to be added to the parameters
of PVFS_sys_io. The flag would specify whether to sync or not (or
could be extended for other uses). This saves a roundtrip between
client and server because the flag can be sent along with the IO
request (as Julian proposes), instead of doing a separate flush
operation.
When I was looking at the performance of small-io, the overall cost
of doing an extra roundtrip was negligible once the IO request sizes
were larger (~ 32K IIRC), so the benefit here may not be that great,
and modifying the system interface may not make it worthwhile.
At the same time, in the use case where clients want to specify a
data sync on a per IO request basis, allowing the server to know at
the beginning of an IO operation that it needs to be synced may help
improve the sync coalescing behavior, because it gives the server
more time to determine if multiple IO ops can be synced together.
Independent of these questions, Rob mentioned that the sync policy
maybe should be changed, too: for example, to sync the data only at
the end of the flow, and to coalesce data syncs the way metadata
syncs are coalesced.
This is a good idea. In fact it sounds like we can just change the
'TroveSyncData on' semantics to sync at the end of the entire IO op
instead of for each trove write call that the flow makes. In other
words, we don't need to provide the user with a config option to
sync for each trove write.
I think maybe the coalescing of sync operations should be handled by
the trove module, because it knows which coalescing method is best
for the implementation. Or should this be handled by an upper layer
(e.g. job)?
I would put it in the dbpf layer. The queuing of operations is
handled there (both metadata and io), so you can do your policy stuff
most easily from there. The trove layer just acts as a wrapper for
the underlying implementation, and the job layer is used by the
server thread for testing completion. Since the request scheduler
allows write ops on the same handle to be scheduled immediately, you
should be able to manage everything in dbpf.
If an I/O scheduler is added to the Trove layer, maybe small write
requests could be combined, like in ROMIO. Also, the policy might
depend on the server's I/O load and pending I/O jobs.
The problem with doing this on the server is that it's hard to know in
advance that many small IO operations are being done together, unless
they're all sitting in the queue waiting to be serviced. I like the
pvfs2 stance of encouraging client-side data-sieving since in many
cases clients aren't acting independently (if that is the pvfs2
stance, perhaps I'm projecting :-)). In our discussion yesterday
RobL pointed out that the disk scheduler should be doing some amount
of read-ahead, so assuming that the disk operations are the expensive
part, doing many lio_listio calls instead of coalescing them into one
call may not actually matter.
I will take care of the modifications and evaluate possible policies
if nobody else is currently working on these issues.
Thanks,
Julian
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers