Sam Lang wrote:

On Aug 10, 2006, at 4:04 PM, Phil Carns wrote:

flow-proto-tuning.patch:
-----------
This patch adds "FlowBufferSizeBytes" and "FlowBuffersPerFlow" options to the configuration file format. They allow you to specify the buffer size that the default flow protocol will use as well as the maximum number of buffers to use per flow. Note that if you change either of these parameters, then you need to remount any active clients so that they pick up the configuration change before performing any I/O.
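For reference, a sketch of what the new options might look like in the server config file; the section placement and the values shown here are only illustrative assumptions, not defaults taken from the patch:

    <Defaults>
        # illustrative values only -- tune for your network and storage
        FlowBufferSizeBytes 262144
        FlowBuffersPerFlow 8
    </Defaults>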

max-aio.patch:
----------
This patch adds "TroveMaxConcurrentIO" to the configuration file format. It allows you to specify the maximum number of I/O operations that trove will allow to proceed concurrently (currently 16). Note from the previous email regarding AIO that depending on your access pattern, AIO may queue all of your operations anyway regardless of this setting. It probably doesn't have much effect unless you are accessing more than one file at a time, or if you are using an alternative to the stock AIO implementation.
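Again just as an illustration (section placement assumed, value matches the old hardcoded default mentioned above):

    <Defaults>
        # illustrative; 16 was the previously hardcoded limit
        TroveMaxConcurrentIO 16
    </Defaults>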


I had made the same change in Julian's branch,

Oh, ok.  We will switch over when that stuff hits trunk.

there are still a couple of things that aren't clear to me about this max value, though. First, it's a global value for all outstanding lio_listio calls the pvfs server makes, but based on your previous email comments about glibc's one-thread-per-fd oddity, it seems like we only want that value to max out per datafile. Also, after we hit the max we just queue the operations and post them once current ops have completed. If librt just queues ops and does them in FIFO order, though, it's pretty much the same thing. Why not just let librt handle the queuing? If we were to order the operations based on offsets, then it would make sense for us to queue, but we don't. Are we better at queueing than librt?

I agree that if you are using librt for aio, then this max value isn't doing much of anything :) librt's queueing isn't exactly the same thing, though. librt allows N operations in flight at a time (where N can be tuned using aio_init) by limiting the maximum number of threads that it will spawn. However, since it serializes on each fd, that limit never kicks in unless you are accessing N different files; otherwise it is really only going to do one thing at a time. The librt source that I looked at happened to default to N=16, just like Trove did.

I think the point of the aio limit in trove was to throttle I/O on the servers, but it turns out that librt was already throttling above and beyond that, so the trove limit wasn't actually decreasing the number of I/O operations posted to the kernel at all.

Maybe the throttling makes more sense when you bypass librt somehow (as in the previous patch) because then there is nothing to queue/throttle the operations besides trove?

At any rate, we decided to make this configurable before understanding the issues involved; it was just a hardcoded value we saw that looked like it should have been tunable.

I know Julian was looking at performance of aio and found results somewhere (I don't have a reference, sorry) that showed lio_listio did better in cases where multiple fds were passed to one lio_listio operation (right now we just do one fd with multiple segments to one lio_listio). I wonder if that difference is based on the glibc queuing behavior that you describe.

I would guess that the queueing behavior is the reason. I can't imagine that using separate files would make much difference once you get to the system call level.

Just a curiosity, but I wonder if the aio performance would change if we were to post multiple trove operations in the same lio_listio call, or possibly even break the bstream up into multiple files based on strip size... sounds crazy, right? :-)

On the former question, I guess it depends on who is better at coalescing: the kernel disk scheduler or the trove queue? Hopefully we can avoid splitting files up :)

-Phil

_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
