Pete Wyckoff wrote:
I've been debugging why the metadata server calls fdatasync() five
times during a single create operation. (IO server separate and
not considered here.)
In fs.conf, I had these StorageHints settings:
    TroveSyncMeta no
    TroveSyncData no
    CoalescingHighWatermark infinity
    CoalescingLowWatermark 1
(defaults from pvfs2-genconfig with trovesync off).
The logic in dbpf-sync.c goes like this:
    if (!metadata_sync)
        ++coalesce_count
        if (high_watermark > 0 && coalesce_count >= high_watermark)
            coalesce_count = 0
            sync
        if (num_pending_TROVE_SYNC_operations < low_watermark)
            coalesce_count = 0
            sync
No matter how low the low watermark, any trove operation marked as
TROVE_SYNC will cause a full sync. Changing
"CoalescingLowWatermark" to 0 fixed that---no syncs.
Do I understand this correctly? Is setting the low WM to zero what
was intended? Any non-zero value of low WM will always cause
immediate sync after every TROVE_SYNC operation---was this planned?
The intended behavior with TroveSyncMeta=no is to allow trove operations
marked as TROVE_SYNC to be completed immediately. It does this by
moving the operation to the completion queue immediately as the first
statement inside the if(!metadata_sync) block. This allows the server
thread to push through those operations (go to the next state actions,
return responses, etc.) without waiting for the sync. That said, the
code you are referring to will behave in the way you described under
'low load' conditions. If there are no other operations in the dbpf op
queue marked TROVE_SYNC (or fewer than whatever LWM is set to) when that
second check is made, we sync. By setting the LWM to 0, you're
essentially saying that you don't want to ever sync under low load
conditions.
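
To make the low-load behavior concrete, here is a minimal C sketch of the decision described above. It is not the code from dbpf-sync.c: the struct, the function names, and the choice to model "infinity" as a non-positive high watermark are all assumptions made for illustration, and the sketch returns after the first check that fires rather than falling through to the second.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the coalescing state; not the real struct. */
struct coalesce_state {
    int coalesce_count;  /* TROVE_SYNC ops completed since the last sync */
    int high_watermark;  /* <= 0 models "infinity" (check disabled) */
    int low_watermark;   /* sync when fewer than this many ops are pending */
};

/*
 * Decide whether one TROVE_SYNC operation triggers a sync when
 * TroveSyncMeta is "no". pending_sync_ops is the number of other
 * TROVE_SYNC operations still queued when the check is made.
 */
bool op_triggers_sync(struct coalesce_state *st, int pending_sync_ops)
{
    assert(pending_sync_ops >= 0);
    ++st->coalesce_count;

    /* High-load path: enough operations have been coalesced. */
    if (st->high_watermark > 0 && st->coalesce_count >= st->high_watermark) {
        st->coalesce_count = 0;
        return true;
    }
    /* Low-load path: the queue is nearly empty, so sync right away. */
    if (pending_sync_ops < st->low_watermark) {
        st->coalesce_count = 0;
        return true;
    }
    return false;
}

/* Count syncs for n operations that each see the same queue depth. */
int count_syncs(struct coalesce_state *st, int n, int pending_sync_ops)
{
    int syncs = 0;
    for (int i = 0; i < n; i++)
        if (op_triggers_sync(st, pending_sync_ops))
            syncs++;
    return syncs;
}
```

In this model, five operations arriving on an otherwise idle queue with the high watermark disabled and a low watermark of 1 produce five syncs, while a low watermark of 0 produces none. With a high watermark of 4 and three other operations always pending, eight operations produce two syncs.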
I'd like to have it sync every 5--10 ops, or from a timeout. Is
there some sort of idea that these TROVE_SYNC operations are so
special that they must run immediately, every time?
The behavior of syncing every operation should only happen under low
load, and other than delaying other operations that get posted during
that sync, there shouldn't be any performance differences from not
syncing at all. That's the idea anyway. Once more operations are
queued (meaning they're not getting serviced immediately), the
per-operation sync doesn't happen.
The five syncing MD operations in a create, for those keeping score,
are:
    create    dspace_create          (sync)
    setattr   metafile distribution  (sync)
    setattr   dspace_setattr         (sync)
    crdirent  write_directory_entry  (sync)
    crdirent  dspace_setattr         (sync)
If you look at this though, it's only doing one sync per-request
per-database:
    request 1: create    dspace_create          (dspace sync)
    request 2: setattr   metafile distribution  (keyval sync)
    request 2: setattr   dspace_setattr         (dspace sync)
    request 3: crdirent  write_directory_entry  (keyval sync)
    request 3: crdirent  dspace_setattr         (dspace sync)
That's a lot of sync on both dspace and keyval dbs. The total sync
time adds 45 ms to the overall operation on a SATA disk.
I agree, but we don't at present group requests, so there's no way to
tell the trove layer that an operation doesn't need to be synced,
because another is coming right behind it. We've talked about methods
and techniques to fix this, but as I see it, there is information loss
from client to server, and then further from server state-machines to
trove layer. Murali has been suggesting that we do transactions over an
entire PVFS system interface call, which would only require two syncs
(one for each db), but that means distributed transactions. :-)
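
As a rough illustration of the "two syncs" figure, the sketch below counts how many syncs the five operations of a create would need if each database touched were synced once at the end of the whole call. The types and function names are hypothetical and nothing here exists in PVFS. The time estimate assumes every sync costs the same, which the numbers in this thread do not establish.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for the two databases mentioned in the thread. */
enum db_kind { DB_DSPACE, DB_KEYVAL, DB_KIND_COUNT };

struct sync_op {
    int request;      /* client request that issued the operation */
    enum db_kind db;  /* database the operation dirties */
};

/* The five syncing operations of a create, as listed above. */
const struct sync_op create_ops[] = {
    {1, DB_DSPACE},  /* create:   dspace_create */
    {2, DB_KEYVAL},  /* setattr:  metafile distribution */
    {2, DB_DSPACE},  /* setattr:  dspace_setattr */
    {3, DB_KEYVAL},  /* crdirent: write_directory_entry */
    {3, DB_DSPACE},  /* crdirent: dspace_setattr */
};
#define NUM_CREATE_OPS (sizeof(create_ops) / sizeof(create_ops[0]))

/* Syncs needed if each database touched is synced once at the end. */
int syncs_per_transaction(const struct sync_op *ops, size_t n)
{
    int touched[DB_KIND_COUNT] = {0};
    int syncs = 0;
    for (size_t i = 0; i < n; i++) {
        assert(ops[i].db < DB_KIND_COUNT);
        if (!touched[ops[i].db]) {
            touched[ops[i].db] = 1;
            syncs++;
        }
    }
    return syncs;
}

/* Estimated total sync time, ASSUMING every sync costs the same. */
double estimated_sync_ms(double total_ms, int current_syncs, int new_syncs)
{
    assert(current_syncs > 0);
    return total_ms / current_syncs * new_syncs;
}
```

Under the equal-cost assumption, going from five syncs to two would take the 45 ms reported above down to about 18 ms. Real savings depend on how much dirty data each sync has to flush.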
Julian's request-id work might be useful to us in figuring out whether
to wait for a sync, esp. for the create case. I'm not sure the behavior
would be much different than what we have now, though. The design of the
sync coalescing code is really meant to perform well...err, better (sync
less frequently) under high-load conditions, since under low-load
conditions it really shouldn't matter that you're syncing every time.
Just curious, you mentioned 5 calls to fdatasync() in a single create.
That _should not_ happen, and is a bug if it does. It's the db->sync
call that we make 5 times (potentially, depending on parameters and
load). Are you seeing fdatasync() for metadata operations? Also, have
you seen a big drop in metadata performance?
Let me know.
Thanks,
-sam
-- Pete
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers