Pete Wyckoff wrote:
I've been debugging why the metadata server calls fdatasync() five
times during a single create operation.  (IO server separate and
not considered here.)

In fs.conf, I had these StorageHints settings

    TroveSyncMeta no
    TroveSyncData no
    CoalescingHighWatermark infinity
    CoalescingLowWatermark 1

(defaults from pvfs2-genconfig with trovesync off).

The logic in dbpf-sync.c goes like this:

    if (!metadata_sync)
        ++coalesce_count
        if (high_watermark > 0 && coalesce_count >= high_watermark)
            coalesce_count = 0
            sync
        if (num_pending_TROVE_SYNC_operations < low_watermark)
            coalesce_count = 0
            sync

No matter how low the low watermark, any trove operation marked as
TROVE_SYNC will cause a full sync.  Changing
"CoalescingLowWatermark" to 0 fixed that---no syncs.

Do I understand this correctly?  Is setting the low WM to zero what
was intended?  Any non-zero value of low WM will always cause
immediate sync after every TROVE_SYNC operation---was this planned?

The intended behavior with TroveSyncMeta=no is to allow trove operations marked as TROVE_SYNC to be completed immediately. It does this by moving the operation to the completion queue as the first statement inside the if (!metadata_sync) block. This allows the server thread to push through those operations (go to the next state actions, return responses, etc.) without waiting for the sync. That said, the code you are referring to will behave in the way you described under 'low load' conditions. If there are no other operations in the dbpf op queue marked TROVE_SYNC (or fewer than whatever LWM is set to) when that second check is made, we sync. By setting the LWM to 0, you're essentially saying that you don't want to ever sync under low load conditions.
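
To make the two checks concrete, here is a minimal C sketch of the decision as described in this thread. It is not the actual dbpf-sync.c code: the struct, its field names, and coalesce_should_sync() are hypothetical, and treating a non-positive high watermark as "infinity" is an assumption based on the high_watermark > 0 test in the pseudocode above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state for illustration; not the real dbpf-sync.c types. */
struct coalesce_state {
    int coalesce_count;  /* TROVE_SYNC ops completed since the last sync */
    int high_watermark;  /* assumption: <= 0 stands in for "infinity" */
    int low_watermark;
};

/*
 * Called once per TROVE_SYNC operation when TroveSyncMeta is "no".
 * pending_sync_ops is the number of other TROVE_SYNC operations still
 * waiting in the op queue. Returns true if a sync would be issued now.
 */
static bool coalesce_should_sync(struct coalesce_state *s, int pending_sync_ops)
{
    assert(s != NULL);
    s->coalesce_count++;

    /* High-load path: enough operations have piled up since the last sync. */
    if (s->high_watermark > 0 && s->coalesce_count >= s->high_watermark) {
        s->coalesce_count = 0;
        return true;
    }

    /* Low-load path: the queue has drained below the low watermark. */
    if (pending_sync_ops < s->low_watermark) {
        s->coalesce_count = 0;
        return true;
    }

    return false;
}
```

With low_watermark set to 1 and an empty queue, every call returns true, which matches the sync-per-operation behavior reported above. With low_watermark set to 0, the second test can never succeed, so only the high watermark can trigger a sync.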

I'd like to have it sync every 5--10 ops, or from a timeout.  Is
there some sort of idea that these TROVE_SYNC operations are so
special that they must run immediately, every time?

The behavior of syncing every operation should only happen under low load, and other than delaying other operations that get posted during that sync, there shouldn't be any performance differences from not syncing at all. That's the idea anyway. Once more operations are queued (meaning they're not getting serviced immediately), the per-operation sync doesn't happen.


The five syncing MD operations in a create, for those keeping score,
are:

    create dspace_create (sync)
    setattr metafile distribution (sync)
    setattr dspace_setattr (sync)
    crdirent write_directory_entry (sync)
    crdirent dspace_setattr (sync)


If you look at this though, it's only doing one sync per request per database:

request 1:     create dspace_create (dspace sync)
request 2:     setattr metafile distribution (keyval sync)
request 2:     setattr dspace_setattr (dspace sync)
request 3:     crdirent write_directory_entry (keyval sync)
request 3:     crdirent dspace_setattr (dspace sync)

That's a lot of sync on both dspace and keyval dbs.  The total sync
time adds 45 ms to the overall operation on a SATA disk.

I agree, but we don't at present group requests, so there's no way to tell the trove layer that an operation doesn't need to be synced because another is coming right behind it. We've talked about methods and techniques to fix this, but as I see it, there is information loss from client to server, and then further from server state machines to the trove layer. Murali has been suggesting that we do transactions over an entire PVFS system interface call, which would only require two syncs (one for each db), but that means distributed transactions. :-) Julian's request-id work might be useful to us in figuring out whether to wait for a sync, esp. for the create case. I'm not sure the behavior would be much different from what we have now, though. The design of the sync coalescing code is really meant to perform well...err, better (sync less frequently) under high-load conditions, since under low-load conditions it really shouldn't matter that you're syncing every time.
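
The difference between the two grouping schemes can be counted directly. The sketch below is illustrative only: the md_op struct, the enum, and both counting functions are hypothetical names, and the table simply transcribes the five operations listed above with the request and database each one belongs to.

```c
#include <assert.h>
#include <stddef.h>

enum db_kind { DB_DSPACE, DB_KEYVAL, DB_KINDS };

/* Hypothetical record of one syncing metadata operation. */
struct md_op {
    int request;      /* server request the operation belongs to */
    enum db_kind db;  /* database the operation modifies */
};

/* The five syncing operations of a create, as listed in this thread. */
static const struct md_op create_ops[] = {
    { 1, DB_DSPACE },  /* create   dspace_create */
    { 2, DB_KEYVAL },  /* setattr  metafile distribution */
    { 2, DB_DSPACE },  /* setattr  dspace_setattr */
    { 3, DB_KEYVAL },  /* crdirent write_directory_entry */
    { 3, DB_DSPACE },  /* crdirent dspace_setattr */
};
#define NUM_CREATE_OPS (sizeof create_ops / sizeof create_ops[0])

/* One sync for each distinct (request, database) pair. */
static int syncs_per_request(const struct md_op *ops, size_t n)
{
    int count = 0;
    assert(ops != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int seen = 0;
        for (size_t j = 0; j < i; j++) {
            if (ops[j].request == ops[i].request && ops[j].db == ops[i].db)
                seen = 1;
        }
        if (!seen)
            count++;
    }
    return count;
}

/* One sync for each database touched anywhere in the whole call. */
static int syncs_per_call(const struct md_op *ops, size_t n)
{
    int touched[DB_KINDS] = { 0 };
    int count = 0;
    assert(ops != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!touched[ops[i].db]) {
            touched[ops[i].db] = 1;
            count++;
        }
    }
    return count;
}
```

For the create table, syncs_per_request(create_ops, NUM_CREATE_OPS) gives 5 and syncs_per_call(create_ops, NUM_CREATE_OPS) gives 2, which is the "two syncs (one for each db)" figure for a transaction spanning the whole system interface call.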

Just curious, you mentioned 5 calls to fdatasync() in a single create. That _should not_ happen, and is a bug if it does. It's the db->sync call that we make 5 times (potentially, depending on parameters and load). Are you seeing fdatasync() for metadata operations? Also, have you seen a big drop in metadata performance?

Let me know.

Thanks,

-sam

                -- Pete
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
