> This appears to be based on Intel's original libfabric proposal?  

This is material from about 6-12 months ago; I wasn't aware of Intel's proposal 
when this was drafted. This proposal is being put forward to discuss additional 
enhancements that support the use cases described at its tail end. I think 
there is plenty of room for additional material or improvements. 

> Commit is not a valid term.

What 'commit' is named doesn't matter much to me; as long as we agree on the 
functionality of the feature, I'm not particularly picky about names. From the 
slides I've seen, 'flush' is equivalent to the 'commit' I've suggested here. 

> Complete RDMA memory placement extension support looks different than it did 
> in that original proposal.  

I've looked at the slides and the IBTA proposal. It looks to me like it should 
be supported by this proposal. Please point out any gaps so we can address 
them. 

I think this proposal is still well-aligned with the IBTA draft. If you don't 
think so, feel free to point out where and we can work through the 
details/definition. 

> Does it support RDMA Flush, Write Atomic and Verify?

It should. RDMA Flush, Write Atomic, and Verify should be supported via 
existing functions (fi_writemsg/fi_atomic with FI_COMMIT_COMPLETE). A 'flush' 
appears to be analogous to the proposed 'fi_commit'. 

> How do you register cached vs uncached pmem?

I don't think I understand the question, so please explain in more detail.  
What effect does cached or uncached PMEM have on the definition of the 
top-level API? 

> Is this already in the wild?  If not we shouldn’t release it without further 
> consideration.

This is a proposal for a modification to libfabric. It is not in master; it is 
simply a proposal. 

--
James Swaro
P: +1 (651) 605-9000

On 5/1/20, 9:51 AM, "Douglas, Chet R" <[email protected]> wrote:

    It matters!  Eventually (now?) we want full RDMA extension support in 
libfabrics, libibverbs, and the verbs spec.  This appears to be based on 
Intel's original libfabric proposal?  Commit is not a valid term. Complete RDMA 
memory placement extension support looks different than it did in that original 
proposal.  We need to architect the complete solution.  Don’t we?  Does it 
support RDMA Flush, Write Atomic and Verify?  How do you register cached vs 
uncached pmem?  Is this already in the wild?  If not we shouldn’t release it 
without further consideration.
    
    -----Original Message-----
    From: ofiwg <[email protected]> On Behalf Of Rupert Dance 
- SFI
    Sent: Friday, May 01, 2020 8:31 AM
    To: 'Swaro, James E' <[email protected]>; Hefty, Sean 
<[email protected]>; [email protected]
    Subject: Re: [ofiwg] Proposal for enhancement to support additional 
Persistent Memory use cases (ofiwg/libfabric#5874)
    
    Is this team aware of what the IBTA is doing with PME or does it not matter 
since it is libfabrics?
    
    -----Original Message-----
    From: ofiwg <[email protected]> On Behalf Of Swaro, James 
E
    Sent: Friday, May 01, 2020 9:41 AM
    To: Hefty, Sean <[email protected]>; [email protected]
    Subject: Re: [ofiwg] Proposal for enhancement to support additional 
Persistent Memory use cases (ofiwg/libfabric#5874)
    
    >    >      *       This allows a memory region to be registered as being 
capable of
    >    > persistence. This has already been introduced into the upstream 
libfabric GITHUB, but
    >    > should be reviewed to ensure it matches use case requirements.
        
    >    FI_RMA_PMEM is defined as a MR flag.  Note that this definition 
intentionally limits non-RMA transfers from taking advantage of persistent 
memory semantics.
        
    >    The intent of this flag is to give providers implementation 
flexibility, specifically based on hardware/software differences.
        
    Understood. The intent of this section of the proposal was to outline 
potential areas for change. Any questions posed here were historical and meant 
to provoke discussion. They might even be a little dated. Those changes and the 
rationale are discussed below.
    
    
    >    > every operation. That type of operation would make the delivery of 
completion events
    >    > take longer than necessary for most operations, so SHMEM would need 
finer control over
    >    > commit flushing behavior.
        
    >    OFI does not require that an event be generated for every transfer.  
It also allows transfers to report completions using 'lower' completion 
semantics, such as FI_TRANSMIT_COMPLETE.  Completion events at the target of an 
RMA write requires the FI_RMA_EVENT capability, and is independent from PMEM.
    
    Understood. This paragraph was intended to address a complication that was 
raised in one of the meetings. 
    
    It was discussed that with some applications, all or most data would be 
required to be persistent. The solution at the time was to provide 
FI_COMMIT_COMPLETE as part of the default TX op_flags, which would incur a 
higher cost to provide that level of completion. The goal with this proposal 
is to allow upper layers to set a less strict completion model, such as 
delivery or transmit complete, as the default op_flag or per-operation flag, 
and to address persistence as a batch operation via the fi_commit API.  
        
    
    >    >      *       A single request to fi_commit should generate a control 
message to target
    >    > hardware or software emulation environment to flush the contents of 
memory targets.
        
    >    This needs to be defined in terms of application level semantics, not 
implementation details.  fi_commit could be a no-op based on the provider 
implementation.  (It actually would be for the socket and tcp providers, which 
act at the target based on the MR flag.)
    
    Completely agree. Rereading this proposal, I meant to shift some of these 
discussion points away from implementation toward behavior and semantics. How 
fi_commit behaves w.r.t. implementation specifics isn't within the scope of 
this proposal. Implementation details are something I'd prefer to stay away 
from so we can define how we expect the feature to behave.  
        
    >     > flexibility in the API design to future proof against options we 
might not conceive of
    >    > until after the prototype is complete, and the context available for 
the user and
    >    > returned with the completion
        
    >    The proposed definition is limited to RMA (and atomic) writes.  There 
is no mechanism for handling RMA reads into persistent memory, for example.  
That should be included.  Message transfers may need a separate mechanism for 
this.  That can be deferred (left undefined by the man pages), but ideally we 
should have an idea of how to support it.
        
    >    The best existing API definition for an fi_commit call would be the 
fi_readmsg/fi_writemsg() calls.  We could even re-use those calls by adding a 
flag.
    
    The proposed definition is limited to RMA and AMO because we didn't have a 
strong use case for messaging, but I'd like to take the route that allows 
messaging to be easily included if that changes down the road.
    
    
    >    >      *       Since this API behaves like a data transfer API, it is 
expected that this
    >    > API would generate a completion event to the local completion queue 
associated with the
    >    > EP from which the transaction was initiated against.
        
    >    The generation of a *CQ* event makes sense.  We need to define if and 
how counters, locally and remote, are updated.  EQ events are not the right API 
match.
    
    Agreed on the CQ aspect. As a note, EQs are not being discussed for the 
initiator, only the target, so I'll put my EQ comments in the next comment. As 
a general comment, I think that this could be a good candidate for discussion 
at the next OFIWG because it is a strange grey area to me. 
    
    >    >      *       At the target, this should generate an event to the 
target's event queue –
    >    > if and only if the provider supports software emulated events. If a 
provider is capable
    >    > of hardware level commits to persistent memory, the transaction 
should be consumed
    >    > transparently by the hardware, and does not need to generate an 
event at the target.
    >    > This will require an additional event definition in libfabric (See 
definition for
    >    > fi_eq_commit_entry)
        
    >    This too needs to be defined based on the application level semantics, 
not implementation.  The app should not be aware of implementation differences, 
except where mode bits dictate for performance reasons.  (And I can say that 
developers hate dealing with those differences, so we need to eliminate them.)
        
    >    If we limit commit to RMA transfers, it makes sense for it to act as 
an RMA call for most purposes (i.e. fi_readmsg/fi_writemsg).  For example, the 
ability to carry CQ data and generate remote events (FI_RMA_EVENTS) on the 
target CQ and counters.  We also need to consider if there's any impact on 
counters associated with the MR.
    
    I agree that this needs to be defined in terms of application-level 
behavior. However, I do think we need to discuss if and how applications 
should be expected to facilitate the desired functionality when the hardware 
is not capable of it. How a provider like sockets implements the 
functionality isn't important to define here, but if the provider needs the 
application to interact or configure itself in a specific way, then that 
should be covered here. If there isn't hardware support for 
FI_COMMIT_COMPLETE, the problem becomes much more difficult. Libfabric could 
provide events to the application through EQ or CQ events, or go a similar 
route as HMEM is going now. I'd prefer to provide events to the application 
rather than attempt to support every PMEM library/hardware combination when 
handling the software emulation case.      
        
    >    > *    A new EQ event definition (fi_eq_commit_entry) to support 
software-emulated
    >    > persistence for devices that cannot provide hardware support
    >    > 
    >    >      *       The iov, and count variables mirror the original iov, 
and count contents of
    >    > the originating request.
    >    >      *       The flags may be a diminished set of flags from the 
original transaction
    >    > under the assumption that only some flags would have meaning at the 
target and sending
    >    > originator-only flags to the target would have little value to the 
target process.
        
    >    If any events are generated, they need to be CQ related, not EQ.
    
    This is where I believe it becomes a grey area. I could see using 
FI_RMA_EVENT or something similar to provoke a CQ event generated at the 
target, but fi_commit doesn't feel like a data transfer operation. It seems 
like a control operation, which is why it was defined as generating an EQ 
event. 
    
    
    >    > *    Additional flags or capabilities
    >    > 
    >    >      *       A provider should be able to indicate whether they 
support software
    >    > emulated notifications of fi_commit, or whether they can handle 
hardware requests for
    >    > commits to persistent memory
        
    >    The implementation of hardware vs software should not be exposed.  
Hybrid solutions (e.g. RxM or large transfers over verbs devices) are also 
possible.
    
    If libfabric provides an event to the upper layer, I believe libfabric can 
support many more persistent memory models and devices by propagating events 
to the upper layer than by attempting to build that capability into libfabric 
and support it transparently for the user. In my view, application writers 
have asked us to optimize data transfers over the network with the 
abstraction we provide. This could be another complicated topic to discuss at 
the next OFIWG. 
    
    
    >     The FI_RMA_PMEM capability should be sufficient to indicate support 
for RMA reads and writes to persistent memory.  That should be an inclusive 
flag (along with the API version) indicating that all related operations are 
supported.
    
    Something like this?
    
    #define FI_PMEM  (FI_RMA_PMEM | FI_AMO_PMEM | FI_MSG_PMEM)
    
    
    >      Support for messaging requires additional definitions.  Part of the 
discussion is figuring out the scope of what should be defined in the short 
term.  As mentioned above, FI_FENCE | FI_COMMIT_COMPLETE can be used to commit 
message transfers.  I can't think of a better alternative here.  However, I'm 
not sure if the proposed IBTA and IETF specifications will result in hardware 
capable of supporting the FI_FENCE | FI_COMMIT_COMPLETE semantic.  :/
        
    
    Agreed on messaging, but it lacks a good use case yet so I haven't been as 
concerned. 
    
    I'm not yet convinced on FI_COMMIT_COMPLETE|FI_FENCE. If libfabric 
suggested the use of that, does that imply that providers must support 
0-length sends and/or control messaging on behalf of the application? Does 
the data transfer itself provide any context for the region being flushed? 
What happens in the case of multiple persistent memory domains or devices? 
How would that data transfer provide the context necessary to flush a 
specific region, memory domain, or device? This seems more complicated than 
the initial suggestion indicates.
    
    >    > *    Addition of an event handler registration for handling event 
queue entries within
    >    > the provider context (See Definition: fi_eq_event_handler)
    >    > 
    >    >      *       Essentially, this becomes a registered callback for the 
target application
    >    > to handle specific event types. We can use this mechanism with the 
target application
    >    > to allow the provider to handle events internally using a function 
provided by the
    >    > application. The function would contain the logic necessary to 
handle the event
        
    >    Callbacks are to be avoided.  They present difficult locking scenarios 
with severe restrictions on what the application can do from the callback, and 
present challenging object destruction situations.  Those restrictions can be 
difficult for an application to enforce, since calls outside the app to other 
libraries may violate them.
    
    It's a good argument, and generally I feel the same way. What do you 
suggest as an alternative? Callbacks were suggested as a way for the provider 
to perform some behavior on behalf of the application upon receipt of the 
associated event. This would have allowed the provider to issue the 
commit/flush to the device and then return the ACK to the initiator that the 
commit had succeeded and the data was flushed as requested. Without a 
callback, I do not see a clean way for libfabric to coordinate the flush and 
the acknowledgement back to the initiator. 
    
    >    To be clear, the proposal only supports RMA writes, and maybe atomics, 
to the target memory.  That is likely sufficient for now, but I'd like to 
ensure that we have a way to extend pmem support beyond the limited use cases 
being discussed.
        
    RMA and atomics -- with the intent not to exclude messaging. This is why 
the naming change from FI_RMA_PMEM to FI_PMEM was suggested.
    
    
    >    >      *       Previous functionality allows for a commit for every 
message as is the case
    >    > for FI_COMMIT_COMPLETE, or the use of FI_COMMIT on a per-transaction 
basis. The need in
    >    >  ...
    >    > delivery model, and provides a mechanism to ensure that those data 
transfers are
    >    > eventually persisted.
        
    >    Unless the app has set FI_COMMIT_COMPLETE as the default completion 
model, it only applies to the operation on which it was set.  The main gap I'm 
aware of with proposed specifications is support of a 'flush' type semantic.
    
    The flush mechanic is the primary gap that the proposal is attempting to 
identify. However, I believe the software emulation elements of the proposal 
are valuable for prototyping efforts. 
    
    --
    James Swaro
    P: +1 (651) 605-9000
    
    On 4/27/20, 9:38 PM, "Hefty, Sean" <[email protected]> wrote:
    
        Top-posting main discussion point.  Other comments further down:
        
        Conceptually, what's being proposed is specifying a data transfer as a 
2-step process.
        
        1. identify the data source and target
        2. specify the completion semantic
        
        Theoretically, the actual data transfer can occur any time after step 1 
and before step 2 completes.  As an additional optimization, step 2 can apply 
to multiple step 1s.
        
        We need to decide:
        
        A. What completion semantic applies to step 1?
        B. What operations do we support for step 1?
        C. What completion semantics are supported for step 2?
        
        The current answers are:
        
        A. All completion levels are supported.  It's possible that none of 
them are desirable here, and we need to introduce a new mode: 
FI_UNDEFINED_COMPLETE.  This would indicate that the buffer cannot be re-used, 
and the data is not visible at the target, until a step 2 that covers the 
same target memory range completes.
        
        B. RMA reads and writes are supported.  It shouldn't be difficult to 
support atomics through the same APIs as well.  Message transfers are more 
difficult to specify in step 2, making them harder to support.
         
        C. The proposal only supports FI_COMMIT_COMPLETE.  Other levels could 
be added, though that may only make sense if we define something like 
FI_UNDEFINED_COMPLETE.
        
        I'm throwing FI_UNDEFINED_COMPLETE out for discussion.  There would be 
issues trying to define it, since data transfers issued at step 1 could 
generate completions locally and remotely prior to step 2 being invoked.  Those 
completions just wouldn't mean anything until step 2 completes.  The provider 
would select the best completion option for step 1.
        
        
        > Libfabric requires modifications to support RMA and atomic operations 
targeted at
        > remote memory registrations backed by persistent memory devices. 
These modifications
        > should be made with the intent to drive support for persistent memory 
usage by
        > applications that rely on communications middleware such as SHMEM in 
a manner that is
        > consistent with byte-based/stream-based addressable memory formats. 
Existing proposals
        > (initial proposal) support NVMe/PMoF approaches, whereas this approach should
        > support flat memory, non-block addressed memory structures and devices.
        > 
        > Changes may be required in as many as three areas:
        > 
        > *     Memory registration calls
        > 
        >       *       This allows a memory region to be registered as being 
capable of
        > persistence. This has already been introduced into the upstream 
libfabric GITHUB, but
        > should be reviewed to ensure it matches use case requirements.
        
        FI_RMA_PMEM is defined as a MR flag.  Note that this definition 
intentionally limits non-RMA transfers from taking advantage of persistent 
memory semantics.
        
        The intent of this flag is to give providers implementation 
flexibility, specifically based on hardware/software differences.
        
        
        > *     Completion semantics
        > 
        >       *       These changes allow a completion event or notification 
to be deferred until
        > the referenced data has reached the persistence domain at the target. 
This has already
        > been introduced into the upstream libfabric GITHUB, but should be 
reviewed to ensure it
        > matches use case requirements.
        
        Completion semantics may be adjusted on a per transfer basis.  The 
FI_COMMIT_COMPLETE semantic applies to both the initiator and target.  
Completion semantics are a minimal guarantee from a provider.  The provider can 
do more.
        
        > *     Consumer control of persistence
        > 
        >       *       As presently implemented in the upstream libfabric 
GITHUB, persistence is
        > determined on a transaction-by-transaction basis. It was acknowledged 
at the time that
        > this is a simplistic implementation. We need to reach consensus on 
the following:
        > 
        >               *       Should persistence be signaled on the basis of 
the target memory
        > region? For example, one can imagine a scheme where data targeted at 
a particular
        > memory region is automatically pushed into the persistence domain by 
the target,
        > obviating the need for any sort of commit operation.
        
        In cases where a commit operation is not needed, it can become a no-op, 
but it may be required functionality for some providers.
        
        
        >               *       Is an explicit 'commit' operation of some type 
required, and if so,
        > what is the scope of that commit operation? Is there a persistence 
fence defined such
        > that every operation prior to the fence is made persistent by a 
commit operation?
        
        With the current API, persistence can be achieved by issuing a 0-length 
RMA with FI_COMMIT_COMPLETE | FI_FENCE semantics.  The fence requires that 
*all* prior transfers over that endpoint meet the requested completion semantic.
        
        This may not be ideal, but may be the best way to handle message 
transfers to persistent memory.
        
        
        > Proposal
        > 
        > The experimental work in the OFIWG/libfabric branch is sufficient for 
the needs of
        > SHMEM, with exception to the granularity of event generation. When 
the current
        > implementation generates events, it would generate commit-level 
completion events with
        > every operation. That type of operation would make the delivery of 
completion events
        > take longer than necessary for most operations, so SHMEM would need 
finer control over
        > commit flushing behavior.
        
        OFI does not require that an event be generated for every transfer.  It 
also allows transfers to report completions using 'lower' completion semantics, 
such as FI_TRANSMIT_COMPLETE.  Completion events at the target of an RMA write 
requires the FI_RMA_EVENT capability, and is independent from PMEM.
        
        > To satisfy this, the following is being proposed:
        > 
        > *     A new API: fi_commit (See definitions: fi_commit)
        >       The new API would be used to generate a commit instruction to a 
target peer. The
        > instruction would be defined by a set of memory registration keys, or 
regions by which
        > the target could issue a commit to persistent memory.
        
        See discussion at the top.
        
        
        >       *       A single request to fi_commit should generate a control 
message to target
        > hardware or software emulation environment to flush the contents of 
memory targets.
        
        This needs to be defined in terms of application level semantics, not 
implementation details.  fi_commit could be a no-op based on the provider 
implementation.  (It actually would be for the socket and tcp providers, which 
act at the target based on the MR flag.)    
        
        > Memory targets are defined by the iov structures, and key fields – 
and the number of
        > memory targets are defined by the count field. The destination 
address is handled by
        > the dest_addr field. The flags field is held reserved at this time to 
allow for
        > flexibility in the API design to future proof against options we 
might not conceive of
        > until after the prototype is complete, and the context available for 
the user and
        > returned with the completion
        
        The proposed definition is limited to RMA (and atomic) writes.  There 
is no mechanism for handling RMA reads into persistent memory, for example.  
That should be included.  Message transfers may need a separate mechanism for 
this.  That can be deferred (left undefined by the man pages), but ideally we 
should have an idea of how to support it.
        
        The best existing API definition for an fi_commit call would be the 
fi_readmsg/fi_writemsg() calls.  We could even re-use those calls by adding a 
flag.
        
        >       *       Since this API behaves like a data transfer API, it is 
expected that this
        > API would generate a completion event to the local completion queue 
associated with the
        > EP from which the transaction was initiated against.
        
        The generation of a *CQ* event makes sense.  We need to define if and 
how counters, locally and remote, are updated.  EQ events are not the right API 
match.
        
        
        >       *       At the target, this should generate an event to the 
target's event queue –
        > if and only if the provider supports software emulated events. If a 
provider is capable
        > of hardware level commits to persistent memory, the transaction 
should be consumed
        > transparently by the hardware, and does not need to generate an event 
at the target.
        > This will require an additional event definition in libfabric (See 
definition for
        > fi_eq_commit_entry)
        
        This too needs to be defined based on the application level semantics, 
not implementation.  The app should not be aware of implementation differences, 
except where mode bits dictate for performance reasons.  (And I can say that 
developers hate dealing with those differences, so we need to eliminate them.)
        
        If we limit commit to RMA transfers, it makes sense for it to act as an 
RMA call for most purposes (i.e. fi_readmsg/fi_writemsg).  For example, the 
ability to carry CQ data and generate remote events (FI_RMA_EVENTS) on the 
target CQ and counters.  We also need to consider if there's any impact on 
counters associated with the MR.
        
        
        > *     A new EQ event definition (fi_eq_commit_entry) to support 
software-emulated
        > persistence for devices that cannot provide hardware support
        > 
        >       *       The iov, and count variables mirror the original iov, 
and count contents of
        > the originating request.
        >       *       The flags may be a diminished set of flags from the 
original transaction
        > under the assumption that only some flags would have meaning at the 
target and sending
        > originator-only flags to the target would have little value to the 
target process.
        
        If any events are generated, they need to be CQ related, not EQ.
        
        
        > *     Additional flags or capabilities
        > 
        >       *       A provider should be able to indicate whether they 
support software
        > emulated notifications of fi_commit, or whether they can handle 
hardware requests for
        > commits to persistent memory
        
        The implementation of hardware vs software should not be exposed.  
Hybrid solutions (e.g. RxM or large transfers over verbs devices) are also 
possible.
        
        
        >               *       An additional flag should be introduced to the 
fi_info structure
        > under modes: FI_COMMIT_MANUAL (or something else)
        
        The FI_RMA_PMEM capability should be sufficient to indicate support for 
RMA reads and writes to persistent memory.  That should be an inclusive flag 
(along with the API version) indicating that all related operations are 
supported.
        
        
        >                       *       This flag would indicate to the 
application that events may be
        > generated to the event queue for consumption by the application. 
Commit events would be
        > generated upon receipt of a commit message from a remote peer, and 
the application
        > would be responsible for handling the event.
        >                       *       Lack of the FI_COMMIT_MANUAL flag, and 
the presence of the
        > FI_RMA_PMEM (or FI_PMEM) flag in the info structure should imply that 
the hardware is
        > capable of handling the commit requests to persistent memory and the 
application does
        > not need to read the event queue for commit events.
        > 
        > *     Change of flag definition
        > 
        >       *       The FI_RMA_PMEM flag should be changed to FI_PMEM to 
indicate that the
        > provider is PMEM aware, and supports RMA/AMO/MSG operations to and 
from persistent
        > memory.
        >       *       There may be little value in supporting messaging 
interfaces, but it is
        > something that could be supported.
        
        Support for messaging requires additional definitions.  Part of the 
discussion is figuring out the scope of what should be defined in the short 
term.  As mentioned above, FI_FENCE | FI_COMMIT_COMPLETE can be used to commit 
message transfers.  I can't think of a better alternative here.  However, I'm 
not sure if the proposed IBTA and IETF specifications will result in hardware 
capable of supporting the FI_FENCE | FI_COMMIT_COMPLETE semantic.  :/
        
        
        > *     Addition of an event handler registration for handling event 
queue entries within
        > the provider context (See Definition: fi_eq_event_handler)
        > 
        >       *       Essentially, this becomes a registered callback for the 
target application
        > to handle specific event types. We can use this mechanism with the 
target application
        > to allow the provider to handle events internally using a function 
provided by the
        > application. The function would contain the logic necessary to handle 
the event
        
        Callbacks are to be avoided.  They present difficult locking scenarios 
with severe restrictions on what the application can do from the callback, and 
present challenging object destruction situations.  Those restrictions can be 
difficult for an application to enforce, since calls outside the app to other 
libraries may violate them.
        
        
        >       *       Specific to PMEM, a function handler would be used by 
the target
        > application to handle commits to persistent memory as they were 
delivered without
        > requiring a fi_eq_read and some form of acknowledgement around the 
commit action. With
        > the handler, the commit could be handled entirely by the function 
provided by the
        > application, and the return code from the application provided 
call-back would be
        > sufficient for a software emulation in the provider to produce the 
return message to
        > the sender that the commit transaction is fully complete. The use of 
a handler allows
        > us to make the commit transaction as light-weight, or heavy-weight as 
necessary.
        > 
        > Definitions:
        > 
        > fi_commit
        > 
        > ssize_t fi_commit(struct fid_ep *ep,
        >                   const struct fi_rma_iov *iov,
        >                   size_t count,
        >                   fi_addr_t dest_addr,
        >                   uint64_t flags,
        >                   void *context);
        > 
        > fi_eq_commit_entry
        > 
        > struct fi_eq_commit_entry {
        >     fid_t fid;                     /* fid associated with request */
        >     const struct fi_rma_iov *iov;  /* iovec of memory regions to be
        >                                       committed to persistent memory */
        >     size_t count;                  /* number of iovec/key entries */
        >     uint64_t flags;                /* operation-specific flags */
        > };
        > 
        > fi_eq_event_handler
        > 
        > typedef ssize_t (*fi_eq_event_handler_t)(struct fid_eq *eq,
        >     uint64_t event_type,
        >     void *event_data,
        >     uint64_t len,
        >     void *context);
        > 
        > ssize_t fi_eq_register_handler(struct fid_eq *eq,
        >     uint64_t event_type,
        >     fi_eq_event_handler_t handler,
        >     void *context);
        > 
        > Use cases supported by this proposal:
        > 
        > *     As an application writer, I need to commit multiple 
previously-sent data
        > transfers to the persistence domain
        
        To be clear, the proposal only supports RMA writes, and maybe atomics, 
to the target memory.  That is likely sufficient for now, but I'd like to 
ensure that we have a way to extend pmem support beyond the limited use cases 
being discussed.
        
        
        >       *       Previous functionality allows for a commit for every 
message as is the case
        > for FI_COMMIT_COMPLETE, or the use of FI_COMMIT on a per-transaction 
basis. The need in
        > this use case is performance-oriented, to allow less strict delivery 
model to the NIC
        > for most messages followed up with a 'flush' of the NIC to the 
persistence domain. This
        > allows most messages targeted to the persistence domain to complete 
with a less strict
        > delivery model, and provides a mechanism to ensure that those data 
transfers are
        > eventually persisted.
        
        Unless the app has set FI_COMMIT_COMPLETE as the default completion 
model, it only applies to the operation on which it was set.  The main gap I'm 
aware of with proposed specifications is support of a 'flush' type semantic.
        
        
        - Sean
        
    
    _______________________________________________
    ofiwg mailing list
    [email protected]
    https://lists.openfabrics.org/mailman/listinfo/ofiwg 
    
    
