I have a few comments on the semantics of memory regions, and how they relate to usage scenarios for memory notifiers and/or page faulting.
First, there is nothing in RDMA semantics that demands that each page of a memory region be pre-mapped to a physical page before the page can be advertised remotely. What is expected is that these advertisements not be at risk. There has to be an honest expectation that if a 40-page buffer is advertised, there are 40 pages available to back that advertisement. It is simply unacceptable for one end of an RDMA connection to back up the network because it cannot plan its buffer allocations. Network retransmission is not a handy spare scratchpad where buffers can be "cached".

This is somewhat akin to guaranteeing a landing slot for an airplane. You don't really need to 'pin' the landing resources for the specific plane for the entire duration of the flight, but you had better have more than just good intentions and a best effort to find somewhere for the plane to land when it finally arrives.

When there is no buffer available, the contract has been broken. Having failed to meet the requirements, the receiver should assume that the connection will be torn down. But there is a little bit of wiggle room here. There is no need to mandate that the connection MUST be torn down. This was explicitly discussed by the IETF's RDDP working group while drafting the iWARP RFCs. If there is a fault, the connection MAY be torn down, but an implementation MAY take extra steps as part of a fault-tolerance strategy to avoid this. Dropping a packet and generating a page fault to the host as a fault-recovery strategy is a legitimate option. But applications MUST NOT rely on the transport layer providing this service. It's somewhat like catching divide-by-zero errors. It's nice if the OS/library/compiler builds in mechanisms to recover from divide-by-zero errors, but that does not mean that applications should go around dividing by zero.

RDMA wire semantics requires that a sufficient number of pages are committed, and that these are the pages as they will be viewed by the application. There is nothing in the protocol that is inconsistent with an OS or Hypervisor *substituting* pages in a memory region (as long as it is done in a way that honors updates to those pages). Great care must be taken when substituting pages that are DMA-accessible, but substituting pages out from under a running application isn't exactly trivial either. Virtual Memory Managers (either OS or hypervisor) should be presumed to understand when they have to preserve the contents of a page.

RDMA presents some special challenges here because the RDMA layer has no knowledge of the intended usage of tagged memory buffers, nor does it track the history of access using R-Keys/STags. So the RDMA protocols do allow flexibility in what an R-Key/STag maps to even while the R-Key or STag is externally advertised. But existing RDMA verbs have no support for updating the meaning of an R-Key/STag without first invalidating it (a sketch of that pattern with today's verbs follows below). However, that is a verbs/implementation issue -- not an RDMA wire protocol requirement. New APIs that allow Virtual Memory Managers to substitute pages in a Memory Region are feasible and may have valuable use cases, but they need to be introduced on an evolutionary basis. Existing hardware will not support them. But as long as such features are not used to enable irresponsible over-subscription of pages, there is no reason why new devices (or even sufficiently updatable existing devices) could not support such concepts.
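To make the current situation concrete, here is a minimal sketch using today's libibverbs calls (ibv_reg_mr, ibv_dereg_mr). The protection domain and buffer setup are assumed to exist elsewhere, and the 40-page size simply echoes the example above. Registration pins the pages, so the advertised R-Key is always backed; the only way to change what the key maps to is to invalidate it and register anew, which yields a new key:

/* Sketch only: assumes a libibverbs environment with an open device
 * context and an allocated protection domain (pd). */
#include <stdio.h>
#include <infiniband/verbs.h>

#define BUF_PAGES 40
#define PAGE_SIZE 4096

/* Register a 40-page buffer: the verbs layer pins the pages, so the
 * advertised rkey is always backed by real memory. */
struct ibv_mr *advertise_buffer(struct ibv_pd *pd, void *buf)
{
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, BUF_PAGES * PAGE_SIZE,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr)
        return NULL;
    /* mr->rkey is what gets advertised to the peer. */
    printf("advertising rkey 0x%x\n", mr->rkey);
    return mr;
}

/* The only way to change what the tag maps to with existing verbs:
 * invalidate (deregister) and register again.  Note that the new mr
 * gets a NEW rkey, which must be re-advertised to the peer. */
struct ibv_mr *remap_buffer(struct ibv_pd *pd, struct ibv_mr *old_mr,
                            void *new_buf)
{
    if (ibv_dereg_mr(old_mr))   /* the old rkey is now invalid */
        return NULL;
    return advertise_buffer(pd, new_buf);
}

The point to notice is that remap_buffer cannot preserve the old rkey; the peer has to be told about the new one. That is exactly the limitation a page-substitution verb would remove.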
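Purely as a thought experiment -- emphatically not an existing verb in any shipping API -- a page-substitution call might look something like the declaration below. Every name here is invented for illustration:

/* HYPOTHETICAL -- no such verb exists today.  A possible shape for a
 * page-substitution call that would let a Virtual Memory Manager
 * repoint one page of a live registration while the R-Key/STag
 * remains advertised and valid.  The VMM is responsible for copying
 * the old page's contents first, so that updates to the page are
 * honored, per the wire semantics discussed above. */
#include <stddef.h>

struct ibv_mr;  /* the live registration, from <infiniband/verbs.h> */

int substitute_mr_page(struct ibv_mr *mr,     /* region to patch */
                       size_t page_offset,    /* page-aligned offset */
                       void *new_page);       /* replacement page */

Whether such a call can be implemented efficiently (quiescing in-flight DMA to the old page, for instance) is a device question, which is why it could only arrive on an evolutionary basis.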
RDMA devices already generate a "fault" when they cannot place data into host memory. The difference is whether they can be instructed to drop the packet before acking it rather than terminating the connection. And the host can respond to the fault either by terminating the connection or by repairing the problem.
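On the host side, today's fault reporting surfaces through the verbs async event queue. Here is a rough sketch, assuming an open device context from ibv_open_device; the repair path is deliberately left abstract because, as argued above, it is an optional fault-tolerance strategy that applications MUST NOT rely on:

#include <stdio.h>
#include <infiniband/verbs.h>

/* Drain async events and decide, per fault, between teardown and
 * (where the device supports it) repair-and-resume. */
void handle_faults(struct ibv_context *ctx)
{
    struct ibv_async_event ev;

    /* ibv_get_async_event() blocks until the device reports an event. */
    while (ibv_get_async_event(ctx, &ev) == 0) {
        switch (ev.event_type) {
        case IBV_EVENT_QP_FATAL:
            /* The contract was broken and the device gave up.  A
             * fault-tolerant implementation might repair the backing
             * pages and resume ev.element.qp; the realistic response
             * today is to tear the connection down. */
            fprintf(stderr, "fatal QP event: tearing down\n");
            break;
        default:
            /* Other events (port changes, etc.) are not faults. */
            break;
        }
        ibv_ack_async_event(&ev);
    }
}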
