> still, I am not sure to be with you, the mads used by the CM aren't
> reliable, correct?
> so I don't see why/how a mad containing e.g junk DLID completes with
> error...
CM mads aren't reliable; however, they are retried.  If a CM REQ does not
receive a response after a set number of retries (usually 15), the REQ fails
with a timeout status.  The mad layer reports the timeout to the cm module.
With snooping in place, a user will be notified that a mad send has failed
and be given a copy of the mad.

At a higher level, this would be one usage model:

1. App calls rdma_getaddrinfo()
2. The librdmacm contacts the ibacm for path record data.
3. ibacm returns a path record.  The path record _may_ have come from
   cached data.
4. The librdmacm tries to establish a connection.
5. The kernel ib_cm module issues REQ.
6. The ib_mad module retries the REQ until it times out.
7. The mad timeout is reported to any users wishing to capture errors.

   In this example, the ibacm service would be registered and receive a
   copy of the failed REQ.  The ibacm can look at the data in the REQ, see
   if it has cached path record data which matches, and remove the cached
   data if so.  If the REQ data cannot be found (for example, someone sent
   a REQ with a junk DLID), it simply discards the captured mad.

8. The librdmacm will see a connection failure.
9. The librdmacm can request a new path from the ibacm and retry.

- Sean
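
To make step 7 more concrete, here is a rough Python sketch of the matching
logic a snooping service could apply to a failed REQ.  This is not ibacm code:
PathRecord, FailedReq, PathCache and on_failed_req are names made up for
illustration, and it assumes the cache is keyed by destination GID and that a
"match" means the LIDs carried in the REQ agree with the cached record.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class PathRecord:
    """Minimal stand-in for a cached path record (hypothetical layout)."""
    slid: int
    dlid: int
    sgid: str
    dgid: str


@dataclass(frozen=True)
class FailedReq:
    """Path fields copied out of a CM REQ whose send timed out (hypothetical)."""
    local_lid: int
    remote_lid: int
    local_gid: str
    remote_gid: str


class PathCache:
    """Toy path record cache keyed by destination GID (an assumption)."""

    def __init__(self) -> None:
        self._by_dgid: Dict[str, PathRecord] = {}

    def add(self, rec: PathRecord) -> None:
        self._by_dgid[rec.dgid] = rec

    def lookup(self, dgid: str) -> Optional[PathRecord]:
        return self._by_dgid.get(dgid)

    def on_failed_req(self, req: FailedReq) -> bool:
        """Drop the cached record that the failed REQ was built from.

        Returns True if a matching record was removed, False if the REQ
        matched nothing (e.g. junk DLID) and was simply discarded.
        """
        rec = self._by_dgid.get(req.remote_gid)
        if rec is None:
            return False
        if rec.dlid != req.remote_lid or rec.slid != req.local_lid:
            return False
        del self._by_dgid[rec.dgid]
        return True


cache = PathCache()
cache.add(PathRecord(slid=1, dlid=7, sgid="fe80::1", dgid="fe80::7"))

# A REQ with a junk DLID matches nothing, so the cache is left alone.
junk_removed = cache.on_failed_req(FailedReq(1, 0xDEAD, "fe80::1", "fe80::7"))

# A REQ built from the cached record invalidates that record.
stale_removed = cache.on_failed_req(FailedReq(1, 7, "fe80::1", "fe80::7"))
```

After the second call the next lookup for that destination misses, which is
what would push the librdmacm to ask for a fresh path in step 9.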
