
Ingerson, Alexia Wed, 05 Mar 2025 09:11:39 -0800

Date: 03/04/2025
Participants:
Alexia Ingerson (Intel)
Jianxin Xiong (Intel)
Ben Lynam (Cornelis)
Charles Shereda (Cornelis)
Ian Ziemba (HPE)
Jerome Soumagne (HPE)
Juee Desai (Intel)
Ken Raffenetti (ANL)
Sai Sunku (AWS)
Stephen Oost (Intel)


Summary:
2.1.0 RC1 out, RC2 scheduled for 3/8, GA scheduled for 3/15. Mark any 
cherry-picks with the new "for-2.1.x" label.

RPC issues dealing with persistent server and transient clients - new client 
sees stale replies intended for old client. Two PRs target this issue. #10837 
updates documentation to try to decouple RDM from error cases (failure 
shouldn't close down RDM endpoint). #10792 adds a new tag format to essentially 
allow for ignoring unmatched messages for this case. There were mixed opinions 
about this solution as it seems it has a very limited (and maybe temporary) use 
case - don't want to include something too targeted in the API. Plan to look 
into tcp provider for provider specific implementation.
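
The stale-reply problem described above can be illustrated with a small toy
model. This is only a sketch under stated assumptions, not libfabric or
provider code: the Endpoint class, the drop_unmatched flag, and the tag values
are hypothetical names chosen for illustration. The one behavior taken from
the tagged-message interface is that bits set in the ignore mask are excluded
from the tag comparison.

```python
from collections import deque


def tags_match(msg_tag, recv_tag, ignore=0):
    """Compare two tags on the bits NOT covered by the ignore mask."""
    return (msg_tag & ~ignore) == (recv_tag & ~ignore)


class Endpoint:
    """Hypothetical toy receive side: posted receives plus an unexpected queue."""

    def __init__(self, drop_unmatched=False):
        self.posted = []           # (tag, ignore) pairs waiting for a message
        self.unexpected = deque()  # arrived messages with no matching receive
        self.delivered = []        # (tag, payload) pairs handed to the app
        self.dropped = []          # messages discarded by the drop policy
        self.drop_unmatched = drop_unmatched  # assumption: illustrative policy

    def post_recv(self, tag, ignore=0):
        # A newly posted receive first searches the unexpected queue.
        for msg in list(self.unexpected):
            if tags_match(msg[0], tag, ignore):
                self.unexpected.remove(msg)
                self.delivered.append(msg)
                return
        self.posted.append((tag, ignore))

    def arrive(self, tag, payload):
        # An arriving message searches the posted receives in order.
        for i, (recv_tag, ignore) in enumerate(self.posted):
            if tags_match(tag, recv_tag, ignore):
                del self.posted[i]
                self.delivered.append((tag, payload))
                return
        if self.drop_unmatched:
            self.dropped.append((tag, payload))
        else:
            self.unexpected.append((tag, payload))


if __name__ == "__main__":
    new_client = Endpoint()
    new_client.post_recv(tag=0x2001)                   # waits for its own reply
    new_client.arrive(0x1007, "reply for old client")  # stale, never matched
    new_client.arrive(0x2001, "reply for new client")
    print(len(new_client.delivered), len(new_client.unexpected))
```

With exact matching (ignore=0) the stale reply never finds a receive and stays
in the unexpected queue; with the hypothetical drop_unmatched policy it would
be discarded on arrival instead.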

Notes:
Release 2.1.0 update:

  *   RC1 out 3/1/2025
  *   New branch v2.1.x
  *   RC2 scheduled for 3/8/2025
     *   Psm3 update
     *   Bug fixes for other providers
     *   New label "for-2.1.x"
  *   GA scheduled for 3/15/2025
RPC issues

  *   Persistent server, transient clients
     *   Client should not bring down server
     *   New client sees stale replies intended for old client
        *   Tagged messages for replies
        *   Tag is specific for the reply
        *   Stale reply won't find match (stuck in unexpected queue)
  *   Q: there were some concerns about returning EAGAIN. Is that still a concern?
     *   That's more of a provider-specific detail. This PR (10837) is just to 
update documentation
     *   EAGAIN isn't appropriate because we shouldn't retry (client side is 
already down)
  *   Trying to decouple RDM endpoint from regular error cases - failure 
shouldn't close down RDM endpoint
  *   Also added new error type - unreachable EP, for when the client has died
  *   Other PR (10792) proposes new tag format to essentially allow ignoring 
unmatched messages
     *   Reason for specifying as tag format is more to define provider 
behavior
        *   Should we focus more on application behavior?
        *   IZ: agree we should focus on application behavior and up to 
provider to handle that
     *   IZ: What's missing from PR is that it uses mem tag format but never 
explains what that means
        *   Exact match vs TAG_BITS to use ignore bits
        *   Mercury uses 32 bits but we don't need to impose that for this 
definition. Don't use mask, match on entire tag
     *   Original ask was to drop unexpected messages but has turned into tag 
matching definitions. Are these related any more?
        *   We don't need it if we have a better way to handle it but not sure 
what that would look like
        *   Not really any other usage outside of this use case for tag format
     *   In efa, use timestamp to generate unique connection id so messages for 
old peers can be easily identified and dropped
        *   The issue seems to be within the provider - not being able to 
distinguish new connections
     *   Going to revisit so we don't introduce a new tag format for limited use 
case and limited time just for one provider - will look into tcp provider for 
provider specific implementation
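
The connection-id idea mentioned for efa can be sketched as follows. This is a
hypothetical illustration of the general technique (a per-incarnation id
derived from a timestamp, checked on receive), not the efa provider's actual
implementation. The names new_conn_id and Client and the address string are
assumptions made for the example.

```python
import itertools
import time

_counter = itertools.count()


def new_conn_id():
    """Hypothetical id: wall-clock nanoseconds plus a local counter.

    The counter keeps two ids distinct even if they are generated within
    the same clock tick.
    """
    return (time.time_ns(), next(_counter))


class Client:
    """Toy client incarnation bound to an address that may be reused."""

    def __init__(self, address):
        self.address = address
        self.conn_id = new_conn_id()  # fresh id for every incarnation
        self.inbox = []
        self.dropped = 0

    def receive(self, dest_conn_id, payload):
        # A message carries the id of the incarnation it was meant for;
        # anything addressed to an earlier incarnation is dropped.
        if dest_conn_id != self.conn_id:
            self.dropped += 1
            return False
        self.inbox.append(payload)
        return True


if __name__ == "__main__":
    old_client = Client("10.0.0.5:7000")
    stale_target = old_client.conn_id     # server still remembers this id
    new_client = Client("10.0.0.5:7000")  # same address, new incarnation
    print(new_client.receive(stale_target, "late reply"))
    print(new_client.receive(new_client.conn_id, "fresh reply"))
```

Because the check happens below the tag-matching layer in this sketch, no new
tag format is needed; the cost is carrying the id in every message.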

_______________________________________________
ofiwg mailing list
ofiwg@lists.openfabrics.org
https://lists.openfabrics.org/mailman/listinfo/ofiwg

[ofiwg] OFIWG notes 03/05/2025
