Alex,

On Wed, Sep 17, 2008 at 5:00 AM, Alex Naslednikov <[EMAIL PROTECTED]> wrote:
> I spoke with Evgeny, a Mellanox opensm owner.
> He says there was a similar attempt in Linux to avoid subnet manager
> communication, but that feature still has unresolved problems and is
> therefore disabled by default.

Are you referring to the SA cache?

> Also, according to Evgeny, the current problem that opensm is not scalable
> (starting from 64x8 MPI jobs) arises because we try to connect to opensm
> after the "PORT UP" event and not after the "IPoIB UP" event.

What is meant by "IPoIB UP"? Does this mean "operationally up"? What is
used to determine this?

I also don't understand what you mean by "connect opensm". Do you mean
query the SM/SA?

-- Hal

> Fab, can you modify your patch to let the user select between the old
> and the new solutions (i.e. with/without the "Avoid the SM" patch)?
>
> Thanks,
> XaleX
>
> ________________________________
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Alex Naslednikov
> Sent: Wednesday, September 17, 2008 11:47 AM
> To: [email protected]; Fab Tillier; Leonid Keller
> Subject: [ofw] [IPoIB] Problem with "Avoid the SM" patch
>
> Reposting this issue to the whole community.
>
> Current Problem:
> The "Avoid the SM" patch causes invalid cache cleaning in IPoIB. That is,
> after opensm is restarted, IPoIB communication stops working (including
> pings).
>
> Detailed Description:
> 1.1 After opensm is restarted, __endpt_mgr_reset_all() sets dlid==0 for the
> endpoints.
> Thus, an ARP should be sent in order to resume normal communication.
> 1.2 The ARP is indeed sent, and even received by the remote side.
> But (note this) we send the ARP request by broadcast, while the ARP response
> is always unicast, with an INVALID DLID.
> Thus, normal communication can't be resumed without an ARP response, and the
> ARP response can't be sent without a valid dlid.
>
> Proposed Solution:
> 2.1 When an ARP request is received and the dlid is equal to zero, delete
> this endpoint and recreate it.
> 2.2 In order to reinitialize the ARP table (and thus generate ARP requests),
> indicate link down/link up to NDIS.
>
> Checklist (we executed these checks on an 8-node cluster):
> 1. Run opensm and validate that ping works.
> 2. Kill opensm. Ping should still work.
> 3. Restart opensm on the same node. Ping should work.
> 4. Rerun #2.
> 5. Restart opensm on another node. Ping should work.
> 6. Run another instance of opensm, such that the previous instance
> switches to "standby mode". Ping should work.
> OR:
> 6A. Run another instance of opensm, such that the previous instance
> remains in "active mode".
> Then kill the active instance. The "standby" instance should enter active
> mode, and ping should keep running.
> Ping should work.
> 7. Run 2 different instances of opensm. During the run, clear the guid2lid
> file and kill the active instance.
> The passive instance will become active and ping should still work.
> 8. Change the guid2lid file (change lids only) and restart opensm. Ping
> should work.
> IMPORTANT! Validate here that the IPoIB addresses didn't change but the lids
> did, so that pings are sent to the right host.
>
>
> Fix to "Avoid the SM"
> signed-off by: Alexander Naslednikov (xalex at mellanox.co.il)
> ===================================================================
> --- ipoib_port.c (revision 3149)
> +++ ipoib_port.c (working copy)
> @@ -2357,6 +2357,11 @@
>          /* Out of date! Destroy the endpoint and replace it. */
>          __endpt_mgr_remove( p_port, *pp_src );
>          *pp_src = NULL;
> +    }
> +    else if ( ! ((*pp_src)->dlid)) {
> +        /* Out of date! Destroy the endpoint and replace it. */
> +        __endpt_mgr_remove( p_port, *pp_src );
> +        *pp_src = NULL;
>      }
>      else if( ipoib_is_voltaire_router_gid( &(*pp_src)->dgid ) )
>      {
> @@ -4153,10 +4158,25 @@
>      cl_qlist_init( &mc_list );
>
>      cl_obj_lock( &p_port->obj );
> +
>      /* Wait for all readers to complete. */
>      while( p_port->endpt_rdr )
>          ;
> +#if 0
> +    __endpt_mgr_remove_all(p_port);
> +#else
>
> +    NdisMIndicateStatus( p_port->p_adapter->h_adapter,
> +        NDIS_STATUS_MEDIA_DISCONNECT, NULL, 0 );
> +    NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
> +
> +    NdisMIndicateStatus( p_port->p_adapter->h_adapter,
> +        NDIS_STATUS_MEDIA_CONNECT, NULL, 0 );
> +    NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
> +
> +
>      if( p_port->p_local_endpt )
>      {
>          cl_qmap_remove_item( &p_port->endpt_mgr.mac_endpts,
>
>
>
> -----Original Message-----
> From: Alex Naslednikov
> Sent: Sunday, September 14, 2008 5:32 PM
> To: 'Fab Tillier'; Tzachi Dar; Reuven Amitai; Leonid Keller
> Cc: Ishai Rabinovitz
> Subject: RE: [ofw] Problem with "Avoid the SM" patch
>
> I'd just like to summarize everything we said before and to propose a
> temporary solution.
>
> 1. The Problem
> 1.1 After opensm is restarted, __endpt_mgr_reset_all() sets dlid==0 for the
> endpoints.
> Thus, an ARP should be sent in order to resume normal communication.
> 1.2 The ARP is indeed sent, and even received by the remote side.
> But (note this) we send the ARP request by broadcast, while the ARP response
> is always unicast, with an INVALID DLID.
> Thus, normal communication can't be resumed without an ARP response, and the
> ARP response can't be sent without a valid dlid.
>
> So, in order to resolve it, here is our proposal. It's a temporary solution
> only.
> Of course, it should be investigated on a large cluster.
>
> 2. The solution
> 2.1 When an ARP request is received and the dlid is equal to zero, delete
> this endpoint and recreate it.
> 2.2 In order to reinitialize the ARP table (and thus generate ARP requests),
> indicate link down/link up to NDIS.
>
>
> ===================================================================
> --- ipoib_port.c (revision 3149)
> +++ ipoib_port.c (working copy)
> @@ -2357,6 +2357,11 @@
>          /* Out of date! Destroy the endpoint and replace it. */
>          __endpt_mgr_remove( p_port, *pp_src );
>          *pp_src = NULL;
> +    }
> +    else if ( ! ((*pp_src)->dlid)) {
> +        /* Out of date! Destroy the endpoint and replace it. */
> +        __endpt_mgr_remove( p_port, *pp_src );
> +        *pp_src = NULL;
>      }
>      else if( ipoib_is_voltaire_router_gid( &(*pp_src)->dgid ) )
>      {
> @@ -4153,10 +4158,25 @@
>      cl_qlist_init( &mc_list );
>
>      cl_obj_lock( &p_port->obj );
> +
>      /* Wait for all readers to complete. */
>      while( p_port->endpt_rdr )
>          ;
> +#if 0
> +    __endpt_mgr_remove_all(p_port);
> +#else
>
> +    NdisMIndicateStatus( p_port->p_adapter->h_adapter,
> +        NDIS_STATUS_MEDIA_DISCONNECT, NULL, 0 );
> +    NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
> +
> +    NdisMIndicateStatus( p_port->p_adapter->h_adapter,
> +        NDIS_STATUS_MEDIA_CONNECT, NULL, 0 );
> +    NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
> +
> +    // IPOIB_PRINT( TRACE_LEVEL_INFORMATION, IPOIB_DBG_INIT,
> +    //     ("Link DOWN!\n") );
> +
>      if( p_port->p_local_endpt )
>      {
>          cl_qmap_remove_item( &p_port->endpt_mgr.mac_endpts,
>
>
> XaleX
> -----Original Message-----
> From: Fab Tillier [mailto:[EMAIL PROTECTED]
> Sent: Friday, September 12, 2008 6:00 PM
> To: Tzachi Dar; Alex Naslednikov; Reuven Amitai; Leonid Keller
> Cc: [email protected]
> Subject: RE: [ofw] Problem with "Avoid the SM" patch
>
>> Hi Fab,
>>
>> Here is some more information about the issue and one question.
>> There are currently two problems that we see. Both problems start
>> after we restart opensm.
>>
>> 1) After we restart opensm, ARP messages don't pass. The main reason we
>> have seen so far is that they are sent with the wrong addresses. Although
>> we haven't yet found exactly why that is, we will soon find that and
>> fix it.
>
> Is it just a problem with the ARP responses, or the requests too? The
> requests should be getting sent to the broadcast group, so they should
> work. The response is a unicast packet, so could be getting lost due to the
> dlid == 0 issue.
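
The dlid == 0 case discussed above can be sketched in isolation. This is only an illustration of the idea behind the `else if ( ! ((*pp_src)->dlid))` branch in the patch, not the driver code: `endpt_t`, `endpt_is_stale()` and `arp_drop_stale_src()` are hypothetical names, and clearing `in_use` merely stands in for whatever `__endpt_mgr_remove()` really does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified endpoint record (the real structure is larger). */
typedef struct endpt {
    uint16_t dlid;   /* destination LID; 0 means "unresolved" after a reset */
    bool     in_use; /* stand-in for membership in the endpoint manager */
} endpt_t;

/* An endpoint with dlid == 0 cannot be used to send a unicast ARP response. */
bool endpt_is_stale(const endpt_t *e)
{
    assert(e != NULL);
    return e->dlid == 0;
}

/*
 * On an incoming ARP request: if the cached sender endpoint is stale, drop it
 * and clear the caller's pointer so that the caller recreates the endpoint
 * from the ARP payload. Returns true if the endpoint was dropped.
 */
bool arp_drop_stale_src(endpt_t **pp_src)
{
    assert(pp_src != NULL);
    if (*pp_src != NULL && endpt_is_stale(*pp_src)) {
        (*pp_src)->in_use = false; /* assumption: models __endpt_mgr_remove() */
        *pp_src = NULL;
        return true;
    }
    return false;
}
```

With this shape, a request from a peer whose cached entry was reset takes the same "create a new endpoint" path as a request from a peer seen for the first time.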
>
>> 2) This is the more problematic issue: After we restart opensm,
>> __endpt_mgr_reset_all is called. As a result, our entire endpoint
>> cache is cleared. Please note that Windows is not aware of what
>> happened and therefore it doesn't generate ARPs but rather sends unicast
>> packets.
>> For these packets we don't have enough information in the endpoint and
>> therefore we can't send them correctly. In the past, for these packets
>> we used to do a query on the SM, but we don't want to do that anymore.
>> So my question is this: how do we want to solve this issue?
>> 1) Wait for the Windows ARP table to flush? Probably too long.
>> 2) Send queries to the SM? We wanted to avoid that.
>> 3) Don't clear the endpoints when opensm is restarted?
>> Seems that we might use old data.
>> 4) Send ARPs ourselves? Probably the best solution but requires
>> some more work.
>>
>> What do you think?
>
> I think the key here might be to keep *some* of the SM interaction -
> effectively put a path record cache in IPoIB. If we kept the existing path
> record query logic in IPoIB the issues with SM restart go away. We would
> then need to change how the MAC_TO_PATH IOCTL behaved, allowing requests to
> be queued and completed asynchronously. The IOCTL handler would look up the
> endpoint, and if no path was resolved would issue the path query if it
> wasn't in progress already. This would require queueing the IRPs and
> tracking them so that a path query completion would complete any pending
> IRPs.
>
> Probably the simplest way to handle this would be to queue the IRPs in the
> IBAT layer when they come in, and then try to flush as many IRPs from the
> queue (look to see if the endpoints have valid paths). Any endpoint that
> needs a path would have a query issued, and a path query completion would
> again try to flush as many IRPs from the IBAT queue as possible.
>
> The main advantage of this is that real path records would be used for
> unicast traffic as well as IBAT clients, so that the packet rate, MTU, and
> so forth are set optimally, but the cache would be updated whenever an ARP
> response is received, remaining in sync with the network stack.
>
> I hope the SM would not have a problem with path queries like this - the
> query load would grow as the square of number of nodes, rather than the
> square of the number of cores.
>
> -Fab
>
> _______________________________________________
> ofw mailing list
> [email protected]
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw
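
For reference, the queue-and-flush scheme Fab describes for the MAC_TO_PATH IOCTL might look roughly like the user-space sketch below. Everything in it is an assumption made for illustration: `request_t` stands in for an IRP, `ibat_queue_t`, `ibat_submit()`, `ibat_flush()` and `ibat_path_query_done()` are hypothetical names, and the fixed-size arrays replace the driver's real lists. Locking, IRP cancellation and the actual SA path query are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ENDPTS  8
#define MAX_PENDING 16

/* Hypothetical stand-in for an IPoIB endpoint's path state. */
typedef struct {
    bool path_valid;      /* a path record has been resolved */
    bool query_in_flight; /* a path query was issued and has not completed */
} endpt_t;

/* Hypothetical stand-in for a pending MAC_TO_PATH IRP. */
typedef struct {
    int  endpt_idx; /* endpoint this request is waiting on */
    bool completed;
} request_t;

typedef struct {
    endpt_t    endpts[MAX_ENDPTS];
    request_t *pending[MAX_PENDING];
    size_t     n_pending;
    int        queries_issued; /* counts simulated path queries */
} ibat_queue_t;

/*
 * Complete every queued request whose endpoint has a valid path. For each
 * endpoint that still needs one, issue a query unless one is already in
 * progress. Returns the number of requests completed.
 */
size_t ibat_flush(ibat_queue_t *q)
{
    size_t kept = 0, done = 0;

    assert(q != NULL);
    for (size_t i = 0; i < q->n_pending; i++) {
        request_t *r = q->pending[i];
        endpt_t *e = &q->endpts[r->endpt_idx];

        if (e->path_valid) {
            r->completed = true;
            done++;
        } else {
            if (!e->query_in_flight) {
                e->query_in_flight = true;
                q->queries_issued++;
            }
            q->pending[kept++] = r; /* keep waiting, preserving order */
        }
    }
    q->n_pending = kept;
    return done;
}

/* Queue an incoming request, then flush. Returns false if the queue is full. */
bool ibat_submit(ibat_queue_t *q, request_t *r)
{
    assert(q != NULL && r != NULL);
    assert(r->endpt_idx >= 0 && r->endpt_idx < MAX_ENDPTS);
    if (q->n_pending == MAX_PENDING)
        return false;
    r->completed = false;
    q->pending[q->n_pending++] = r;
    ibat_flush(q);
    return true;
}

/* Path query completion: record the path, then flush the queue again. */
size_t ibat_path_query_done(ibat_queue_t *q, int endpt_idx)
{
    assert(q != NULL && endpt_idx >= 0 && endpt_idx < MAX_ENDPTS);
    q->endpts[endpt_idx].path_valid = true;
    q->endpts[endpt_idx].query_in_flight = false;
    return ibat_flush(q);
}
```

The point of the `query_in_flight` flag is that several requests waiting on the same endpoint trigger a single path query, which is what keeps the SM load proportional to distinct endpoints rather than to callers.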
