Hal,

Below are the answers.

Q1. Are you referring to the SA cache?
Yes, definitely.

Q2. I also don't understand what you mean by "connect opensm". Do you mean
query the SM/SA?
Yes.

Q3. What is meant by "IPoIB UP"? Does this mean "operationally up"? What is
used to determine this?
Currently, the ipoib_port_up() function immediately starts sending SA queries
to a broadcast group. Evgeny meant adding some delay here to allow opensm to
start. As I understand it, Fab checked this (with 10 retries and a 1-second
timeout) and found that it didn't help. Another thing to try would be
increasing the timeout to 5 seconds and sending fewer retries.

XaleX

-----Original Message-----
From: Hal Rosenstock [mailto:[EMAIL PROTECTED]
Sent: Thursday, September 18, 2008 3:20 PM
To: Alex Naslednikov
Cc: [email protected]; Fab Tillier; Leonid Keller
Subject: Re: [ofw] [IPoIB] Problem with "Avoid the SM" patch

Alex,

On Wed, Sep 17, 2008 at 5:00 AM, Alex Naslednikov <[EMAIL PROTECTED]> wrote:
> I spoke with Evgeny, a Mellanox opensm owner.
> He claims that there was a similar attempt in Linux to avoid the subnet
> manager communication, but currently this feature still has unresolved
> problems and is therefore disabled by default.

Are you referring to the SA cache?

> Also, according to Evgeny, the current problem that opensm is not
> scalable (starting from 64x8 MPI jobs) is because we try to connect
> opensm after the "PORT UP" event and not after the "IPoIB UP" event.

What is meant by "IPoIB UP"? Does this mean "operationally up"? What
is used to determine this?

I also don't understand what you mean by "connect opensm". Do you mean
query the SM/SA?

-- Hal

> Fab, can you modify your patch in order to allow the user to select
> between the old and the new solutions? (i.e. with/without the
> "avoid the sm" patch)
>
> Thanks,
> XaleX
>
> ________________________________
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Alex Naslednikov
> Sent: Wednesday, September 17, 2008 11:47 AM
> To: [email protected]; Fab Tillier; Leonid Keller
> Subject: [ofw] [IPoIB] Problem with "Avoid the SM" patch
>
> Reposting this issue to the whole community.
>
> Current Problem:
> The "Avoid the SM" patch caused invalid cache cleaning in IPoIB. That
> is, when restarting an opensm, IPoIB communication stops working
> (including pings).
>
> Detailed Description:
> 1.1 After restarting an opensm, __endpt_mgr_reset_all() sets dlid==0
> for the endpoints.
> Thus, ARP should be sent in order to resume normal communication.
> 1.2 ARP was indeed sent, and even received by the remote side.
> But (pay attention), we send the ARP request by broadcast, while the ARP
> response is always unicast with an INVALID DLID.
> Thus, normal communication can't be resumed without an ARP response, and
> an ARP response can't be sent without a valid dlid.
>
> Proposed Solution:
> 2.1 When receiving an ARP request and the endpoint's dlid is equal to
> zero, delete this endpoint and recreate it.
> 2.2 In order to initialize the ARP table (and thus generate ARP requests),
> indicate link down/link up to NDIS.
>
> Checklist (we executed these checks on an 8-node cluster):
> 1. Run opensm and validate that ping works.
> 2. Kill opensm. Ping should still work.
> 3. Restart opensm on the same node. Ping should work.
> 4. Rerun #2.
> 5. Restart opensm on another node. Ping should work.
> 6. Run another instance of opensm, such that the previous instance
> will switch to "standby mode". Ping should work.
> OR:
> 6A. Run another instance of opensm, such that the previous instance
> will remain in "active mode".
> Then kill the active instance. The "standby" instance should enter
> active mode, and ping should continue.
> Ping should work.
> 7. Run 2 different instances of opensm. During the run, clear the
> guid2lid file and kill the active instance.
> The passive instance will become active and ping should still work.
> 8. Change the guid2lid file (change lids only) and restart opensm. Ping
> should work. IMPORTANT! Validate here that the IPoIB addresses didn't
> change, but the lids did, so that pings are sent to the right host.
>
>
> Fix to "Avoid the SM"
> signed-off by: Alexander Naslednikov (xalex at mellanox.co.il)
> ===================================================================
> --- ipoib_port.c (revision 3149)
> +++ ipoib_port.c (working copy)
> @@ -2357,6 +2357,11 @@
>                  /* Out of date! Destroy the endpoint and replace it. */
>                  __endpt_mgr_remove( p_port, *pp_src );
>                  *pp_src = NULL;
> +        }
> +        else if ( ! ((*pp_src)->dlid)) {
> +                /* Out of date! Destroy the endpoint and replace it. */
> +                __endpt_mgr_remove( p_port, *pp_src );
> +                *pp_src = NULL;
>          }
>          else if( ipoib_is_voltaire_router_gid( &(*pp_src)->dgid ) )
>          {
> @@ -4153,10 +4158,25 @@
>          cl_qlist_init( &mc_list );
>
>          cl_obj_lock( &p_port->obj );
> +
>          /* Wait for all readers to complete. */
>          while( p_port->endpt_rdr )
>                  ;
> +#if 0
> +        __endpt_mgr_remove_all(p_port);
> +#else
>
> +        NdisMIndicateStatus( p_port->p_adapter->h_adapter,
> +                NDIS_STATUS_MEDIA_DISCONNECT, NULL, 0 );
> +        NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
> +
> +        NdisMIndicateStatus( p_port->p_adapter->h_adapter,
> +                NDIS_STATUS_MEDIA_CONNECT, NULL, 0 );
> +        NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
> +
> +
>          if( p_port->p_local_endpt )
>          {
>                  cl_qmap_remove_item( &p_port->endpt_mgr.mac_endpts,
>
>
>
> -----Original Message-----
> From: Alex Naslednikov
> Sent: Sunday, September 14, 2008 5:32 PM
> To: 'Fab Tillier'; Tzachi Dar; Reuven Amitai; Leonid Keller
> Cc: Ishai Rabinovitz
> Subject: RE: [ofw] Problem with "Avoid the SM" patch
>
> I'd just like to summarize all we said before and to propose a
> temporary solution.
>
> 1. The Problem
> 1.1 After restarting an opensm, __endpt_mgr_reset_all() sets dlid==0
> for the endpoints.
> Thus, ARP should be sent in order to resume normal communication.
> 1.2 ARP was indeed sent, and even received by the remote side.
> But (pay attention), we send the ARP request by broadcast, while the ARP
> response is always unicast with an INVALID DLID.
> Thus, normal communication can't be resumed without an ARP response, and
> an ARP response can't be sent without a valid dlid.
>
> So, in order to resolve it, here is our proposal. It's a temporary
> solution only.
> Of course, it should be investigated on a large cluster.
>
> 2. The solution
> 2.1 When receiving an ARP request and the endpoint's dlid is equal to
> zero, delete this endpoint and recreate it.
> 2.2 In order to initialize the ARP table (and thus generate ARP requests),
> indicate link down/link up to NDIS.
>
>
> ===================================================================
> --- ipoib_port.c (revision 3149)
> +++ ipoib_port.c (working copy)
> @@ -2357,6 +2357,11 @@
>                  /* Out of date! Destroy the endpoint and replace it. */
>                  __endpt_mgr_remove( p_port, *pp_src );
>                  *pp_src = NULL;
> +        }
> +        else if ( ! ((*pp_src)->dlid)) {
> +                /* Out of date! Destroy the endpoint and replace it. */
> +                __endpt_mgr_remove( p_port, *pp_src );
> +                *pp_src = NULL;
>          }
>          else if( ipoib_is_voltaire_router_gid( &(*pp_src)->dgid ) )
>          {
> @@ -4153,10 +4158,25 @@
>          cl_qlist_init( &mc_list );
>
>          cl_obj_lock( &p_port->obj );
> +
>          /* Wait for all readers to complete. */
>          while( p_port->endpt_rdr )
>                  ;
> +#if 0
> +        __endpt_mgr_remove_all(p_port);
> +#else
>
> +        NdisMIndicateStatus( p_port->p_adapter->h_adapter,
> +                NDIS_STATUS_MEDIA_DISCONNECT, NULL, 0 );
> +        NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
> +
> +        NdisMIndicateStatus( p_port->p_adapter->h_adapter,
> +                NDIS_STATUS_MEDIA_CONNECT, NULL, 0 );
> +        NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
> +
> +        // IPOIB_PRINT( TRACE_LEVEL_INFORMATION, IPOIB_DBG_INIT,
> +        //         ("Link DOWN!\n") );
> +
>          if( p_port->p_local_endpt )
>          {
>                  cl_qmap_remove_item( &p_port->endpt_mgr.mac_endpts,
>
>
> XaleX
> -----Original Message-----
> From: Fab Tillier [mailto:[EMAIL PROTECTED]
> Sent: Friday, September 12, 2008 6:00 PM
> To: Tzachi Dar; Alex Naslednikov; Reuven Amitai; Leonid Keller
> Cc: [email protected]
> Subject: RE: [ofw] Problem with "Avoid the SM" patch
>
>> Hi Fab,
>>
>> Here is some more information about the issue and one question.
>> There are currently two problems that we see. Both problems start
>> after we restart opensm.
>>
>> 1) After we restart opensm, ARP messages don't pass. The main reason
>> we saw so far is that they are sent with the wrong addresses.
>> Although we still haven't found exactly why that is, we will soon
>> find that and fix it.
>
> Is it just a problem with the ARP responses, or the requests too? The
> requests should be getting sent to the broadcast group, so they should
> work. The response is a unicast packet, so could be getting lost due
> to the dlid == 0 issue.
>
>> 2) This is the more problematic issue: After we restart opensm,
>> __endpt_mgr_reset_all is called. As a result our entire endpoint
>> cache is cleared. Please note that Windows is not aware of what
>> happened and therefore it doesn't generate ARPs but rather sends
>> unicast packets.
>> For these packets we don't have enough information in the endpoint
>> and therefore we can't send them correctly.
>> In the past, for these packets we used to do a query on the SM, but we
>> don't want to do that anymore.
>> So my question is this, how do we want to solve this issue:
>> 1) Wait for the Windows ARP table to flush? Probably too long.
>> 2) Send queries to the SM? We wanted to avoid that.
>> 3) Don't clear the endpoints when opensm is being restarted?
>> Seems that we might use old data.
>> 4) Send ARPs by ourselves? Probably the best solution but requires
>> some more work.
>>
>> What do you think?
>
> I think the key here might be to keep *some* of the SM interaction -
> effectively put a path record cache in IPoIB. If we kept the existing
> path record query logic in IPoIB, the issues with SM restart go away.
> We would then need to change how the MAC_TO_PATH IOCTL behaved,
> allowing requests to be queued and completed asynchronously. The
> IOCTL handler would look up the endpoint, and if no path was resolved
> would issue the path query if it wasn't in progress already. This
> would require queueing the IRPs and tracking them so that a path query
> completion would complete any pending IRPs.
>
> Probably the simplest way to handle this would be to queue the IRPs in
> the IBAT layer when they come in, and then try to flush as many IRPs
> from the queue as possible (look to see if the endpoints have valid
> paths). Any endpoint that needs a path would have a query issued, and a
> path query completion would again try to flush as many IRPs from the
> IBAT queue as possible.
>
> The main advantage of this is that real path records would be used
> for unicast traffic as well as IBAT clients, so that the packet rate,
> MTU, and so forth are set optimally, but the cache would be updated
> whenever an ARP response is received, remaining in sync with the
> network stack.
>
> I hope the SM would not have a problem with path queries like this -
> the query load would grow as the square of the number of nodes, rather
> than the square of the number of cores.
>
> -Fab
>
> _______________________________________________
> ofw mailing list
> [email protected]
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw
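
For reference, here is a rough user-space sketch of the queue-and-flush
scheme Fab describes in the quoted message: queue each MAC_TO_PATH request,
complete the ones whose endpoint already has a valid path, issue at most one
path query per unresolved endpoint, and flush again when a query completes.
All names here (ibat_model_t, request_t, ibat_submit, ibat_flush,
ibat_path_query_done) are hypothetical, plain structs stand in for IRPs, and
there is no locking. It is a model of the idea only, not the IPoIB/IBAT
driver code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ENDPTS  8
#define MAX_PENDING 16

/* Hypothetical endpoint: dlid == 0 means "no path resolved yet". */
typedef struct {
    uint16_t dlid;
    bool     query_in_progress;
} endpt_t;

/* Hypothetical stand-in for a queued MAC_TO_PATH request (an IRP). */
typedef struct {
    int      endpt_idx;
    bool     completed;
    uint16_t result_dlid;
} request_t;

typedef struct {
    endpt_t    endpts[MAX_ENDPTS];
    request_t *pending[MAX_PENDING];
    size_t     n_pending;
    int        queries_issued;   /* counts simulated path queries */
} ibat_model_t;

/* Complete every queued request whose endpoint has a valid path; for the
 * rest, start a path query unless one is already in flight. */
static void ibat_flush(ibat_model_t *m)
{
    size_t kept = 0;
    for (size_t i = 0; i < m->n_pending; i++) {
        request_t *r = m->pending[i];
        endpt_t   *e = &m->endpts[r->endpt_idx];
        if (e->dlid != 0) {
            r->result_dlid = e->dlid;
            r->completed = true;
            continue;                     /* drop from the queue */
        }
        if (!e->query_in_progress) {
            e->query_in_progress = true;  /* simulated path query */
            m->queries_issued++;
        }
        m->pending[kept++] = r;           /* still waiting */
    }
    m->n_pending = kept;
}

/* Queue a request, then try to flush. Returns false if it was rejected. */
static bool ibat_submit(ibat_model_t *m, request_t *r)
{
    if (r->endpt_idx < 0 || r->endpt_idx >= MAX_ENDPTS ||
        m->n_pending == MAX_PENDING)
        return false;
    r->completed = false;
    r->result_dlid = 0;
    m->pending[m->n_pending++] = r;
    ibat_flush(m);
    return true;
}

/* Path query completion: record the dlid and flush the queue again. */
static void ibat_path_query_done(ibat_model_t *m, int endpt_idx, uint16_t dlid)
{
    m->endpts[endpt_idx].dlid = dlid;
    m->endpts[endpt_idx].query_in_progress = false;
    ibat_flush(m);
}
```

In this model, two requests submitted for the same unresolved endpoint
trigger a single simulated query, and both complete once
ibat_path_query_done() supplies a non-zero dlid.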
