Alex,

On Thu, Sep 18, 2008 at 11:29 AM, Alex Naslednikov <[EMAIL PROTECTED]> wrote:
> Hal,
> Below are the answers.
>
> Q1. Are you referring to the SA cache ?
> Yes, definitely.
>
> Q2. I also don't understand what you mean by "connect opensm". Do you
> mean query the SM/SA ?
> Yes.
>
> Q3. What is meant by "IPoIB UP" ? Does this mean "operationally up" ?
> What is used to determine this ?
> Currently, ipoib_port_up() function immediately starts sending SA
> queries to a broadcast group.

I don't understand what you mean by this. SA queries do not go on the
broadcast group; ARPs might. At this point, has the broadcast group
been successfully joined ?

> Evgeny meant here some additional delay to allow opensm to start.

I think you mean respond (and it's any SM)...

> I understood that Fab checked this issue (by 10 retries of 1 second TO)
> and found that it didn't help there.
> Yet another try can be enlarging the TO to be 5 sec and sending less
> retries

I think some exponential backoff strategy with some randomization
might be better.

-- Hal

> XaleX
>
>
> -----Original Message-----
> From: Hal Rosenstock [mailto:[EMAIL PROTECTED]
> Sent: Thursday, September 18, 2008 3:20 PM
> To: Alex Naslednikov
> Cc: [email protected]; Fab Tillier; Leonid Keller
> Subject: Re: [ofw] [IPoIB] Problem with "Avoid the SM" patch
>
> Alex,
>
> On Wed, Sep 17, 2008 at 5:00 AM, Alex Naslednikov <[EMAIL PROTECTED]>
> wrote:
>> I spoke with Evgeny, a Mellanox opensm owner.
>> He claims that there were similar try in Linux to avoid the subnet
>> manager communication, but currently this feature still has unresolved
>> problems and, therefore, disabled by default.
>
> Are you referring to the SA cache ?
>
>> Also, according to Evgeny, the current problem that opensm is not
>> scalable (starting from 64x8 MPI jobs) is because we try to connect
>> opensm after "PORT UP" event and not after "IPoIB UP" event.
>
> What is meant by "IPoIB UP" ? Does this mean "operationally up" ? What
> is used to determine this ?
>
> I also don't understand what you mean by "connect opensm". Do you mean
> query the SM/SA ?
>
> -- Hal
>
>> Fab, can you modify your patch in order to allow user select between
>> the old and the new solutions ? (i.e.
>> with/without "avoid the sm" patch)
>>
>> Thanks,
>> XaleX
>>
>> ________________________________
>> From: [EMAIL PROTECTED]
>> [mailto:[EMAIL PROTECTED] On Behalf Of Alex Naslednikov
>> Sent: Wednesday, September 17, 2008 11:47 AM
>> To: [email protected]; Fab Tillier; Leonid Keller
>> Subject: [ofw] [IPoIB] Problem with "Avoid the SM" patch
>>
>> Reposting this issue to the whole community.
>> Current Problem:
>> The "Avoid the SM" patch caused invalid cache cleaning in IPoIB. That
>> is, when restarting opensm, IPoIB communication stops working
>> (including pings).
>> Detailed Description:
>> 1.1 After restarting opensm, __endpt_mgr_reset_all() sets dlid==0
>> for the endpoints.
>> Thus, an ARP should be sent in order to resume normal communication.
>> 1.2 The ARP was indeed sent, and even received by the remote side.
>> But (pay attention), we send the ARP request by broadcast, while the
>> ARP response is always unicast, with an INVALID DLID.
>> Thus, normal communication can't be resumed without the ARP response,
>> and the ARP response can't be sent without a valid dlid.
>>
>> Proposed Solution:
>> 2.1 When receiving an ARP request, if dlid is equal to zero, delete
>> this endpoint and recreate it.
>> 2.2 In order to initialize the ARP table (and thus generate ARP
>> requests), notify NDIS of link down/link up.
>>
>> Checklist (we executed these checks on an 8-node cluster)
>> 1. Run opensm and validate that ping works
>> 2. Kill opensm. Ping still should work
>> 3. Restart opensm on the same node. Ping should work
>> 4. Rerun #2
>> 5. Restart opensm on another node. Ping should work
>> 6. Run another instance of opensm, such that the previous instance
>>    will switch to "standby mode". Ping should work
>> OR:
>> 6A Run another instance of opensm, such that the previous instance
>>    will remain in "active mode".
>>    Then kill the active instance. The "standby" instance should enter
>>    active mode, and ping should keep working.
>> 7. Run 2 different instances of opensm.
>>    During the run, clear the guid2lid file and kill the active instance.
>>    The passive instance will become active and ping should still work
>> 8. Change the guid2lid file (change lids only) and restart opensm. Ping
>>    should work. IMPORTANT! Validate here that the IPoIB addresses didn't
>>    change, but the lids did, so that pings will be sent to the right host
>>
>>
>> Fix to "Avoid the SM"
>> signed-off by: Alexander Naslednikov (xalex at mellanox.co.il)
>> ===================================================================
>> --- ipoib_port.c (revision 3149)
>> +++ ipoib_port.c (working copy)
>> @@ -2357,6 +2357,11 @@
>>              /* Out of date! Destroy the endpoint and replace it. */
>>              __endpt_mgr_remove( p_port, *pp_src );
>>              *pp_src = NULL;
>> +        }
>> +        else if ( ! ((*pp_src)->dlid)) {
>> +            /* Out of date! Destroy the endpoint and replace it. */
>> +            __endpt_mgr_remove( p_port, *pp_src );
>> +            *pp_src = NULL;
>>          }
>>          else if( ipoib_is_voltaire_router_gid( &(*pp_src)->dgid ) )
>>          {
>> @@ -4153,10 +4158,25 @@
>>      cl_qlist_init( &mc_list );
>>
>>      cl_obj_lock( &p_port->obj );
>> +
>>      /* Wait for all readers to complete. */
>>      while( p_port->endpt_rdr )
>>          ;
>> +#if 0
>> +    __endpt_mgr_remove_all(p_port);
>> +#else
>>
>> +    NdisMIndicateStatus( p_port->p_adapter->h_adapter,
>> +        NDIS_STATUS_MEDIA_DISCONNECT,
>> +        NULL, 0 );
>> +    NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
>> +
>> +    NdisMIndicateStatus( p_port->p_adapter->h_adapter,
>> +        NDIS_STATUS_MEDIA_CONNECT,
>> +        NULL, 0 );
>> +    NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
>> +
>> +
>>      if( p_port->p_local_endpt )
>>      {
>>          cl_qmap_remove_item( &p_port->endpt_mgr.mac_endpts,
>>
>>
>>
>> -----Original Message-----
>> From: Alex Naslednikov
>> Sent: Sunday, September 14, 2008 5:32 PM
>> To: 'Fab Tillier'; Tzachi Dar; Reuven Amitai; Leonid Keller
>> Cc: Ishai Rabinovitz
>> Subject: RE: [ofw] Problem with "Avoid the SM" patch
>>
>> I'd like just to summarize all we said before and to propose a
>> temporary solution.
>>
>> 1. The Problem
>> 1.1 After restarting opensm, __endpt_mgr_reset_all() sets dlid==0
>> for the endpoints.
>> Thus, an ARP should be sent in order to resume normal communication.
>> 1.2 The ARP was indeed sent, and even received by the remote side.
>> But (pay attention), we send the ARP request by broadcast, while the
>> ARP response is always unicast, with an INVALID DLID.
>> Thus, normal communication can't be resumed without the ARP response,
>> and the ARP response can't be sent without a valid dlid.
>>
>> So, in order to resolve it, here is our proposal. It's a temporary
>> solution only.
>> Of course, it should be investigated on a large cluster.
>>
>> 2. The solution
>> 2.1 When receiving an ARP request, if dlid is equal to zero, delete
>> this endpoint and recreate it.
>> 2.2 In order to initialize the ARP table (and thus generate ARP
>> requests), notify NDIS of link down/link up.
>>
>>
>> ===================================================================
>> --- ipoib_port.c (revision 3149)
>> +++ ipoib_port.c (working copy)
>> @@ -2357,6 +2357,11 @@
>>              /* Out of date! Destroy the endpoint and replace it. */
>>              __endpt_mgr_remove( p_port, *pp_src );
>>              *pp_src = NULL;
>> +        }
>> +        else if ( ! ((*pp_src)->dlid)) {
>> +            /* Out of date! Destroy the endpoint and replace it. */
>> +            __endpt_mgr_remove( p_port, *pp_src );
>> +            *pp_src = NULL;
>>          }
>>          else if( ipoib_is_voltaire_router_gid( &(*pp_src)->dgid ) )
>>          {
>> @@ -4153,10 +4158,25 @@
>>      cl_qlist_init( &mc_list );
>>
>>      cl_obj_lock( &p_port->obj );
>> +
>>      /* Wait for all readers to complete. */
>>      while( p_port->endpt_rdr )
>>          ;
>> +#if 0
>> +    __endpt_mgr_remove_all(p_port);
>> +#else
>>
>> +    NdisMIndicateStatus( p_port->p_adapter->h_adapter,
>> +        NDIS_STATUS_MEDIA_DISCONNECT,
>> +        NULL, 0 );
>> +    NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
>> +
>> +    NdisMIndicateStatus( p_port->p_adapter->h_adapter,
>> +        NDIS_STATUS_MEDIA_CONNECT,
>> +        NULL, 0 );
>> +    NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
>> +
>> +    // IPOIB_PRINT( TRACE_LEVEL_INFORMATION, IPOIB_DBG_INIT,
>> +    // ("Link DOWN!\n") );
>> +
>>      if( p_port->p_local_endpt )
>>      {
>>          cl_qmap_remove_item( &p_port->endpt_mgr.mac_endpts,
>>
>>
>> XaleX
>> -----Original Message-----
>> From: Fab Tillier [mailto:[EMAIL PROTECTED]
>> Sent: Friday, September 12, 2008 6:00 PM
>> To: Tzachi Dar; Alex Naslednikov; Reuven Amitai; Leonid Keller
>> Cc: [email protected]
>> Subject: RE: [ofw] Problem with "Avoid the SM" patch
>>
>>> Hi Fab,
>>>
>>> Here is some more information about the issue and one question.
>>> There are currently two problems that we see. Both problems start
>>> after we restart opensm.
>>>
>>> 1) After we restart opensm arp messages don't pass. The main reason
>>> we saw so far is that they are sent with the wrong addresses.
>>> Although we still haven't found exactly why that is, we will soon
>>> find that and fix it.
>>
>> Is it just a problem with the ARP responses, or the requests too?
>> The requests should be getting sent to the broadcast group, so they
>> should work. The response is a unicast packet, so could be getting
>> lost due to the dlid == 0 issue.
>>
>>> 2) This is the more problematic issue: After we restart opensm
>>> __endpt_mgr_reset_all is being called. As a result all our endpoint
>>> cache is cleared. Please note that windows is not aware of what
>>> happened and therefore it doesn't generate arps but rather sends
>>> unicast packets.
>>> For these packets we don't have enough information in the endpoint
>>> and therefore we can't send them correctly. In the past for these
>>> packets we used to do a query on the SM, but we don't want to do that
>>> anymore.
>>> So my question is this, how do we want to solve this issue:
>>> 1) Wait for the windows arp table to flush? Probably too long.
>>> 2) Send queries to the SM? We wanted to avoid that.
>>> 3) Don't clear the endpoints when opensm is being restarted?
>>> Seems that we might use old data.
>>> 4) Send arps by ourselves? Probably the best solution but requires
>>> some more work.
>>>
>>> What do you think?
>>
>> I think the key here might be to keep *some* of the SM interaction -
>> effectively put a path record cache in IPoIB. If we kept the existing
>> path record query logic in IPoIB the issues with SM restart go away.
>> We would then need to change how the MAC_TO_PATH IOCTL behaved,
>> allowing requests to be queued and completed asynchronously. The
>> IOCTL handler would look up the endpoint, and if no path was resolved
>> would issue the path query if it wasn't in progress already. This
>> would require queueing the IRPs and tracking them so that a path query
>> completion would complete any pending IRPs.
>>
>> Probably the simplest way to handle this would be to queue the IRPs in
>> the IBAT layer when they come in, and then try to flush as many IRPs
>> from the queue (look to see if the endpoints have valid paths).
>> Any endpoint that needs a path would have a query issued, and a path
>> query completion would again try to flush as many IRPs from the IBAT
>> queue as possible.
>>
>> The main advantage of this is that real path records would be used
>> for unicast traffic as well as IBAT clients, so that the packet rate,
>> MTU, and so forth are set optimally, but the cache would be updated
>> whenever an ARP response is received, remaining in sync with the
>> network stack.
>>
>> I hope the SM would not have a problem with path queries like this -
>> the query load would grow as the square of number of nodes, rather
>> than the square of the number of cores.
>>
>> -Fab
>>
>> _______________________________________________
>> ofw mailing list
>> [email protected]
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw
>>
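
As an illustration of the queue-and-flush scheme Fab describes for the MAC_TO_PATH IOCTL, here is a minimal user-space sketch. It is not IPoIB or IBAT code: struct endpt, struct pending_queue, queue_request(), flush_pending() and on_path_query_done() are hypothetical names, a fixed array stands in for the IRP queue, and counters stand in for completing an IRP and for sending a path query.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 16

/* Hypothetical stand-in for an IPoIB endpoint. */
struct endpt {
    bool path_valid;        /* a path record has been resolved */
    bool query_in_progress; /* a path query is already outstanding */
};

/* Hypothetical stand-in for the queue of pending MAC_TO_PATH requests. */
struct pending_queue {
    struct endpt *req[MAX_PENDING]; /* endpoint each queued request waits on */
    size_t count;
    unsigned completed;      /* requests completed so far */
    unsigned queries_issued; /* path queries sent so far */
};

/* Queue a request; returns false when the queue is full. */
static bool queue_request(struct pending_queue *q, struct endpt *ep)
{
    assert(q != NULL && ep != NULL);
    if (q->count == MAX_PENDING)
        return false;
    q->req[q->count++] = ep;
    return true;
}

/* Complete every request whose endpoint has a path. For the others,
 * issue a path query unless one is already in progress, and keep the
 * request queued. */
static void flush_pending(struct pending_queue *q)
{
    size_t kept = 0;
    for (size_t i = 0; i < q->count; i++) {
        struct endpt *ep = q->req[i];
        if (ep->path_valid) {
            q->completed++;          /* real code would complete the IRP */
        } else {
            if (!ep->query_in_progress) {
                ep->query_in_progress = true;
                q->queries_issued++; /* real code would send the query */
            }
            q->req[kept++] = ep;
        }
    }
    q->count = kept;
}

/* Path query completion: record the path, then flush again. */
static void on_path_query_done(struct pending_queue *q, struct endpt *ep)
{
    ep->query_in_progress = false;
    ep->path_valid = true;
    flush_pending(q);
}
```

In this sketch two requests for the same unresolved endpoint cause only one query, and both are completed when that query finishes. A real driver would also need locking, cancellation, and handling of failed queries, none of which is modeled here.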

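
Hal's suggestion at the top of the thread, exponential backoff with some randomization instead of a fixed number of retries with a fixed timeout, could look roughly like the sketch below. This is only an illustration under stated assumptions: backoff_delay_ms() is a hypothetical helper, not an existing IPoIB or complib function, and the base delay, cap, and "upper half of the window" jitter are choices made for the example, not values proposed in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: delay in milliseconds before retry number
 * `attempt` (0-based). The window doubles on every attempt, starting
 * at base_ms and saturating at cap_ms. The returned delay is taken
 * from [window/2, window] using the caller-supplied random value, so
 * ports that came up at the same moment do not all retry together. */
static uint32_t backoff_delay_ms(unsigned attempt, uint32_t base_ms,
                                 uint32_t cap_ms, uint32_t rnd)
{
    uint32_t window = base_ms;
    uint32_t half;

    assert(base_ms > 0 && base_ms <= cap_ms);

    /* Double the window once per attempt without overflowing. */
    while (attempt-- > 0 && window < cap_ms)
        window = (window > cap_ms / 2) ? cap_ms : window * 2;

    half = window / 2;
    return half + rnd % (window - half + 1);
}
```

A caller would pass a fresh random number on each retry, for example backoff_delay_ms(attempt, 1000, 30000, (uint32_t)rand()) in user space, and wait that long before re-sending the query. Taking the random value as a parameter keeps the helper independent of whichever random source the driver environment provides.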