Hi Jack,

Thanks for your reply. The HCA I'm using is memory-free: the chip is an MT25204 and the HCA type is Arbel, so it does not go through the "if (ah->type == MTHCA_AH_ON_HCA)" part of the code. By checking the debug output, I got more details about this problem:

The SW2HW_MPT command is issued while the UDAV table is being created. While the driver is waiting for the command to complete, it does several other things: it creates a send MAD packet, posts the send MAD request to the SQ, and posts another receive MAD request to the RQ. None of these actions reports an error. Afterwards, however, the HCA reports a command parameter error for SW2HW_MPT. I've copied a snippet of the debug trace output from around the point where the error happens; hopefully it will help to spot the cause.

139903841835 HCR CMD: op_code: LE: d
139903861104 TRACE: mad.c:639/ib_mad_recv_done_handler
139903890876 HCR CMD: in_param_h: LE: 0
139903942869 TRACE: mad.c:644/ib_mad_recv_done_handler
139903993296 HCR CMD: in_param_l: LE: cf616000
139904038413 TRACE: verbs.c:182/ib_create_ah_from_wc
139904094753 HCR CMD: input_modifier: LE: 1e
139904139150 TRACE: mthca_provider.c:447/mthca_ah_create
MTHCA DBG: <mthca_av.c:229> Created UDAV at 8075220/00000000:
139904197065 HCR CMD: out_pram_h: LE: 0
139904333343 [ 0] 01000005
139904384499 HCR CMD: out_pram_l: LE: 0
139904428086 [ 4] 0000ffff
139904478675 HCR CMD: token: LE: ffff0000
139904520156 [ 8] 00003000
139904572059 HCR CMD: op_code_modifier: LE: 0
139904612802 [ c] 00000000
139904667693 HCR CMD: event: LE: 0
139904708526 [10] 00000000
139904758422 HCR CMD 0x18h: LE=80000d, BE=d008000
139904799210 [14] 00000000
139904904204 [18] 00000000
139904946792MTHCA DBG: <mthca_cmd.c:195> HCR_STATUS 40100698= d008000 ? 8000
[1c] 00000002
139905076860 TRACE: mthca_av.c:235/mthca_create_ah
139905112329 TRACE: mthca_av.c:243/mthca_create_ah
139905147672 TRACE: mthca_provider.c:460/mthca_ah_create
636959 DEBUG: <mthca_qp.c:1908> Start mthca_arbel_post_send. qp 0 wr 8d984b8
139905324432 TRACE: mthca_qp.c:1911/mthca_arbel_post_send
139905359505 TRACE: mthca_qp.c:1939/mthca_arbel_post_send
139905418932 TRACE: mthca_qp.c:1949/mthca_arbel_post_send
636959 DEBUG: <mthca_qp.c:1953> qp is not direct access and wqe: 0x8d84400
139905541467 TRACE: mthca_qp.c:1954/mthca_arbel_post_send
139905577647 TRACE: mthca_qp.c:1964/mthca_arbel_post_send
139905614565 TRACE: mthca_qp.c:2057/mthca_arbel_post_send
139905669411 TRACE: mthca_qp.c:2076/mthca_arbel_post_send
139905705726 TRACE: mthca_qp.c:2078/mthca_arbel_post_send
636959 DEBUG: <mthca_qp.c:2087> wr sg length 0x18, lkey 0x80001900, local addr 0xce2393b8
139905831060 TRACE: mthca_qp.c:2078/mthca_arbel_post_send
636959 DEBUG: <mthca_qp.c:2087> wr sg length 0xe8, lkey 0x80001900, local addr 0xce2393d0
139905956322 TRACE: mthca_qp.c:2092/mthca_arbel_post_send
636959 DEBUG: <mthca_qp.c:2101> wr id 148473016
139906069875 TRACE: mthca_qp.c:2120/mthca_arbel_post_send
139906106379 TRACE: mthca_qp.c:2128/mthca_arbel_post_send
139906142892 TRACE: mthca_qp.c:2131/mthca_arbel_post_send
139906178640 TRACE: mthca_qp.c:2135/mthca_arbel_post_send
139906214703 TRACE: mthca_qp.c:2158/mthca_arbel_post_send
139906250568 TRACE: mthca_qp.c:2160/mthca_arbel_post_send
636959 DEBUG: <mthca_qp.c:2162> End mthca_arbel_post_send. err 0
139906369953 TRACE: mad.c:650/ib_mad_recv_done_handler
139906406295 TRACE: mad.c:669/ib_mad_recv_done_handler
139906441539 TRACE: mad.c:672/ib_mad_recv_done_handler
636959 QNX DBG: <mad.c:530> mad_priv->header.mad_list.mad_queue->list.prev 88b0a2c
139906578384 TRACE: mthca_qp.c:2177/mthca_arbel_post_receive
139906614168 TRACE: mthca_qp.c:2194/mthca_arbel_post_receive
139906649295 TRACE: mthca_qp.c:2196/mthca_arbel_post_receive
139906689129 TRACE: mad.c:674/ib_mad_recv_done_handler
139906723068 TRACE: mad.c:676/ib_mad_recv_done_handler
636959 QNX DBG: <linux_cache.c:151> kmem_cache 5 free object=88b0724
139906793007 HCR CMD: Status Return: : 3

Again, thanks for your help!

Best,
Yicheng


Jack Morgenstein <[EMAIL PROTECTED]>
01/01/2008 01:03 AM

To
[email protected]
cc
Yicheng Jia <[EMAIL PROTECTED]>, Roland Dreier <[EMAIL PROTECTED]>
Subject
Re: [ofa-general] synchronize commands issued to MTHCA


On Tuesday 01 January 2008 03:02, Yicheng Jia wrote:

Does your HCA use on-board memory?
(Run "lspci" and look at the "Mellanox" lines. You have on-board memory if you see either:

PCI bridge: Mellanox Technologies MT23108 InfiniHost HCA bridge (rev a1)
InfiniBand: Mellanox Technologies MT23108 InfiniHost HCA (rev a1)

OR:

InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode)
)

In that case, when you create an AH in kernel space (file mthca_av.c, procedure mthca_create_ah()), you will enter the following flow:

	if (ah->type == MTHCA_AH_ON_HCA) {
		memcpy_toio(dev->av_table.av_map + index * MTHCA_AV_SIZE,
			    av, MTHCA_AV_SIZE);
		kfree(av);
	}

Roland, do you think that the memcpy_toio() call might mess things up? Maybe we need "wmb()" or "mmiowb()" here as well?

- Jack

> Hi Roland,
>
> Thanks for your reply!
>
> Actually I'm working on porting the IB driver to the QNX platform. I have
> resumed the work started by my former colleague, and I just found that the
> synchronization code (dev->cmd.poll_sem and dev->cmd.hcr_mutex) had been
> deleted for an unknown reason. After adding this code back, the driver runs
> much more smoothly.
>
> However, I still get a command execution error which I believe is related
> to command synchronization. The problem is that when "Created UDAV" is
> called while the SW2HW_MPT command is being executed, the SW2HW_MPT command
> returns with a bad parameter error.
> Here is my debug trace output:
>
> 139903841835 HCR CMD: op_code: LE: d
> 139903861104 TRACE: mad.c:639/ib_mad_recv_done_handler
> 139903890876 HCR CMD: in_param_h: LE: 0
> 139903942869 TRACE: mad.c:644/ib_mad_recv_done_handler
> 139903993296 HCR CMD: in_param_l: LE: cf616000
> 139904038413 TRACE: verbs.c:182/ib_create_ah_from_wc
> 139904094753 HCR CMD: input_modifier: LE: 1e
> 139904139150 TRACE: mthca_provider.c:447/mthca_ah_create
> MTHCA DBG: <mthca_av.c:229> Created UDAV at 8075220/00000000:
> 139904197065 HCR CMD: out_pram_h: LE: 0
> 139904333343 [ 0] 01000005
> 139904384499 HCR CMD: out_pram_l: LE: 0
> 139904428086 [ 4] 0000ffff
> 139904478675 HCR CMD: token: LE: ffff0000
> 139904520156 [ 8] 00003000
> 139904572059 HCR CMD: op_code_modifier: LE: 0
> 139904612802 [ c] 00000000
> 139904667693 HCR CMD: event: LE: 0
> 139904708526 [10] 00000000
> 139904758422 HCR CMD 0x18h: LE=80000d, BE=d008000
> 139904799210 [14] 00000000
> 139904904204 [18] 00000000
> 139904946792MTHCA DBG: <mthca_cmd.c:195> HCR_STATUS 40100698= d008000 ? 8000
> [1c] 00000002
> 139905076860 TRACE: mthca_av.c:235/mthca_create_ah
> 139905112329 TRACE: mthca_av.c:243/mthca_create_ah
> 139905147672 TRACE: mthca_provider.c:460/mthca_ah_create
> ....
> 139906793007 HCR CMD: Status Return: : 3
>
> Do you have any idea?
>
> Thanks and have a good new year!
> Yicheng
>
>
> Roland Dreier <[EMAIL PROTECTED]>
> 12/28/2007 11:39 PM
>
> To
> Yicheng Jia <[EMAIL PROTECTED]>
> cc
> [email protected]
> Subject
> Re: [ofa-general] synchronize commands issued to MTHCA
>
>
> > I'm using OFED-1.0 and the problem I believe is related to command
> > synchronization of HCA. The host issues a MAD_INF command at first and
> > then a SW2HW_MTP command without waiting for the completion of the first
> > command. Both of commands return with bad parameters error.
>
> I guess you mean the MAD_IFC and SW2HW_MPT commands?
> I've never heard of a problem like that -- more details about your
> hardware/software config and the exact symptoms you see would be helpful
> in debugging.
>
> Anyway OFED 1.0 is ancient by now -- you are much better off just
> using drivers from the standard kernel. If you must use OFED, then
> OFED 1.2 or even a 1.3 prerelease would be better.
>
> > My question is why there's no synchronization mechanism for the command
> > execution on HCA, can I use "spin_lock" or "sem_wait" to synchronize
> > between every command?
>
> The HCA firmware allows multiple commands to be queued. The
> dev->cmd.event_sem semaphore is used to limit the number of
> outstanding commands to the HCA's capabilities, and the
> dev->cmd.hcr_mutex mutex is used to serialize the actual writing of
> commands to the HCA.
>
> There was a mmiowb() added to mthca_cmd_post() fairly recently that
> might fix your problems if you are running on a large SGI Altix system.
>
> - R.
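
The two-level scheme Roland describes (a counting semaphore that caps the number of outstanding commands, plus a mutex around the actual register writes) can be sketched in user-space C with POSIX primitives, which may be closer to what a QNX port has available than the kernel semaphore API. This is only an illustration under assumptions: struct cmd_iface, cmd_iface_init(), cmd_post(), cmd_complete(), MAX_OUTSTANDING and the fake_hcr array are hypothetical stand-ins and not the mthca driver's code, and the go bit position is simply read off the trace in this thread (LE=80000d for opcode 0xd).

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdint.h>

#define MAX_OUTSTANDING 4   /* hypothetical; the real limit comes from the firmware */
#define HCR_WORDS       7   /* assumed: seven 32-bit words, last one at offset 0x18 */

struct cmd_iface {
    sem_t           event_sem;           /* counts free command slots */
    pthread_mutex_t hcr_mutex;           /* serializes writes to the HCR */
    uint32_t        fake_hcr[HCR_WORDS]; /* stand-in for the memory-mapped HCR */
    int             posted;              /* number of commands written so far */
};

static int cmd_iface_init(struct cmd_iface *c)
{
    c->posted = 0;
    if (sem_init(&c->event_sem, 0, MAX_OUTSTANDING) != 0)
        return -1;
    if (pthread_mutex_init(&c->hcr_mutex, NULL) != 0) {
        sem_destroy(&c->event_sem);
        return -1;
    }
    return 0;
}

/* Claim a command slot, then write the command words under the mutex. */
static int cmd_post(struct cmd_iface *c, uint64_t in_param,
                    uint32_t in_modifier, uint16_t opcode)
{
    assert(opcode <= 0xfff);  /* the opcode field is assumed to be 12 bits wide */

    /* Block while MAX_OUTSTANDING commands are already in flight. */
    while (sem_wait(&c->event_sem) != 0) {
        if (errno != EINTR)
            return -1;
    }

    /* Only one thread at a time may touch the command register. */
    pthread_mutex_lock(&c->hcr_mutex);
    c->fake_hcr[0] = (uint32_t)(in_param >> 32);
    c->fake_hcr[1] = (uint32_t)(in_param & 0xffffffffu);
    c->fake_hcr[2] = in_modifier;
    /* Word written last: go bit (assumed bit 23) plus opcode. */
    c->fake_hcr[6] = (1u << 23) | opcode;
    c->posted++;
    pthread_mutex_unlock(&c->hcr_mutex);
    return 0;
}

/* Called when the completion for a command arrives: free its slot. */
static void cmd_complete(struct cmd_iface *c)
{
    sem_post(&c->event_sem);
}
```

In this sketch a caller would pair every successful cmd_post() with exactly one cmd_complete() from the completion path. The semaphore is released only on completion, while the mutex is held just for the duration of the register writes, so several commands can be outstanding without their writes interleaving.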
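
For reading the trace in this thread, the last HCR word ("HCR CMD 0x18h: LE=80000d") and the final "Status Return: 3" can be decoded with a small helper. The bit positions below (opcode in the low 12 bits, opcode modifier starting at bit 12 and assumed 4 bits wide, event bit 22, go bit 23, status in the top byte) and the status names are assumptions based on the Linux mthca driver as I recall it; the trace is at least consistent with them, since 0x80000d is the go bit plus opcode 0xd, and BE=d008000 is the same word byte-swapped (0x0d008000). The names hcr_fields, hcr_encode(), hcr_decode() and hcr_status_name() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of the HCR word at offset 0x18. */
enum {
    HCR_OPMOD_SHIFT  = 12,
    HCR_E_BIT        = 22,
    HCR_GO_BIT       = 23,
    HCR_STATUS_SHIFT = 24
};

struct hcr_fields {
    unsigned opcode;  /* command opcode, e.g. 0xd in the trace */
    unsigned opmod;   /* opcode modifier */
    unsigned event;   /* 1 if completion is reported by event */
    unsigned go;      /* 1 while the command is owned by the HCA */
    unsigned status;  /* completion status, valid once go is clear */
};

/* Build the word the driver writes to start a command. */
static uint32_t hcr_encode(unsigned opcode, unsigned opmod, int event)
{
    assert(opcode <= 0xfff && opmod <= 0xf);
    return (1u << HCR_GO_BIT) |
           (event ? (1u << HCR_E_BIT) : 0u) |
           ((uint32_t)opmod << HCR_OPMOD_SHIFT) |
           opcode;
}

/* Split an HCR word back into its fields. */
static struct hcr_fields hcr_decode(uint32_t word)
{
    struct hcr_fields f;
    f.opcode = word & 0xfff;
    f.opmod  = (word >> HCR_OPMOD_SHIFT) & 0xf;
    f.event  = (word >> HCR_E_BIT) & 1;
    f.go     = (word >> HCR_GO_BIT) & 1;
    f.status = word >> HCR_STATUS_SHIFT;
    return f;
}

/* Only the first few status codes are listed in this sketch. */
static const char *hcr_status_name(unsigned status)
{
    switch (status) {
    case 0x00: return "OK";
    case 0x01: return "INTERNAL_ERR";
    case 0x02: return "BAD_OP";
    case 0x03: return "BAD_PARAM";
    default:   return "UNLISTED";
    }
}
```

Under these assumptions, hcr_decode(0x80000d) gives opcode 0xd with the go bit set and no event bit, which matches a polled SW2HW_MPT command, and status 3 maps to the bad parameter error reported above.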
