Todd,

Sorry for the late reply.

> I am having similar issues with the same firmware.
> Can you give me some more details?

I have this bug on the MT25204 (InfiniHost III Lx HCA memfree rev a0 PCI Express), on 30 nodes, with firmware 1.2.0 (the latest, from 26 December 2006). Note that I have no problem with my MT23108 (InfiniHost 2MiB rev a1 PCI-X): that board reports a normal error when the SRQ is empty and a new buffer arrives.

> Did you make the changes on the driver side or the application?

In my application, which uses libibverbs directly. I only changed the maximum number of completion entries in the completion queue (ibv_create_cq()) and the maximum number of receive buffers (ibv_create_srq()), and I always post at least as many buffers to the SRQ as my application's design requires (my application cannot receive more than N buffers without consuming some of them and telling the sender it is OK to continue). With these changes, my application can no longer receive more buffers than are posted in the SRQ, and the CQ always has enough room for all completion events (receive and send completions).

So I no longer get the catastrophic error, but I sometimes get "ib_mthca 0000:0c:00.0: Async event for bogus QP 00180405". In that case the buffer was sent correctly (no error on the sender), but the receiver was not woken up in its ibv_get_cq_event().

> If on the driver, can you point me in the right direction to make those
> changes?

Perhaps your change is only to increase your SRQ/CQ length, post enough buffers, and add something that wakes you up from ibv_get_cq_event() after a timeout so you can check whether ibv_poll_cq() finds anything. It also seems that the OpenFabrics people are working on this bug: "iser/lustre memfree issues".

Olivier
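
A minimal sketch of the two ideas above, in case it helps. It does not call libibverbs itself: cq_depth_needed() and wait_readable() are hypothetical helper names, and the sketch assumes the completion channel's file descriptor can be waited on with poll(), so that a missed event cannot block the receiver forever.

```c
#include <errno.h>
#include <poll.h>

/* Sizing rule described above: the CQ must have room for every receive
 * and every send completion that can be outstanding at the same time.
 * (Hypothetical helper, not part of libibverbs.) */
static int cq_depth_needed(int max_recv_wr, int max_send_wr)
{
    return max_recv_wr + max_send_wr;
}

/* Wait until fd becomes readable or timeout_ms elapses.
 * Returns 1 if readable, 0 on timeout, -1 on error.
 * (Hypothetical helper; a signal interruption restarts the full timeout.) */
static int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int rc;

    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;              /* poll() itself failed */
    if (rc == 0)
        return 0;               /* timeout: nothing signalled */
    return (pfd.revents & POLLIN) ? 1 : -1;
}
```

With libibverbs, the descriptor would be the fd field of the struct ibv_comp_channel attached to the CQ. When wait_readable() returns 1, call ibv_get_cq_event() as usual; when it returns 0, call ibv_poll_cq() anyway to pick up any completion whose event was lost. The timeout value is a tuning choice, not something prescribed here.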
