On 25 July 2012 06:14, Joseph Glanville <[email protected]> wrote:
> On 25 July 2012 05:59, Bart Van Assche <[email protected]> wrote:
>> On 07/24/12 19:50, Joseph Glanville wrote:
>>> On 25 July 2012 03:53, Bart Van Assche <[email protected]> wrote:
>>>> On 07/24/12 15:16, Joseph Glanville wrote:
>>>>> I have been seeing this KP occur about every 3 days on our staging 
>>>>> cluster.
>>>>> I am not exactly sure what the root cause would be.. I assume this
>>>>> would be a bug in SCST.
>>>>> The kernel is a 3.2.14 with Ubuntu patch series applied and Bart's SRP
>>>>> HA patches.
>>>>
>>>> It would help if you could tell us a bit more about your setup. It looks
>>>> like SCST is running in dom0, and an IB workload in domU ? If so, which
>>>> workload was running in domU ?
>>>
>>> There is no IB workload in the domU's.
>>> In this particular case there are 2 dom0s connected together, both
>>> acting as SRP targets and initiators.
>>> There are sometimes VMs running on these dom0s, but they aren't
>>> currently in production so they aren't doing very much at the moment.
>>>
>>> The workload is typically one of adding and removing LUNs in
>>> ini_groups, rescanning the host to ensure they are removed cleanly, etc.
>>> As far as I can tell this would have to manifest as a race condition,
>>> as it can go for about 2 or so weeks without occurring.
>>> Also worth noting is that I have a similar setup running on 2.6.32
>>> with no issues, also a pvops dom0 using SCST and ib_srp.
>>>
>>> Could it be your patch series introduced the bug? Those are the only
>>> patches we have in our tree that affect SRP.
>>
>> You might be hitting a device removal bug in the SCSI core. It would be
>> appreciated if you could retest with the srp-ha branch of this kernel
>> tree: http://github.com/bvanassche/linux. That tree contains Linux
>> kernel 3.5 + SCSI 3.6-rc1 + latest (yet to be posted) srp-ha patch series.
>>
>> Bart.
>
> Will do.
>
> --
> CTO | Orion Virtualisation Solutions | www.orionvm.com.au
> Phone: 1300 56 99 52 | Mobile: 0428 754 846

Hi Bart,

I managed to trigger the bug (kernel oops on the null deref), but it
didn't KP this time. This is with the SRP HA patches removed.
To trigger I was merely removing luns and rescanning on the initiator
many times per minute for a few hours.
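
For reference, the remove/rescan loop described above could be sketched
roughly as below. This is only an illustration, not the exact script
used: the target, group, device and host names are hypothetical
placeholders, the re-add step is an assumption added so the loop can
repeat, and the sketch pretends both the SCST management file and the
initiator's scan file are reachable from one machine, whereas in a
two-host setup the writes would be split between target and initiator.
It also only issues a wildcard scan; how stale devices get dropped on
the initiator is left out.

```python
import time

# Assumed SCST sysfs root for the ib_srpt target driver.
SCST_ROOT = "/sys/kernel/scst_tgt/targets/ib_srpt"


def cycle_writes(target, group, device, lun, host):
    """Return the (path, payload) sysfs writes for one remove/rescan cycle.

    All arguments are hypothetical placeholders for a real setup.
    """
    mgmt = f"{SCST_ROOT}/{target}/ini_groups/{group}/luns/mgmt"
    scan = f"/sys/class/scsi_host/host{host}/scan"
    return [
        (mgmt, f"del {lun}"),           # target side: unmap the LUN
        (scan, "- - -"),                # initiator side: wildcard rescan
        (mgmt, f"add {device} {lun}"),  # assumption: map it again
        (scan, "- - -"),                # initiator side: pick it up again
    ]


def sysfs_write(path, payload):
    """Write one command string to a sysfs attribute (needs root)."""
    with open(path, "w") as f:
        f.write(payload)


def run(cycles, interval_s, writes, write=sysfs_write):
    """Apply the writes `cycles` times, sleeping between cycles."""
    for _ in range(cycles):
        for path, payload in writes:
            write(path, payload)
        time.sleep(interval_s)


# Example (hypothetical names), a few cycles per minute for ~3 hours:
# run(1000, 10, cycle_writes("tgt0", "grp0", "disk0", 1, 3))
```

Passing a different `write` callable makes it possible to dry-run the
sequence and log the commands without touching sysfs.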

I will pull down the tree you mentioned and try to reproduce.

Joseph.

-- 
CTO | Orion Virtualisation Solutions | www.orionvm.com.au
Phone: 1300 56 99 52 | Mobile: 0428 754 846