+release

What are we doing here? I think this needs to be resolved ASAP, as I know the netvirt 3node jobs can get into a bad state and be stuck for the full 6-hour timeout. This is surely affecting our Jenkins queue:
https://jenkins.opendaylight.org/releng/view/netvirt-csit/job/netvirt-csit-3node-openstack-newton-nodl-v2-upstream-transparent-carbon/buildTimeTrend

Can we merge the revert patch, or do we need to disable the 3node jobs for now? We can file a bug, but that is just overhead if we can get this resolved soon.

Thanks,
JamO

On 03/21/2017 10:15 PM, Luis Gomez wrote:
> Hi Jamo, I can confirm the controller patch introduced the regression.
> After building the revert:
>
> https://git.opendaylight.org/gerrit/#/c/53643/
>
> things go back to normal in the cluster test:
>
> https://logs.opendaylight.org/sandbox/jenkins091/openflowplugin-csit-3node-clustering-only-carbon/4/archives/log.html.gz
>
> BR/Luis
>
>> On Mar 21, 2017, at 3:22 PM, Luis Gomez <[email protected]> wrote:
>>
>> Right, something really broke the ofp cluster in carbon between Mar 19th 7:22 AM UTC
>> and Mar 20th 10:53 AM UTC. The patch you point out is in that interval.
>>
>> It seems the controller cluster test in carbon is far from stable, so it is difficult
>> to tell when the regression was introduced by looking at it:
>>
>> https://jenkins.opendaylight.org/releng/view/CSIT-3node/job/controller-csit-3node-clustering-only-carbon/
>>
>> Finally, how do the controller people verify patches? I do not see any patch test job
>> like we have in other projects.
>>
>> BR/Luis
>>
>>> On Mar 21, 2017, at 2:15 PM, Jamo Luhrsen <[email protected]> wrote:
>>>
>>> +openflowplugin and controller teams
>>>
>>> TL;DR
>>>
>>> I think this controller patch caused some breakage in our 3node CSIT:
>>>
>>> https://git.opendaylight.org/gerrit/#/c/49265/
>>>
>>> It breaks controller functionality and also gives us a ton more logs,
>>> which creates other problems.
>>>
>>> I think 3node ofp CSIT is broken too:
>>>
>>> https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-3node-clustering-only-carbon/
>>>
>>> I ran some CSIT tests in the sandbox (jobs 1-4) here:
>>>
>>> https://jenkins.opendaylight.org/sandbox/job/netvirt-csit-3node-openstack-newton-nodl-v2-jamo-upstream-transparent-carbon/
>>>
>>> You can see job 1 is yellow and the rest are 100% pass. They are using
>>> distros from nexus as they were published, from *4500.zip down to *4997.zip.
>>>
>>> The only difference between 4500 and 4499 is the controller patch above.
>>>
>>> Of course, something in our env/CSIT could have changed too, but the karaf
>>> logs are definitely bigger in netvirt CSIT. We collect just the exceptions
>>> in a single file, and it is ~30x bigger in a failed job.
>>>
>>> Thanks,
>>> JamO
>>>
>>> On 03/21/2017 01:49 PM, Jamo Luhrsen wrote:
>>>> The current theory is that our karaf.log is getting a lot more messages
>>>> now. I found one job that didn't get aborted. It did run for 5h33m though:
>>>>
>>>> https://jenkins.opendaylight.org/releng/view/netvirt-csit/job/netvirt-csit-3node-openstack-newton-nodl-v2-upstream-transparent-carbon/376/
>>>>
>>>> The robot logs didn't get created: the generated output.xml was so big
>>>> that the tool that makes the .html reports failed or quit. Locally, I
>>>> could create the .html from that output.xml.
>>>>
>>>> We have had this trouble before, where all of a sudden a lot more logging
>>>> comes in and it breaks our jobs.
>>>>
>>>> Still getting to the bottom of it...
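>>>>
>>>> For anyone who wants to regenerate the reports locally, something like
>>>> this minimal sketch should work (assuming Robot Framework is installed
>>>> and the output.xml has been downloaded from the job; the file names are
>>>> just placeholders):
>>>>
>>>>     # Rebuild log.html and report.html from an output.xml that the
>>>>     # Jenkins publisher could not handle. robot.rebot mirrors the
>>>>     # `rebot` command-line tool.
>>>>     from robot import rebot
>>>>     rebot("output.xml", log="log.html", report="report.html")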
>>>>
>>>> JamO
>>>>
>>>> On 03/21/2017 10:39 AM, Jamo Luhrsen wrote:
>>>>> Netvirt, Integration,
>>>>>
>>>>> We need to figure out and fix what's wrong with the netvirt 3node carbon
>>>>> CSIT.
>>>>>
>>>>> The jobs are timing out at our Jenkins 6h limit, which means we don't get
>>>>> any logs either.
>>>>>
>>>>> This will likely cause a large backlog in our Jenkins queue.
>>>>>
>>>>> If anyone has cycles at the moment to help, catch me on IRC.
>>>>>
>>>>> Initially, with Alon's help, we know that this job [0] was not seeing this
>>>>> trouble, while this job [1] was.
>>>>>
>>>>> The difference in ODL patches between the two distros that were used
>>>>> includes some controller patches that seem cluster related. Here are all
>>>>> the patches that came in between the two:
>>>>>
>>>>> controller  https://git.opendaylight.org/gerrit/49265  BUG-5280: add frontend state lifecycle
>>>>> controller  https://git.opendaylight.org/gerrit/49738  BUG-2138: Use correct actor context in shard lookup.
>>>>> controller  https://git.opendaylight.org/gerrit/49663  BUG-2138: Fix shard registration with ProxyProducers.
>>>>>
>>>>> From the looks of the console log (all we have), it seems that each test
>>>>> case is just taking a long time. I don't know more than that at the moment.
>>>>>
>>>>> JamO
>>>>>
>>>>> [0] https://jenkins.opendaylight.org/releng/view/netvirt-csit/job/netvirt-csit-3node-openstack-newton-nodl-v2-upstream-transparent-carbon/373/
>>>>> [1] https://jenkins.opendaylight.org/releng/view/netvirt-csit/job/netvirt-csit-3node-openstack-newton-nodl-v2-upstream-transparent-carbon/374/
>>>>>
>>> _______________________________________________
>>> dev mailing list
>>> [email protected]
>>> https://lists.opendaylight.org/mailman/listinfo/dev
>>
>
_______________________________________________
openflowplugin-dev mailing list
[email protected]
https://lists.opendaylight.org/mailman/listinfo/openflowplugin-dev
