+release

What are we doing here? I think this needs to be resolved ASAP, as I know the netvirt 3node jobs can get into a bad state and be stuck for the full 6-hour timeout. This is surely affecting our Jenkins queue:
https://jenkins.opendaylight.org/releng/view/netvirt-csit/job/netvirt-csit-3node-openstack-newton-nodl-v2-upstream-transparent-carbon/buildTimeTrend

Can we merge the revert patch, or do we need to disable the 3node jobs for now? We can file a bug, but that is just overhead if we can get this resolved soon.

Thanks,
JamO

On 03/21/2017 10:15 PM, Luis Gomez wrote:
> Hi Jamo, I can confirm the controller patch introduced the regression.
> After building the revert:
>
> https://git.opendaylight.org/gerrit/#/c/53643/
>
> things go back to normal in the cluster test:
>
> https://logs.opendaylight.org/sandbox/jenkins091/openflowplugin-csit-3node-clustering-only-carbon/4/archives/log.html.gz
>
> BR/Luis
>
>> On Mar 21, 2017, at 3:22 PM, Luis Gomez <[email protected]> wrote:
>>
>> Right, something really broke the ofp cluster in carbon between Mar 19th 7:22 AM UTC
>> and Mar 20th 10:53 AM UTC. The patch you point out is in that interval.
>>
>> It seems the controller cluster test in carbon is far from stable, so it is difficult
>> to tell when the regression was introduced by looking at it:
>>
>> https://jenkins.opendaylight.org/releng/view/CSIT-3node/job/controller-csit-3node-clustering-only-carbon/
>>
>> Finally, how do the controller people verify patches? I do not see any patch test job
>> like we have in other projects.
>>
>> BR/Luis
>>
>>> On Mar 21, 2017, at 2:15 PM, Jamo Luhrsen <[email protected]> wrote:
>>>
>>> +openflowplugin and controller teams
>>>
>>> TL;DR
>>>
>>> I think this controller patch caused some breakage in our 3node CSIT:
>>>
>>> https://git.opendaylight.org/gerrit/#/c/49265/
>>>
>>> It breaks controller functionality and also gives us a ton more logs,
>>> which creates other problems.
>>>
>>> I think 3node ofp CSIT is broken too:
>>>
>>> https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-3node-clustering-only-carbon/
>>>
>>> I ran some CSIT tests in the sandbox (jobs 1-4) here:
>>>
>>> https://jenkins.opendaylight.org/sandbox/job/netvirt-csit-3node-openstack-newton-nodl-v2-jamo-upstream-transparent-carbon/
>>>
>>> You can see job 1 is yellow and the rest are 100% pass. They are using
>>> distros from nexus as they were published, from *4500.zip down to *4997.zip.
>>>
>>> The only difference between 4500 and 4499 is the controller patch above.
>>>
>>> Of course, something in our env/CSIT could have changed too, but the karaf
>>> logs are definitely bigger in netvirt CSIT. We collect just the exceptions
>>> in a single file, and it is ~30x bigger in a failed job.
>>>
>>> Thanks,
>>> JamO
>>>
>>> On 03/21/2017 01:49 PM, Jamo Luhrsen wrote:
>>>> The current theory is that our karaf.log is getting a lot more messages
>>>> now. I found one job that didn't get aborted. It did run for 5h33m though:
>>>>
>>>> https://jenkins.opendaylight.org/releng/view/netvirt-csit/job/netvirt-csit-3node-openstack-newton-nodl-v2-upstream-transparent-carbon/376/
>>>>
>>>> The robot logs didn't get created: the generated output.xml was so big
>>>> that the tool that makes the .html reports failed or quit. Locally, I
>>>> could create the .html from that output.xml.
>>>>
>>>> We have had this trouble before, where all of a sudden a lot more logging
>>>> comes in and it breaks our jobs.
>>>>
>>>> Still getting to the bottom of it...
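>>>>
>>>> For anyone who wants to regenerate the reports locally, something like
>>>> this minimal sketch should work (assuming Robot Framework is installed
>>>> and the output.xml has been downloaded from the job; the file names are
>>>> just placeholders):
>>>>
>>>>     # Rebuild log.html and report.html from an output.xml that the
>>>>     # Jenkins publisher could not handle. robot.rebot mirrors the
>>>>     # `rebot` command-line tool.
>>>>     from robot import rebot
>>>>     rebot("output.xml", log="log.html", report="report.html")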
>>>>
>>>> JamO
>>>>
>>>> On 03/21/2017 10:39 AM, Jamo Luhrsen wrote:
>>>>> Netvirt, Integration,
>>>>>
>>>>> We need to figure out and fix what's wrong with the netvirt 3node carbon
>>>>> CSIT.
>>>>>
>>>>> The jobs are timing out at our Jenkins 6h limit, which means we don't get
>>>>> any logs either.
>>>>>
>>>>> This will likely cause a large backlog in our Jenkins queue.
>>>>>
>>>>> If anyone has cycles at the moment to help, catch me on IRC.
>>>>>
>>>>> Initially, with Alon's help, we know that this job [0] was not seeing this
>>>>> trouble, while this job [1] was.
>>>>>
>>>>> The difference in ODL patches between the two distros that were used
>>>>> includes some controller patches that seem cluster related. Here are all
>>>>> the patches that came in between the two:
>>>>>
>>>>> controller  https://git.opendaylight.org/gerrit/49265  BUG-5280: add frontend state lifecycle
>>>>> controller  https://git.opendaylight.org/gerrit/49738  BUG-2138: Use correct actor context in shard lookup.
>>>>> controller  https://git.opendaylight.org/gerrit/49663  BUG-2138: Fix shard registration with ProxyProducers.
>>>>>
>>>>> From the looks of the console log (all we have), it seems that each test
>>>>> case is just taking a long time. I don't know more than that at the moment.
>>>>>
>>>>> JamO
>>>>>
>>>>> [0] https://jenkins.opendaylight.org/releng/view/netvirt-csit/job/netvirt-csit-3node-openstack-newton-nodl-v2-upstream-transparent-carbon/373/
>>>>> [1] https://jenkins.opendaylight.org/releng/view/netvirt-csit/job/netvirt-csit-3node-openstack-newton-nodl-v2-upstream-transparent-carbon/374/
>>>>>
>>> _______________________________________________
>>> dev mailing list
>>> [email protected]
>>> https://lists.opendaylight.org/mailman/listinfo/dev
>>
>
_______________________________________________
openflowplugin-dev mailing list
[email protected]
https://lists.opendaylight.org/mailman/listinfo/openflowplugin-dev
