Very nice. There was an effort to get fast and green builds back in 2016. There wasn't any strict "must be a green build" requirement before commit at the time, though. Instead, jiras were filed and the expectation was that they'd be cited, or new ones created, pre-commit (looking at the jiras now - this was likely followed for a while, with many fixes, and eventually got annoying?). I think the enforcement step is absolutely required to get to, and maintain, a green build. We may also want to consider the performance characteristics of tests - each must complete within X seconds. Jiras for reference (including test infra improvements which were not done at the time): HIVE-13503, HIVE-15058, HIVE-14547
This will be painful initially, but eventually it'll be great to be able to commit without having to scan through a bunch of 'known failures', analyze, document, etc.

On Tue, May 15, 2018 at 5:30 PM, Prasanth Jayachandran <pjayachand...@hortonworks.com> wrote:
> Wow! Awesome. This is the 3rd time I remember seeing a green run in >4 yrs. :)
>
> Thanks
> Prasanth

On May 15, 2018, at 5:28 PM, Jesus Camacho Rodriguez <jcama...@apache.org> wrote:
> We have just had the first clean run in a while:
> https://builds.apache.org/job/PreCommit-HIVE-Build/10971/testReport/
>
> I will continue monitoring follow-up runs.
>
> Thanks,
> -Jesús

On 5/14/18, 11:28 PM, "Prasanth Jayachandran" <pjayachand...@hortonworks.com> wrote:
> Wondering if we can add a state transition from "Patch Available" to "Ready To Commit" which can only be triggered by the ptest bot on a green test run.
>
> Thanks
> Prasanth

On Mon, May 14, 2018 at 10:44 PM -0700, "Jesus Camacho Rodriguez" <jcama...@apache.org> wrote:
> I have been working on fixing this situation while commits were still coming in.
>
> All the tests that have been disabled are in:
> https://issues.apache.org/jira/browse/HIVE-19509
> I have created new issues to reenable each of them; they are linked to that issue. Maybe I was slightly aggressive in disabling some of the tests, but that seemed to be the only way to bring the test failures with age count > 1 to zero.
>
> Instead of starting a vote to freeze the commits in another thread, I will start a vote to be stricter wrt committing to master, i.e., only commit if we get a clean QA run.
>
> We can discuss more about this issue over there.
>
> Thanks,
> Jesús

On 5/14/18, 4:11 PM, "Sergey Shelukhin" wrote:
> Can we please make this freeze conditional, i.e. we unfreeze automatically after ptest is clean (as evidenced by a clean HiveQA run on a given JIRA).

On 18/5/14, 15:16, "Alan Gates" wrote:
> We should do it in a separate thread so that people can see it with the [VOTE] subject. Some people use that as a filter in their email to know when to pay attention to things.
>
> Alan.

On Mon, May 14, 2018 at 2:36 PM, Prasanth Jayachandran <pjayachand...@hortonworks.com> wrote:
> Will there be a separate voting thread? Or is the voting on this thread sufficient for the lockdown?
>
> Thanks
> Prasanth

On May 14, 2018, at 2:34 PM, Alan Gates wrote:
> I see there's support for this, but people are still pouring in commits. I propose we have a quick vote on this to lock down the commits until we get to green. That way everyone knows we have drawn the line at a specific point. Any commits after that point would be reverted. There isn't a category in the bylaws that fits this kind of vote, but I suggest lazy majority as the most appropriate one (at least 3 votes, more +1s than -1s).
>
> Alan.

On Mon, May 14, 2018 at 10:34 AM, Vihang Karajgaonkar <vih...@cloudera.com> wrote:
> I worked on a few quick-fix optimizations in the Ptest infrastructure over the weekend which reduced the execution time from ~90 min to ~70 min per run. I had to restart Ptest multiple times. I was resubmitting the patches which were in the queue manually, but I may have missed a few. In case you have a patch which is pending pre-commit and you don't see it in the queue, please submit it manually or let me know if you don't have access to the jenkins job. I will continue to work on the sub-tasks in HIVE-19425 and will do some maintenance next weekend as well.

On Mon, May 14, 2018 at 7:42 AM, Jesus Camacho Rodriguez <jcama...@apache.org> wrote:
> Vineet has already been working on disabling those tests that were timing out. I am working on disabling those that have been generating different q files consistently for the last n ptest runs. I am keeping track of all these tests in https://issues.apache.org/jira/browse/HIVE-19509.
>
> -Jesús

On 5/14/18, 2:25 AM, "Prasanth Jayachandran" <pjayachand...@hortonworks.com> wrote:
> +1 on freezing commits until we get repeated green test runs. We should probably disable tests that are flaky (and remember in a jira to reenable them at a later point) to get repeated green test runs.
>
> Thanks
> Prasanth

On Mon, May 14, 2018 at 2:15 AM -0700, "Rui Li" <lirui.fu...@gmail.com> wrote:
> +1 to freezing commits until we stabilize
>
> --
> Best regards!
> Rui Li

On Sat, May 12, 2018 at 6:10 AM, Vihang Karajgaonkar wrote:
> In order to understand the end-to-end precommit flow I would like to get access to the PreCommit-HIVE-Build jenkins script. Does anyone know how I can get that?

On Fri, May 11, 2018 at 2:03 PM, Jesus Camacho Rodriguez <jcama...@apache.org> wrote:
> Bq. For the short term green runs, I think we should @Ignore the tests which are known to be failing for many runs. They are not being addressed anyway. If people think they are important to run, we should fix them and only then re-enable them.
>
> I think that is a good idea, as we would minimize the time that we halt development. We can create a JIRA where we list all tests that were failing and that we have disabled to get the clean run. From that moment on, we will have zero tolerance towards committing with failing tests. And we need to pick up those tests that should not be ignored and bring them back, but passing. If there is no disagreement, I can start working on that.
>
> Once I am done, I can try to help with the infra tickets too.
>
> -Jesús

On 5/11/18, 1:57 PM, "Vineet Garg" wrote:
> +1. I strongly vote for freezing commits and getting our test coverage into an acceptable state. We have been struggling to stabilize branch-3 due to test failures, and releasing Hive 3.0 in the current state would be unacceptable.
>
> Currently there are quite a few test suites which are not even running and are being timed out. We have been committing patches (to both branch-3 and master) without test coverage for these tests. We should immediately figure out what's going on before we proceed with commits.
>
> For reference, the following test suites are timing out on master (https://issues.apache.org/jira/browse/HIVE-19506):
>
> TestDbNotificationListener - did not produce a TEST-*.xml file (likely timed out)
> TestHCatHiveCompatibility - did not produce a TEST-*.xml file (likely timed out)
> TestNegativeCliDriver - did not produce a TEST-*.xml file (likely timed out)
> TestNonCatCallsWithCatalog - did not produce a TEST-*.xml file (likely timed out)
> TestSequenceFileReadWrite - did not produce a TEST-*.xml file (likely timed out)
> TestTxnExIm - did not produce a TEST-*.xml file (likely timed out)
>
> Vineet

On May 11, 2018, at 1:46 PM, Vihang Karajgaonkar <vih...@cloudera.com> wrote:
> +1 There are many problems with the test infrastructure and in my opinion it has now become the number one bottleneck for the project. I was looking at the infrastructure yesterday and I think the current infrastructure (even with its own set of problems) is still under-utilized. To start with, I am planning to increase the number of threads that process the parallel test batches. It needs a restart on the server side. I can do it now, if folks are okay with it. Else I can do it over the weekend when the queue is small.
>
> I listed the improvements which I thought would be useful under https://issues.apache.org/jira/browse/HIVE-19425 but frankly speaking I am not able to devote as much time as I would like to it. I would appreciate it if folks who have some more time can help out.
>
> I think that, to start with, https://issues.apache.org/jira/browse/HIVE-19429 will help a lot. We need to pack more test runs in parallel, and containers provide good isolation.
>
> For the short term green runs, I think we should @Ignore the tests which are known to be failing for many runs. They are not being addressed anyway. If people think they are important to run, we should fix them and only then re-enable them.
>
> Also, I feel we need a light-weight test run which we can run locally before submitting a patch for the full suite. That way minor issues with the patch can be handled locally. Maybe create a profile which runs a subset of important tests which are consistent. We can apply some label indicating that the pre-checkin local tests ran successfully, and only then submit for the full suite.
>
> More thoughts are welcome. Thanks for starting this conversation.

On Fri, May 11, 2018 at 1:27 PM, Jesus Camacho Rodriguez <jcama...@apache.org> wrote:
> I believe we have reached a state (maybe we did reach it a while ago) that is not sustainable anymore, as there are so many tests failing / timing out that it is not possible to verify whether a patch is breaking some critical parts of the system or not. It also seems to me that due to the timeouts (maybe due to infra, maybe not), ptest runs are taking even longer than usual, which in turn creates an even longer queue of patches.
>
> There is an ongoing effort to improve ptest usability (https://issues.apache.org/jira/browse/HIVE-19425), but apart from that, we need to make an effort to stabilize existing tests and bring that failure count to zero.
>
> Hence, I am suggesting *we stop committing any patch before we get a green run*. If someone thinks this proposal is too radical, please come up with an alternative, because I do not think it is OK to have the ptest runs in their current state. Other projects of a certain size (e.g., Hadoop, Spark) are always green; we should be able to do the same.
>
> Finally, once we get to zero failures, I suggest we are less tolerant of committing without getting a clean ptest run. If there is a failure, we need to fix it or revert the patch that caused it, then we continue developing.
>
> Please, let's all work together as a community to fix this issue; that is the only way to get to zero quickly.
>
> Thanks,
> Jesús
>
> PS. I assume the flaky tests will come into the discussion. Let's see first how many of those we have, then we can work to find a fix.
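
For reference, the @Ignore approach discussed above is mechanically simple in a JUnit 4 suite. A minimal sketch, assuming a hypothetical test class and a placeholder follow-up JIRA (only HIVE-19509 is taken from the thread; nothing below is an actual Hive test):

```java
import static org.junit.Assert.assertTrue;

import org.junit.Ignore;
import org.junit.Test;

// Hypothetical test class used only to illustrate the pattern.
public class TestFlakyExample {

  // Disabled as part of the green-build effort tracked in HIVE-19509;
  // a linked follow-up JIRA (HIVE-XXXXX, placeholder) covers re-enabling it
  // once the underlying flakiness is fixed.
  @Ignore("Flaky; disabled under HIVE-19509, re-enable via linked follow-up JIRA")
  @Test
  public void testSomethingFlaky() {
    assertTrue(true); // placeholder body; the real assertions stay in place
  }
}
```

Keeping the JIRA reference in the @Ignore reason string keeps it visible in test reports, which makes the later "bring them back, but passing" step easier to track.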
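
Similarly, the "must complete within X seconds" idea from the first message could be enforced with a per-class timeout rule in JUnit 4; a sketch with a made-up class name and an arbitrary 60-second budget:

```java
import static org.junit.Assert.assertEquals;

import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.Timeout;

// Hypothetical test class; the 60-second limit is an example value only.
public class TestWithTimeBudget {

  // Fails any test method in this class that runs longer than 60 seconds,
  // instead of letting it hang the whole batch.
  @Rule
  public Timeout perTestTimeout = Timeout.seconds(60);

  @Test
  public void testFinishesQuickly() {
    assertEquals(4, 2 + 2); // trivial placeholder assertion
  }
}
```

Per-method limits via @Test(timeout = ...) work as well; a class-level rule is just easier to apply uniformly across a suite.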