+1 to freezing commits until we stabilize

On Sat, May 12, 2018 at 6:10 AM, Vihang Karajgaonkar <vih...@cloudera.com>
wrote:

> In order to understand the end-to-end precommit flow, I would like to get
> access to the PreCommit-HIVE-Build Jenkins script. Does anyone know how I
> can get it?
>
> On Fri, May 11, 2018 at 2:03 PM, Jesus Camacho Rodriguez <
> jcama...@apache.org> wrote:
>
> > Bq. For the short-term green runs, I think we should @Ignore the tests
> > which are known to have been failing for many runs. They are not being
> > addressed anyway. If people think they are important to run, we should
> > fix them and only then re-enable them.
> >
> > I think that is a good idea, as it would minimize the time that we halt
> > development. We can create a JIRA where we list all the tests that were
> > failing and that we have disabled to get the clean run. From that moment,
> > we will have zero tolerance towards committing with failing tests. And we
> > need to pick up those tests that should not be ignored and bring them
> > back, but passing. If there is no disagreement, I can start working on
> > that.
> >
> > Once I am done, I can try to help with infra tickets too.
> >
> > -Jesús
> >
> >
> > On 5/11/18, 1:57 PM, "Vineet Garg" <vg...@hortonworks.com> wrote:
> >
> >     +1. I strongly vote for freezing commits and getting our test
> >     coverage into an acceptable state. We have been struggling to
> >     stabilize branch-3 due to test failures, and releasing Hive 3.0 in
> >     its current state would be unacceptable.
> >
> >     Currently there are quite a few test suites which are not even
> >     running because they time out. We have been committing patches (to
> >     both branch-3 and master) without test coverage from these suites.
> >     We should immediately figure out what’s going on before we proceed
> >     with commits.
> >
> >     For reference, the following test suites are timing out on master
> >     (https://issues.apache.org/jira/browse/HIVE-19506):
> >
> >
> >     TestDbNotificationListener - did not produce a TEST-*.xml file (likely timed out)
> >     TestHCatHiveCompatibility - did not produce a TEST-*.xml file (likely timed out)
> >     TestNegativeCliDriver - did not produce a TEST-*.xml file (likely timed out)
> >     TestNonCatCallsWithCatalog - did not produce a TEST-*.xml file (likely timed out)
> >     TestSequenceFileReadWrite - did not produce a TEST-*.xml file (likely timed out)
> >     TestTxnExIm - did not produce a TEST-*.xml file (likely timed out)
> >
> >
> >     Vineet
> >
> >
> >     On May 11, 2018, at 1:46 PM, Vihang Karajgaonkar
> >     <vih...@cloudera.com> wrote:
> >
> >     +1 There are many problems with the test infrastructure, and in my
> >     opinion it has now become the number one bottleneck for the
> >     project. I was looking at the infrastructure yesterday, and I think
> >     the current setup (even with its own set of problems) is still
> >     under-utilized. To start with, I am planning to increase the number
> >     of threads that process the parallel test batches. That needs a
> >     restart on the server side. I can do it now, if folks are okay with
> >     it. Otherwise I can do it over the weekend when the queue is small.
> >
> >     I listed the improvements which I thought would be useful under
> >     https://issues.apache.org/jira/browse/HIVE-19425, but frankly
> >     speaking I am not able to devote as much time to it as I would
> >     like. I would appreciate it if folks who have some more time could
> >     help out.
> >
> >     I think that, to start with,
> >     https://issues.apache.org/jira/browse/HIVE-19429 will help a lot.
> >     We need to pack more test runs in parallel, and containers provide
> >     good isolation.
> >
> >     For the short-term green runs, I think we should @Ignore the tests
> >     which are known to have been failing for many runs. They are not
> >     being addressed anyway. If people think they are important to run,
> >     we should fix them and only then re-enable them.
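> >
> >     As a minimal sketch of that approach (assuming JUnit 4, which the
> >     Hive unit tests use; the test class and the JIRA id below are
> >     hypothetical placeholders):
> >
> >         import org.junit.Ignore;
> >         import org.junit.Test;
> >
> >         public class TestSomeFlakyFeature {
> >
> >           // Disabled until the tracking JIRA is resolved; re-enable
> >           // only once the test passes consistently again.
> >           @Ignore("Failing for many runs, tracked in HIVE-XXXXX")
> >           @Test
> >           public void testSomething() {
> >             // existing test body stays unchanged
> >           }
> >         }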
> >
> >     Also, I feel we need a light-weight test run which we can run
> >     locally before submitting a patch for the full suite. That way,
> >     minor issues with the patch can be handled locally. Maybe we can
> >     create a profile which runs a subset of important tests that are
> >     consistent, as in the sketch below. We can apply some label
> >     indicating that the local pre-checkin tests ran successfully, and
> >     only then submit for the full suite.
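> >
> >     One possible shape for such a profile (a hypothetical sketch using
> >     JUnit 4 categories; PreCheckinTest and PreCheckinSuite are made-up
> >     names, not existing Hive classes):
> >
> >         import org.junit.experimental.categories.Categories;
> >         import org.junit.experimental.categories.Categories.IncludeCategory;
> >         import org.junit.runner.RunWith;
> >         import org.junit.runners.Suite.SuiteClasses;
> >
> >         // Marker interface used to tag fast, historically consistent
> >         // tests that give a quick local signal.
> >         interface PreCheckinTest {}
> >
> >         // Suite that runs only the tagged tests, so a developer can
> >         // check a patch locally before the full ptest run.
> >         @RunWith(Categories.class)
> >         @IncludeCategory(PreCheckinTest.class)
> >         @SuiteClasses({ /* stable test classes listed here */ })
> >         public class PreCheckinSuite {}
> >
> >     Individual tests would opt in with @Category(PreCheckinTest.class),
> >     and the label we apply could simply record that this suite passed
> >     locally.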
> >
> >     More thoughts are welcome. Thanks for starting this conversation.
> >
> >     On Fri, May 11, 2018 at 1:27 PM, Jesus Camacho Rodriguez
> >     <jcama...@apache.org> wrote:
> >
> >     I believe we have reached a state (maybe we reached it a while ago)
> >     that is not sustainable anymore, as there are so many tests failing
> >     / timing out that it is not possible to verify whether a patch is
> >     breaking some critical parts of the system or not. It also seems to
> >     me that, due to the timeouts (maybe caused by infra, maybe not),
> >     ptest runs are taking even longer than usual, which in turn creates
> >     an even longer queue of patches.
> >
> >     There is an ongoing effort to improve ptest usability
> >     (https://issues.apache.org/jira/browse/HIVE-19425), but apart from
> >     that, we need to make an effort to stabilize existing tests and
> >     bring the failure count to zero.
> >
> >     Hence, I am suggesting *we stop committing any patch until we get a
> >     green run*. If someone thinks this proposal is too radical, please
> >     come up with an alternative, because I do not think it is OK to have
> >     the ptest runs in their current state. Other projects of a certain
> >     size (e.g., Hadoop, Spark) are always green; we should be able to do
> >     the same.
> >
> >     Finally, once we get to zero failures, I suggest we be less tolerant
> >     of committing without a clean ptest run. If there is a failure, we
> >     need to fix it or revert the patch that caused it; then we continue
> >     developing.
> >
> >     Please, let’s all work together as a community to fix this issue;
> >     that is the only way to get to zero quickly.
> >
> >     Thanks,
> >     Jesús
> >
> >     PS. I assume the flaky tests will come into the discussion. Let’s
> >     see first how many of those we have, then we can work to find a fix.



-- 
Best regards!
Rui Li
