correction in my email below. I meant "in my opinion it *has now* become number one bottleneck for the project" (worst place for a typo I guess)
On Fri, May 11, 2018 at 1:46 PM, Vihang Karajgaonkar <vih...@cloudera.com> wrote:

> +1 There are many problems with the test infrastructure and in my opinion
> it has not become number one bottleneck for the project. I was looking at
> the infrastructure yesterday and I think the current infrastructure (even
> with its own set of problems) is still under-utilized. To start with, I am
> planning to increase the number of threads that process the parallel test
> batches. It needs a restart on the server side. I can do it now, if folks
> are okay with it. Else I can do it over the weekend when the queue is
> small.
>
> I listed the improvements which I thought would be useful under
> https://issues.apache.org/jira/browse/HIVE-19425 but frankly speaking I
> am not able to devote as much time to it as I would like. I would
> appreciate it if folks who have some more time could help out.
>
> I think, to start with, https://issues.apache.org/jira/browse/HIVE-19429
> will help a lot. We need to pack in more parallel test runs, and
> containers provide good isolation.
>
> For short-term green runs, I think we should @Ignore the tests which are
> known to have been failing for many runs. They are not being addressed
> anyway. If people think they are important to run, we should fix them and
> only then re-enable them.
>
> Also, I feel we need a lightweight test run which we can run locally
> before submitting for the full suite. That way, minor issues with the
> patch can be handled locally. Maybe create a profile which runs a subset
> of important tests which are consistent. We can apply a label indicating
> that the pre-checkin local tests ran successfully, and only then submit
> for the full suite.
>
> More thoughts are welcome. Thanks for starting this conversation.
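[Editor's note] The "lightweight pre-checkin run" idea could look roughly like the shell sketch below: run a small, stable subset locally and only declare the patch ready for the full suite if everything passes. The test names and the `run_one_test` runner are hypothetical placeholders, not actual Hive tooling (in practice `run_one_test` would wrap something like a single-test Maven invocation).

```shell
#!/bin/sh
# Sketch of a local "pre-checkin" gate. All names below are
# illustrative placeholders, not real Hive test infrastructure.

run_one_test() {
  # Stand-in for a real single-test invocation
  # (e.g. a "mvn test -Dtest=$1"-style run in a real setup).
  # Always passes here so the sketch is self-contained and runnable.
  echo "running $1" >/dev/null
}

precheckin() {
  failed=0
  for t in "$@"; do
    if run_one_test "$t"; then
      echo "PASS $t"
    else
      echo "FAIL $t"
      failed=1
    fi
  done
  if [ "$failed" -eq 0 ]; then
    echo "pre-checkin subset green: submit for full suite"
  else
    echo "pre-checkin subset red: fix locally first"
  fi
  return "$failed"
}

# Hypothetical subset of consistent, important tests:
precheckin TestParseDriver TestCliSmoke
```

The point of the gate is only to catch obvious patch problems cheaply; the full ptest suite still runs afterwards.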
> On Fri, May 11, 2018 at 1:27 PM, Jesus Camacho Rodriguez
> <jcama...@apache.org> wrote:
>
>> I believe we have reached a state (maybe we reached it a while ago)
>> that is not sustainable anymore, as there are so many tests failing /
>> timing out that it is not possible to verify whether a patch is breaking
>> some critical parts of the system or not. It also seems to me that, due
>> to the timeouts (maybe due to infra, maybe not), ptest runs are taking
>> even longer than usual, which in turn creates an even longer queue of
>> patches.
>>
>> There is an ongoing effort to improve ptest usability
>> (https://issues.apache.org/jira/browse/HIVE-19425), but apart from that,
>> we need to make an effort to stabilize existing tests and bring the
>> failure count to zero.
>>
>> Hence, I am suggesting *we stop committing any patch before we get a
>> green run*. If someone thinks this proposal is too radical, please come
>> up with an alternative, because I do not think it is OK to have the
>> ptest runs in their current state. Other projects of a certain size
>> (e.g., Hadoop, Spark) are always green; we should be able to do the
>> same.
>>
>> Finally, once we get to zero failures, I suggest we be less tolerant of
>> committing without a clean ptest run. If there is a failure, we need to
>> fix it or revert the patch that caused it, and then continue developing.
>>
>> Please, let's all work together as a community to fix this issue; that
>> is the only way to get to zero quickly.
>>
>> Thanks,
>> Jesús
>>
>> PS. I assume the flaky tests will come into the discussion. Let's see
>> first how many of those we have, then we can work to find a fix.
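[Editor's note] One way to put numbers on the flaky tests mentioned in the PS is to rerun a suspect test many times and count failures before deciding whether to fix, @Ignore, or revert. A minimal, self-contained sketch follows; the command passed to `flaky_check` is a stand-in for a real single-test invocation.

```shell
#!/bin/sh
# Sketch: rerun a test command N times and report how often it fails.
# A deterministic test fails 0/N or N/N times; anything in between
# suggests flakiness. The command argument is a placeholder for a
# real test invocation.

flaky_check() {
  cmd="$1"
  runs="$2"
  fails=0
  i=1
  while [ "$i" -le "$runs" ]; do
    if ! $cmd >/dev/null 2>&1; then
      fails=$((fails + 1))
    fi
    i=$((i + 1))
  done
  echo "$fails/$runs runs failed"
}

flaky_check true 5    # always-passing command -> 0/5 runs failed
flaky_check false 3   # always-failing command -> 3/3 runs failed
```

A per-test failure rate like this makes the later discussion concrete: consistently failing tests can be disabled or fixed outright, while intermittent ones are the genuine flaky cases.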