Hi -

Sent from my iPhone
> On Aug 23, 2019, at 11:06 AM, Allen Wittenauer <a...@effectivemachines.com.invalid> wrote:
>
>> On Aug 23, 2019, at 9:44 AM, Gavin McDonald <ipv6g...@gmail.com> wrote:
>> The issue is, and I have seen this multiple times over the last few weeks, that Hadoop pre-commit builds, HBase pre-commit, HBase nightly, HBase flaky tests and similar are running on multiple nodes at the same time.
>
> The precommit jobs are exercising potential patches/PRs… of course there are going to be multiples running on different nodes simultaneously. That's how CI systems work.

Peanut gallery comment: why was Hadoop invented in the first place? To take long-running tests of new spam-filtering algorithms and distribute them across multiple computers, cutting test runs from days to hours to minutes. I really think there needs to be a balance between simple integration tests and full integration. Here is an example: before RCs of Tika and POI are voted on, 100,000s of documents are scanned and the results are compared, while the regular builds run simpler integration tests. Could the Hadoop ecosystem find a similar balance between precommit and daily integration? I know it is a messy situation and there is a trade-off ...

Regards,
Dave

>> It seems that one PR or one commit is triggering a job, or jobs, that split into part jobs that run on multiple nodes.
>
> Unless there is a misconfiguration (and I haven't been directly involved with Hadoop in a year+), that's incorrect. There is just that much traffic on these big projects. To put this in perspective, the last time I did some analysis, in March of this year, it worked out to be ~10 new JIRAs with patches attached for Hadoop _a day_. (Assuming an equal distribution across the year/month/week/day, which of course isn't true: weekdays are higher, weekends lower.) If there are multiple iterations on those 10, well… and then there are the PRs...
>
>> Just yesterday I saw Hadoop and HBase taking up nearly 45 of 50 H* nodes. Some of these jobs take many hours. Some of these jobs that take many hours are triggered on a PR or a commit that could be something as trivial as a typo. This is unacceptable.
>
> The size of the Hadoop jobs is one of the reasons why Yahoo!/Oath gave the ASF machine resources. (I guess that may have happened before you were part of INFRA.) Also, the job sizes for projects using Yetus are SIGNIFICANTLY reduced: the full test suite is about 20 hours. Big projects are just that, big.
>
>> HBase in particular is a Hadoop-related project and should be limiting its jobs to the Hadoop-labelled nodes H0-H21, but they are running on any and all nodes.
>
> Then you should take that up with the HBase project.
>
>> It is all too familiar to see one job running on a dozen or more executors; the build queue is now constantly in the hundreds, despite the fact that we have nearly 100 nodes. This must stop.
>
> 'Nearly 100 nodes': but how many of those are dedicated to specific projects? A third of them are just for Cassandra and Beam.
>
> Also, take a look at the input on the jobs rather than just looking at the job names.
>
> It's probably also worth pointing out that since INFRA mucked with the GitHub pull request builder settings, they've caused a stampeding-herd problem. As soon as someone runs a scan on the project, ALL of the PRs get triggered at once, regardless of whether there has been an update to the PR or not.
>
>> Meanwhile, Chris informs me his single job to deploy to Nexus has been waiting for 3 days.
>
> It sure sounds like Chris' job is doing something weird, though, given it appears it is switching nodes and such mid-job based upon their description. That's just begging to starve.
>
> ===
>
> Also, looking at the queue this morning (~11 AM EDT), a few observations:
>
> * The 'ubuntu' queue is pretty busy, while 'hadoop' has quite a few open slots.
>
> * There are lots of jobs in the queue that don't support multiple runs, so they are self-starving and the problem lies with the project, not the infrastructure.
>
> * A quick pass shows that some of the jobs in the queue are tied to specific nodes, or have such a limited set of nodes as possible hosts, that _of course_ they are going to get starved out. Again, a project-level problem.
>
> * Just looking at the queue size is clearly not going to provide any real data as to what the problems are without also looking into why those jobs are in the queue to begin with.
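On that last point, the Jenkins queue API does report why each item is waiting, so it is fairly easy to pull that data rather than eyeballing the queue page. A rough, untested sketch in Python follows; it assumes anonymous read access to the master at builds.apache.org, so adjust the URL and add credentials if your setup needs them.

#!/usr/bin/env python3
# Rough sketch: summarize why items are sitting in the Jenkins build queue.
# Assumes anonymous read access to https://builds.apache.org (an assumption;
# change the URL and add authentication for other masters).
import json
import urllib.request
from collections import Counter

JENKINS_URL = "https://builds.apache.org"  # assumed master URL

def fetch_queue(base_url=JENKINS_URL):
    """Return the list of queued items from the Jenkins JSON API."""
    with urllib.request.urlopen(base_url + "/queue/api/json") as resp:
        return json.load(resp)["items"]

def summarize(items):
    """Count queued items by the 'why' string Jenkins reports for each."""
    reasons = Counter(item.get("why") or "unknown" for item in items)
    for reason, count in reasons.most_common():
        print(f"{count:4d}  {reason}")

if __name__ == "__main__":
    items = fetch_queue()
    print(f"{len(items)} items in the queue")
    summarize(items)

Run against the master, something like this should at least show whether the queue is dominated by items waiting on a particular node label versus jobs blocked behind their own earlier runs.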