[ https://issues.apache.org/jira/browse/BEAM-8213?focusedWorklogId=318494&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-318494 ]
ASF GitHub Bot logged work on BEAM-8213:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 25/Sep/19 18:06
            Start Date: 25/Sep/19 18:06
    Worklog Time Spent: 10m
      Work Description: youngoli commented on issue #9642: [BEAM-8213] Split up monolithic python preCommit tests on jenkins
URL: https://github.com/apache/beam/pull/9642#issuecomment-535142584

> > The monolithic job already runs 4 out of 5 tasks in parallel - I do not see where 80% speedup will come from.
>
> I think this is just a communication error. We're saying that the sum of the running time of all the 5 jobs will be the same or more than the old monolithic job, but each individual job will be on average 1/5th the time of the old monolithic job. We're not saying that the sum of all 5 jobs will be faster.

What Valentyn means is that the monolithic job is already running the tests in parallel within one Jenkins slot. I didn't know that, but if that's the case then splitting the tests wouldn't make them finish faster, it would just continue running the tests in parallel, but using 5 Jenkins slots instead of 1.

> Increasing slots per worker may help, but there are some potentially heavy-weight tests, such as portable python precommit tests that bring up Flink, that may cause jenkins VMs to OOM if we run a lot of them in parallel on the same VM. I have heard a second-hand account that parallelizing portable precommit tests 4x on the same Jenkins worker caused OOMs, but did not verify it myself. Perhaps not an issue, but we need a reliable way to monitor Jenkins worker health / utilization to be confident.

Is there a way to distribute tests to workers with the lowest resource utilization? Or even better, have resource benchmarks for our various test suites so we can avoid sending resource-intensive tests to workers that don't have enough available resources?
I don't really know how Jenkins works so that might be a little advanced, but it would definitely avoid that problem and let us increase the number of slots in the workers without hitting resource limits.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 318494)
    Time Spent: 5.5h  (was: 5h 20m)

> Run and report python tox tasks separately within Jenkins
> ---------------------------------------------------------
>
>                 Key: BEAM-8213
>                 URL: https://issues.apache.org/jira/browse/BEAM-8213
>             Project: Beam
>          Issue Type: Improvement
>          Components: build-system
>            Reporter: Chad Dombrova
>            Priority: Major
>          Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> As a python developer, the speed and comprehensibility of the jenkins
> PreCommit job could be greatly improved.
> Here are some of the problems
> - when a lint job fails, it's not reported in the test results summary, so
> even though the job is marked as failed, I see "Test Result (no failures)"
> which is quite confusing
> - I have to wait for over an hour to discover the lint failed, which takes
> about a minute to run on its own
> - The logs are a jumbled mess of all the different tasks running on top of
> each other
> - The test results give no indication of which version of python they use. I
> click on Test results, then the test module, then the test class, then I see
> 4 tests named the same thing. I assume that the first is python 2.7, the
> second is 3.5 and so on. It takes 5 clicks and then reading the log output
> to know which version of python a single error pertains to, then I need to
> repeat for each failure.
> This makes it very difficult to discover problems,
> and deduce that they may have something to do with python version mismatches.
> I believe the solution to this is to split up the single monolithic python
> PreCommit job into sub-jobs (possibly using a pipeline with steps). This
> would give us the following benefits:
> - sub job results should become available as they finish, so for example,
> lint results should be available very early on
> - sub job results will be reported separately, and there will be a job for
> each py2, py35, py36 and so on, so it will be clear when an error is related
> to a particular python version
> - sub jobs without reports, like docs and lint, will have their own failure
> status and logs, so when they fail it will be more obvious what went wrong.
> I'm happy to help out once I get some feedback on the desired way forward.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
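
As a rough illustration of the timing argument in this thread, the sketch below compares when each task's result becomes visible if everything runs inside one monolithic job versus as separate sub-jobs. The task names and durations are hypothetical, chosen only to echo the "about a minute" lint run and the "over an hour" wait described in the issue. It also simplifies by assuming every task starts at time zero and runs fully in parallel, and it does not model how Jenkins actually schedules or reports jobs.

```python
# Hypothetical durations in minutes. Only the rough scale (lint ~1 minute,
# slowest task over an hour) is taken from the issue text; the rest is made up.
DURATIONS = {"lint": 1, "docs": 5, "py2": 65, "py35": 60, "py36": 62}


def monolithic_report_times(durations):
    """One job runs all tasks in parallel; results show up when the slowest ends."""
    finish = max(durations.values())
    return {task: finish for task in durations}


def split_report_times(durations):
    """Each task is its own job starting at time zero; it reports when it ends."""
    return dict(durations)


def summarize(durations):
    """Return (monolithic, split) pairs for a few quantities of interest."""
    mono = monolithic_report_times(durations)
    split = split_report_times(durations)
    return {
        # Time until the last result is available.
        "total_wall_clock": (max(mono.values()), max(split.values())),
        # Time until a lint failure would become visible.
        "lint_feedback": (mono["lint"], split["lint"]),
        # Assumed slot usage: one slot for the monolith, one per sub-job.
        "slots_used": (1, len(durations)),
    }


if __name__ == "__main__":
    for key, (mono, split) in summarize(DURATIONS).items():
        print(f"{key}: monolithic={mono} split={split}")
```

With these made-up numbers the overall wall-clock time is unchanged, while lint feedback moves from the end of the run to the one-minute mark, at the cost of occupying five slots instead of one.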