Introduction

For Apache Impala's (incubating) "ASF milestone 1", we need to make progress on the mega-task of having public-facing build and test infrastructure. It's not a requirement that we finish this for ASF milestone 1. For now, I propose we focus on researching public options available and presenting findings and conclusions. The full task is tracked at https://issues.cloudera.org/browse/IMPALA-3228

I'm looking for volunteers to help with this assessment. If you don't want to volunteer, can't volunteer, or aren't interested in the decisions ultimately made, you don't need to read the rest of this document.

Document Outline

This document is necessarily long. There is a lot to consider when choosing a public build/test provider, and it's better to list the important points clearly than to assume everyone is on the same page.

First, I prioritize the sorts of jobs we may eventually choose to make available to all committers in a public build/test infrastructure.

Second, I list the features that existing Apache Impala (incubating) build/test infrastructure jobs have. When I talk about "existing" infrastructure, I mean the infrastructure inside Cloudera, Inc., since to my knowledge that is all that exists in any formal sense for Apache Impala (incubating).

Third, I list additional requirements and features that have not been implemented but must be considered.

Fourth, I list potential public build and test service provider candidates and things to assess given the information provided in the earlier sections.

Fifth, I have a task list, for which volunteers may choose to sign up.

I. Job Priorities

These are listed so they can be kept in mind when reading Existing Build Environment Characteristics below.

First Priority (ASF Milestone 2)

1. Pre-commit verification job, to gate patch acceptance on the build's pass/fail status. Among the Apache Impala (incubating) dev community, this is colloquially known as "Gerrit verify merge" or "Gerrit verify only" (GVM, GVO).

Second Priority (future consideration)

1. Regular execution of exhaustive tests
2. Data load snapshot publication; will speed up builds, but not absolutely needed

Third Priority (future consideration)

Listed in no particular order of priority:

- Apache Impala (incubating) compiled with ASAN + tests
- compiled for release + tests
- Apache Impala (incubating) configured with legacy aggregations and joins + tests
- configured to run on a local filesystem + tests
- compiled for code coverage + tests
- Private builds (i.e., for testing changes but not merging or cherry-picking after passing)

Out of Scope

- Apache Impala (incubating) on S3 or Isilon, alternative filesystems and appliances within Cloudera internal network
- Anything that interacts with Cloudera, Inc. CDH clusters, like stress or performance
- Anything not otherwise included as part of any priority

II. Existing Build Environment Characteristics

Here, I try to list the characteristics of the internal Cloudera / Jenkins build environment. While it's likely that many providers' solutions also support most if not all of these features, it'd be good to get these written down. Assessors must consider these. These are in no particular order.

Soft / Administrative

- Anyone employed at Cloudera working on Apache Impala (incubating) can view or alter the jobs (promotes the idea that everyone can enhance the jobs and theoretically helps discourage de facto sysadmins or "experts")
- Those of us at Cloudera are not wholly on our own to maintain internal Jenkins: while we may change our jobs, Jenkins proper is administered by a separate group. If the entire Jenkins infrastructure goes down, they are on call to fix it.

Technical

- The ability to define job parameters (for job reuse)
- The ability to run builds in parallel (for efficiency/productivity)
- The ability to queue up build requests if there are not enough available resources to run the build immediately
- The ability to capture and display the contents of stderr / stdout (for quick failure triage/debugging)
- The ability to collect artifacts (for more detailed debugging / forensic analysis)
- Retention of some builds and artifacts up to a point (useful for binary search when bug hunting; "how did this work before?" investigations; etc.)
- Build triggers including time-based or event-based (needed if we ever want more than just a GVM/GVO job)
- Underlying GNU/Linux distribution with Bash and Python (to be able to bootstrap the so-called "toolchain", download requirements, and bootstrap a virtual environment)
- Underlying GNU/Linux distribution is supported by the Apache Impala (incubating) toolchain (to be able to compile the project)
- Provides passwordless sudo with no restrictions (Cloudera Jenkins provides this; whether this is a good thing is debatable, but it can come in handy if it's the only way to install additional packages, or if a job needs to modify a ulimit.)
- Configurable notification of pass/failure/etc. (helps with manual build triage)
- Obvious pass/failure status on some splash screen / dashboard (nice to see "state of the world" or "history of a build")
- Configurable automatic abort if the job appears stuck (hard to spot these, so it's nice to have some automatic process in place here)
- The ability to build the job in phases or "steps" (this allows some step after the build proper to run unconditionally, for example, even if some previous step fails)
- The ability to manage disk space (clean up after itself)
- SSH access granted to any committer (useful when forensic evidence is lacking or to look at a hung build)
- Can spin up slaves that satisfy Apache Impala's (incubating) disk and memory requirements, and have CPU such that full builds+tests take 4-12 hours. Note the time-to-execute range depends both on the compiler options chosen and on which tests are run.
- Can interact with Gerrit (https://gerrit.cloudera.org)

III. Additional Build Environment Requirements and Considerations

In no particular order, here we list additional requirements that we're not taking advantage of, but should. We also list requirements that take into account the public nature of Apache Impala (incubating). Assessors must consider these.

Soft / Administrative

- All committers should have equal access to the build environment infra
- Cloudera cannot expose internal services to the public
- Cloudera pays for Kudu's GCE public infra, but it's totally separate from Cloudera
- CloudStack is another ASF project using external build/test infra
- Not all of a project's build/test infra must be public. This is the case with Kudu. Note that the Kudu pre-commit job is crafted in such a way that it's a good gate for finding bugs.
- Potential hardware donations from Cloudera to ASF should be considered for all of ASF and not exclusively for Apache Impala (incubating).
  ASF frowns on donations for a specific project, and we should expect any donations to go into a generic resource pool for use by any ASF project.
- Separate external infra for Apache Impala (incubating) is borderline with ASF, but probably fine. The key is ensuring that if Cloudera (or any "main backer") were ever to pull funding, the project wouldn't be made homeless. This can be achieved via transparency on how the infra is maintained so that someone else can come in and do it. In our case, I think this can be satisfied by a combination of keeping our jobs in SCM (see Technical just below) and providing documentation for any surrounding administrivia (e.g., "Here's how to set up your SSH key to update the jobs on the provider").

Technical

- Modular way to build and maintain jobs via SCM, e.g., Jenkins job DSL or Jenkins Job Builder (see Notes below). Programmatically building our jobs and maintaining them that way means we don't have the problems of clone-edit proliferation, and it's simple to update a lot of jobs at once.
- Jobs can be staged as "test jobs" and tested before being incorporated into mainline.
- Jobs can easily be created for multiple branches, either feature branches or maintenance release branches.
- Infra is upgradeable (and not stuck on a 6-year-old version)
- System requirements: It's possible some of the public offerings are non-starters--or at least their free offerings are--because their systems' specs are inferior to Apache Impala's (incubating) system requirements. To that end, we need to get a reasonable ballpark of how much disk and memory we tend to use in our build and tests and, if we have less CPU than the EC2 instances available to those of us within Cloudera, what the cost in additional build and test time is. Note: Apache Impala (incubating) hardware requirements for CDH clusters are aggressive compared to the so-called "minicluster" (see Notes below).

IV. Public Build / Test Infra Offerings

Things to Assess

- What are the system specs of their free offering?
- What are the restrictions of the free offering (job, build cap; writable repo; etc.)?
- What is the cost of a paid offering, provided it has sufficient CPU, disk, and memory specs? Please clarify the unit (e.g., dollars per hour per build node).
- Do the public or paid offerings offer feature parity with the features in section II?
- Do the public or paid offerings make it possible to satisfy the requirements and considerations in section III?

Choices

This is not an exhaustive list. If people know or endorse others, speak up. If you suggest another, your consent to be chosen to be its assessor for this project is implied.

- ASF Jenkins: https://wiki.apache.org/general/Jenkins
- Travis: https://travis-ci.org/
- CloudBees: https://www.cloudbees.com/
- Something similar to Kudu's GCE setup (this requires extra research): http://104.196.14.100/
- Others? https://en.wikipedia.org/wiki/Comparison_of_continuous_integration_software

V. Immediate Task List

For any task that says "the more the better", please reply with your points. For any task that says "anyone", please reply to say you're taking it on.

- Quick audit of section II above to ensure I didn't miss anything
  needed: the more the better
- Quick audit of section III above to ensure I didn't miss anything
  needed: the more the better
- Share current/past experiences with any public build/infra service providers listed or not listed in section IV
  needed: the more the better
- Determine ballpark Apache Impala (incubating) build/test system requirements (this somewhat blocks the below and should be chosen sooner rather than later)
  needed: anyone
- Assess ASF Jenkins
  needed: anyone
- Assess Travis
  needed: anyone
- Assess CloudBees
  needed: anyone
- Research Kudu GCE setup (contact me about things to ask Kudu)
  needed: anyone

Thanks for reading.

References and Notes

- https://issues.cloudera.org/browse/IMPALA-3228
- http://jenkins.buildacloud.org/
- https://wiki.jenkins-ci.org/display/JENKINS/Job+DSL+Plugin
- http://docs.openstack.org/infra/jenkins-job-builder/
- http://www.cloudera.com/documentation/enterprise/latest/topics/impala_prereqs1.html#prereqs_hardware
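
Note on maintaining jobs via SCM: as a rough illustration of the section III point about building jobs programmatically rather than by clone-and-edit, the sketch below expands one template into a job per branch and build flavor. It is plain Python and does not use the Job DSL plugin or Jenkins Job Builder formats. The JobSpec fields, the job naming scheme, the branch names, the placeholder option strings, and the 12-hour abort timeout (taken from the upper end of the 4-12 hour range in section II) are all assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class JobSpec:
    """One generated job definition (hypothetical schema, not a provider format)."""
    name: str
    branch: str
    build_options: str
    timeout_hours: int


def generate_jobs(branches, flavors, timeout_hours=12):
    """Expand a single template into one job per (branch, flavor) pair.

    timeout_hours is an assumed automatic-abort limit, not a measured value.
    """
    jobs = []
    for branch in branches:
        # Sort flavors so the generated output is stable across runs.
        for flavor, options in sorted(flavors.items()):
            jobs.append(
                JobSpec(
                    name=f"impala-{branch}-{flavor}",
                    branch=branch,
                    build_options=options,
                    timeout_hours=timeout_hours,
                )
            )
    return jobs


# Hypothetical branch names and placeholder build options, for illustration only.
BRANCHES = ["master", "feature-x"]
FLAVORS = {
    "asan": "<asan build options>",
    "release": "<release build options>",
}

if __name__ == "__main__":
    # Emit one JSON document per job; a real setup would render these into
    # whatever format the chosen provider expects.
    for job in generate_jobs(BRANCHES, FLAVORS):
        print(json.dumps(asdict(job), sort_keys=True))
```

With this shape, adding a maintenance branch or a new build flavor is a one-line change to BRANCHES or FLAVORS, and the definitions can be reviewed and versioned like any other code.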
