I think this would be a major improvement for our nightly test coverage. Dare I ask, could we provision a fully distributed Kubernetes cluster for our testing?
Regardless, +1 from me on your proposal.

On Thu, Mar 10, 2022 at 5:32 PM Sean Busbey <bus...@apache.org> wrote:

> Hi folks!
>
> Quick background: all of the automated testing for nightly and PR
> contributions is now running on a dedicated Jenkins instance
> (ci-hbase.apache.org). We moved our existing 10 dedicated nodes off of
> the ci-hadoop controller, and thanks to a new anonymous donor we were
> able to add an additional 10 nodes.
>
> The new donor gave enough of a contribution that we can make some
> decisions as a community about expanding these resources further.
>
> The new 10 nodes run 2 executors each (same as our old nodes), have
> this shape, and are considered "medium" by the provider we're getting
> them from:
>
> 64 GB DDR4 ECC RAM
> Intel® Xeon® E-2176G hexa-core processor with Hyper-Threading (Coffee Lake)
> 2 x 960 GB NVMe SSD Datacenter Edition (RAID 1)
>
> To give an idea of what the current testing workload of our project
> looks like, we can use the built-in Jenkins utilization tooling for
> our general-purpose label 'hbase'[0].
>
> If you look at the last 4 days of utilization[1], we have a couple of
> periods with a small backlog of ~2 executors' worth of work. The
> measurements are very rolled up, so it's hard to tell specifics. On the
> chart of the last day or so[2] we can see two periods of 1-2 hours
> where we have a backlog of 2-4 executors' worth of work.
>
> For comparison, the chart from immediately after we had to burn off ~3
> days of backlog (because our worker nodes were offline back at the end
> of February) shows no queue[3].
>
> I think we could possibly benefit from adding 1-2 additional medium
> worker nodes, but the long periods where we have ~half our executors
> idle make me think some refactoring or timing changes would maybe be
> a better way to improve our current steady-state workload.
>
> One thing that we currently lack is robust integration testing of a
> cluster deployment. At the moment our nightly jobs spin up a test that
> makes a single-node version of Hadoop and then a single-node HBase on
> top of it. It then does a trivial functionality test[4].
>
> The host provider we use for Jenkins worker nodes has a large node
> shaped like:
>
> 160 GB RAM
> Intel® Xeon® W-2295 18-Core Cascade Lake W with Hyper-Threading
> 2 x 960 GB NVMe drives as RAID 1
>
> A pretty short path to improvement would be if we got 1 or 2 of these
> nodes and moved our integration test to use the minikube project[5] to
> run a local Kubernetes environment. We could then deploy a small but
> multi-node Hadoop and HBase cluster and run e.g. ITBLL against it, in
> addition to whatever checking of CLI commands, shell expectations,
> etc.
>
> What do y'all think?
>
> [0]: https://ci-hbase.apache.org/label/hbase/load-statistics
> [1]: https://issues.apache.org/jira/secure/attachment/13040941/ci-hbase-long-graph-20220310.png
> [2]: https://issues.apache.org/jira/secure/attachment/13040940/ci-hbase-medium-graph-20220310.png
> [3]: https://issues.apache.org/jira/secure/attachment/13040939/ci-hbase-medium-graph-20220223.png
> [4]: https://github.com/apache/hbase/blob/master/dev-support/hbase_nightly_pseudo-distributed-test.sh
> [5]: https://minikube.sigs.k8s.io/docs/
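For concreteness, here is a rough sketch of what the proposed minikube-based integration job could look like. This is only an illustration, not an existing dev-support script: the manifest names (hadoop.yaml, hbase.yaml), the `hbase-client` deployment, the label selectors, the node/resource sizing, and the ITBLL arguments are all assumptions I'm making for the sake of the example.

```shell
#!/usr/bin/env bash
# Sketch of a nightly multi-node integration job (illustrative only).
set -euo pipefail

# Start a local multi-node Kubernetes cluster: 1 control plane + 3 workers.
# Sizing is a guess at what would fit on one of the "large" nodes described above.
minikube start --nodes 4 --cpus 4 --memory 8g

# Deploy a small Hadoop cluster, then HBase on top of it.
# hadoop.yaml / hbase.yaml and the app= labels are hypothetical manifests.
kubectl apply -f hadoop.yaml
kubectl wait --for=condition=Ready pod -l app=hadoop --timeout=600s
kubectl apply -f hbase.yaml
kubectl wait --for=condition=Ready pod -l app=hbase --timeout=600s

# Run ITBLL against the deployed cluster from a client pod.
# The "loop" arguments here are placeholders; tune for the job's time budget.
kubectl exec deploy/hbase-client -- \
  hbase org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList loop 1 4 100000 /tmp/itbll 1

# Tear the whole environment down so the worker is clean for the next run.
minikube delete
```

The same wrapper could grow extra steps for CLI-command and shell-expectation checks before the teardown, mirroring what the current pseudo-distributed script[4] exercises on a single node.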