I think this would be a major improvement for our nightly test coverage.

Dare I ask, could we provision a fully distributed Kubernetes cluster for
our testing?

Regardless, +1 from me on your proposal.

On Thu, Mar 10, 2022 at 5:32 PM Sean Busbey <bus...@apache.org> wrote:

> Hi folks!
>
> Quick background: all of the automated testing for nightly and PR
> contributions is now running on a dedicated Jenkins instance
> (ci-hbase.apache.org). We moved our existing 10 dedicated nodes off the
> ci-hadoop controller, and thanks to a new anonymous donor we were able
> to add an additional 10 nodes.
>
> The new donor gave enough of a contribution that we can make some
> decisions as a community about expanding these resources further.
>
> The 10 new nodes run 2 executors each (same as our old nodes), are
> considered "medium" by the provider we're getting them from, and have
> this shape:
>
> 64GB DDR4 ECC RAM
> Intel® Xeon® E-2176G hexa-core (Coffee Lake) processor with Hyper-Threading
> 2 x 960 GB NVMe SSD Datacenter Edition (RAID 1)
>
> To give an idea of what the current testing workload of our project
> looks like, we can use the built-in Jenkins utilization tooling for
> our general-purpose label 'hbase'[0].
>
> If you look at the last 4 days of utilization[1], we have a couple of
> periods with a small backlog of ~2 executors' worth of work. The
> measurements are very rolled up, so it's hard to tell specifics. On the
> chart of the last day or so[2] we can see two periods of 1-2 hours
> where we have a backlog of 2-4 executors' worth of work.
>
> For comparison, the chart from immediately after we had to burn off ~3
> days of backlog (our worker nodes were offline at the end of February)
> shows no queue[3].
>
> I think we could possibly benefit from adding 1-2 additional medium
> worker nodes, but the long periods where we have ~half our executors
> idle make me think some refactoring or timing changes might be a
> better way to improve our current steady-state workload.
>
> One thing that we currently lack is robust integration testing of a
> cluster deployment. At the moment our nightly jobs spin up a test that
> makes a single-node version of Hadoop and then a single-node HBase on
> top of it. It then does a trivial functionality test[4].
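>
> Roughly, that test boils down to something like the sketch below (the
> paths and table names here are illustrative; the exact commands live in
> the script at [4]):
>
> # illustrative sketch only; see [4] for what the nightly script really runs
> # start a pseudo-distributed, single-node HDFS (Hadoop 3 daemon commands)
> "${HADOOP_HOME}/bin/hdfs" namenode -format
> "${HADOOP_HOME}/bin/hdfs" --daemon start namenode
> "${HADOOP_HOME}/bin/hdfs" --daemon start datanode
> # start a single-node HBase on top of that HDFS
> "${HBASE_HOME}/bin/start-hbase.sh"
> # trivial functionality check via the non-interactive shell
> echo "create 'nightly', 'f'; put 'nightly', 'r1', 'f:c', 'v'; scan 'nightly'" \
>   | "${HBASE_HOME}/bin/hbase" shell -n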
>
> The host provider we use for Jenkins worker nodes has a large node shaped
> like this:
> 160GB RAM
> Intel® Xeon® W-2295 18-core (Cascade Lake-W) processor with Hyper-Threading
> 2 x 960 GB NVMe drives (RAID 1)
>
> A pretty short path to improvement would be to get 1 or 2 of these
> nodes and move our integration test to use the minikube project[5] to
> run a local Kubernetes environment. We could then deploy a small but
> multi-node Hadoop and HBase cluster and run e.g. ITBLL against it, in
> addition to whatever checks of CLI commands, shell expectations, etc.
> we already do.
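>
> To make that concrete, the flow on one of those large nodes could look
> roughly like this (node counts, manifest names, pod labels, and the
> ITBLL arguments are assumptions for illustration, not a worked-out
> design):
>
> # sketch only: manifests and pod/label names below are hypothetical
> # bring up a local multi-node Kubernetes cluster on the worker
> minikube start --nodes 4 --cpus 4 --memory 8g
> # deploy small Hadoop and HBase clusters from manifests we would maintain
> kubectl apply -f hadoop-cluster.yaml
> kubectl apply -f hbase-cluster.yaml
> kubectl wait --for=condition=Ready pod -l app=hbase --timeout=600s
> # run ITBLL from a client pod against the deployed cluster
> # (loop args: <iterations> <mappers> <nodes per mapper> <output dir> <reducers>)
> kubectl exec hbase-client -- hbase org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList \
>     loop 1 4 1000000 /tmp/itbll 4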
>
> What do y'all think?
>
> [0]: https://ci-hbase.apache.org/label/hbase/load-statistics
> [1]:
> https://issues.apache.org/jira/secure/attachment/13040941/ci-hbase-long-graph-20220310.png
> [2]:
> https://issues.apache.org/jira/secure/attachment/13040940/ci-hbase-medium-graph-20220310.png
> [3]:
> https://issues.apache.org/jira/secure/attachment/13040939/ci-hbase-medium-graph-20220223.png
> [4]:
> https://github.com/apache/hbase/blob/master/dev-support/hbase_nightly_pseudo-distributed-test.sh
> [5]: https://minikube.sigs.k8s.io/docs/
>
