Hi all,

One problem we need to solve while developing the current load tests is that
we don't really know how many GCP/Jenkins resources we can occupy. We did
some initial testing with beam_Java_LoadTests_GroupByKey_Dataflow_Small [1]
and it seems that for:

- 1 000 000 000 (~23 GB) synthetic records
- a fanout of 10
- 10 Dataflow workers (--maxNumWorkers)

the total job time exceeds 4 hours. That seems too long for such a small load
test. Additionally, we plan to add much bigger tests for other core
operations as well; the proposal [2] describes only a few of them.
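For context, the worker cap above is just the standard Dataflow pipeline
option. A minimal sketch (not the actual load test code, the pipeline body is
omitted) of how that cap would be set programmatically via the public
Dataflow runner options:

    import org.apache.beam.runners.dataflow.DataflowRunner;
    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class WorkerCapSketch {
      public static void main(String[] args) {
        DataflowPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args)
                .withValidation()
                .as(DataflowPipelineOptions.class);
        options.setRunner(DataflowRunner.class);
        // The value under discussion: 10 today, possibly 32 or 64 later.
        options.setMaxNumWorkers(10);

        Pipeline pipeline = Pipeline.create(options);
        // ... build the GroupByKey + fanout pipeline here ...
        pipeline.run();
      }
    }

Raising the cap is a one-line change either here or via --maxNumWorkers on
the command line; the open question is what value is acceptable for our
shared project.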

The questions are:
1. How many workers can we assign to this job without starving the other
jobs? Would 32 workers for a single Dataflow job be fine? What about 64?
2. Given that we are going to add more and more load tests soon, do you think
it is a good idea to create a separate GCP project and separate Jenkins
workers for load-testing purposes only? This would avoid starving critical
tests (post-commits, pre-commits, etc.). Or is there another solution that
would provide such isolation? Is such isolation needed at all?

Regarding question 2: please note that we will also need to host Flink/Spark
clusters later on GKE/Dataproc (not decided yet).

[1]
https://builds.apache.org/view/A-D/view/Beam/view/All/job/beam_Java_LoadTests_GroupByKey_Dataflow_Small_PR/
[2] https://s.apache.org/load-test-basic-operations


Thanks,
Łukasz
