Agreeing with Robert on "what is it we're trying to test?" Would a smaller performance test find the same issues, faster and more reliably?
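On the 2x/4x idea below: if the data and the worker count both scale by a factor of n, wall-clock time should stay roughly flat, so a simple weak-scaling check is efficiency = T(1x) / T(nx); values well below 1 show where we stop being linear. (And as a rough answer to the aside: 1 000 000 000 records at ~23 bytes each is ~23 GB of input, so with the 10x fanout it is indeed ~230 GB flowing into the GroupByKey.)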
We have seen issues with the apache-beam-testing project exceeding quota during dataflow jobs, resulting in spurious failures during precommits and postcommits. 32 workers per dataflow job sounds fine, provided there are not too many concurrent dataflow jobs. Not all the tests have the number of workers limited, though, so I've seen some with ~80 workers. For non-performance tests, it would seem we should be able to drastically limit the number of workers (e.g. along the lines of the sketch at the bottom of this mail), which should provide more room for performance tests.

On Wed, Jan 23, 2019 at 7:10 AM Robert Bradshaw <[email protected]> wrote:
>
> I like the idea of creating separate project(s) for load tests so as
> to not compete with other tests and the standard development cycle.
>
> As for how many workers is too many, I would take the tack "what is
> it we're trying to test?" Unless you're stress-testing the shuffle
> itself, much of what Beam does is linearly parallelizable with the
> number of machines. Of course one will still want to run over real,
> large data sets, but not every load test needs this every time. More
> interesting could be to try running at 2x and 4x the data, with 2x
> and 4x the machines, and seeing where we fail to be linear.
>
> (As an aside, 4 hours x 10 workers seems like a lot for 23 GB of
> data... or is it 230 GB once you've fanned out?)
>
> On Wed, Jan 23, 2019 at 3:33 PM Łukasz Gajowy <[email protected]> wrote:
> >
> > Hi,
> >
> > pinging this thread (maybe some folks missed it). What do you think
> > about those concerns/ideas?
> >
> > Łukasz
> >
> > On Mon, Jan 14, 2019 at 17:11 Łukasz Gajowy <[email protected]> wrote:
> >>
> >> Hi all,
> >>
> >> one problem we need to solve while working on the load tests we are
> >> currently developing is that we don't really know how many GCP/Jenkins
> >> resources we can occupy. We did some initial testing with
> >> beam_Java_LoadTests_GroupByKey_Dataflow_Small [1] and it seems that for:
> >>
> >> - 1 000 000 000 (~23 GB) synthetic records
> >> - 10 fanouts
> >> - 10 dataflow workers (--maxNumWorkers)
> >>
> >> the total job time exceeds 4 hours. That seems like too much for such
> >> a small load test. Additionally, we plan to add much bigger tests for
> >> other core operations too; the proposal [2] describes only a few of
> >> them.
> >>
> >> The questions are:
> >> 1. How many workers can we assign to this job without starving the
> >> other jobs? Are 32 workers for a single Dataflow job fine? Would 64
> >> workers for such a job be fine too?
> >> 2. Given that we are going to add more and more load tests soon, do
> >> you think it is a good idea to create a separate GCP project + separate
> >> Jenkins workers for load testing purposes only? This would avoid
> >> starving critical tests (post-commits, pre-commits, etc.). Or maybe
> >> there is another solution that would bring such isolation? Is such
> >> isolation needed?
> >>
> >> Ad 2: Please note that we will also need to host Flink/Spark clusters
> >> later on GKE/Dataproc (not decided yet).
> >>
> >> [1] https://builds.apache.org/view/A-D/view/Beam/view/All/job/beam_Java_LoadTests_GroupByKey_Dataflow_Small_PR/
> >> [2] https://s.apache.org/load-test-basic-operations
> >>
> >> Thanks,
> >> Łukasz
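Re: limiting workers for non-performance tests, below is the kind of cap I had in mind. This is just a minimal sketch against the Java SDK's Dataflow pipeline options; the class name and the cap of 4 are illustrative, not a recommendation:

    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class CappedTestPipeline {
      public static void main(String[] args) {
        DataflowPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args)
                .withValidation()
                .as(DataflowPipelineOptions.class);
        // Keep ordinary (non-performance) test runs small so that
        // load tests have headroom within the project quota.
        options.setNumWorkers(1);     // initial worker count
        options.setMaxNumWorkers(4);  // autoscaling cap (illustrative value)
        Pipeline pipeline = Pipeline.create(options);
        // ... build the test pipeline as usual ...
        pipeline.run();
      }
    }

The same cap can of course be passed on the command line with --numWorkers=1 --maxNumWorkers=4, which is probably easier to wire into the Jenkins job definitions.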
