damccorm opened a new issue, #20403:
URL: https://github.com/apache/beam/issues/20403

   The following GBK streaming test cases take too long on Dataflow:
   
   1) 2GB of 10-byte records
   
   2) 2GB of 100-byte records
   
   4) fanout 4 times with 2GB of 10-byte records total
   
   5) fanout 8 times with 2GB of 10-byte records total
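   For reference, case 1 maps roughly onto the synthetic-source options that the load-test framework accepts as a JSON string. This is a sketch: the key/value split shown here is an illustrative assumption, and the authoritative values live in the Jenkins job definition linked below.
   
   ```python
   import json
   
   # Case 1: 2GB of 10-byte records. The 1-byte key / 9-byte value split
   # is an assumption for illustration; the real values are in the job file.
   input_options = {
       "num_records": 200_000_000,  # 2 * 10**9 bytes / 10 bytes per record
       "key_size": 1,
       "value_size": 9,
   }
   
   # The framework receives these options as a single JSON-string flag.
   flag = "--input_options=" + json.dumps(input_options)
   print(flag)
   ```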
   
   Each of these takes at least an hour to run, which is far too long for a 
single Jenkins job.
   
   Job definition: 
[https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_LoadTests_GBK_Python.groovy](https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_LoadTests_GBK_Python.groovy)
   
   Test pipeline: 
[https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/load_tests/group_by_key_test.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/load_tests/group_by_key_test.py)
   
   It is likely that these cases are too extreme. The first two involve 
grouping 20M unique keys, which is a stressful operation for the runner. One 
solution would be to rework the cases so that they are less demanding.
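   The 20M figure follows from simple division for the 100-byte case, assuming decimal gigabytes and one unique key per record (both assumptions for this sketch, not taken from the job config):
   
   ```python
   # Case 2: 2GB of 100-byte records -> ~20M records, hence ~20M groups
   # for the GBK stage if each record carries a unique key (assumption).
   total_bytes = 2 * 10 ** 9  # decimal gigabytes assumed
   record_size = 100
   num_records = total_bytes // record_size
   print(num_records)  # 20000000
   ```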
   
   Both the current production Dataflow runner and the new Dataflow Runner V2 
were tested.
   
   Imported from Jira 
[BEAM-10774](https://issues.apache.org/jira/browse/BEAM-10774). Original Jira 
may contain additional context.
   Reported by: kamilwu.

