Abacn opened a new issue, #25944: URL: https://github.com/apache/beam/issues/25944
### What needs to happen?

In a Python IO performance test, generating the synthetic source turned out to be as expensive as writing to the sink (https://github.com/apache/beam/issues/19084#issuecomment-1343373709). This prevents the benchmark from reporting accurate performance data.

I ran a pipeline with the synthetic source only, and the cloud profile shows:

- assigning the random seed alone costs 30% of total CPU time
- generating bytes costs 20% of total CPU time

<img width="1695" alt="image" src="https://user-images.githubusercontent.com/8010435/227224125-846dcd7b-7afa-4aa7-b6ff-689a4f3782ad.png">

This is because Python's built-in random generator uses a Mersenne Twister with a fairly large state ([doc](https://docs.python.org/3/library/random.html)), so assigning a seed is slow. Generating bytes is also slow because it involves many memory allocations. In contrast, Java's built-in random generator (used by the Java SDK's synthetic source) uses a linear congruential generator (LCG) as described by Donald Knuth ([doc](https://docs.oracle.com/javase/8/docs/api/java/util/Random.html)), which is much faster.

I compared the built-in random bytes generation against a cythonized LCG implementation, generating 1M random byte strings of 1024 bytes each. The latter shows a more than 10x performance gain (run time 10 s vs. < 1 s). This doubles the performance of the synthetic pipeline.

We should be able to switch to the LCG. Once this is done, the Python synthetic pipeline has a minimal cost for generating the bytes themselves and can then be used to benchmark the performance of SDF.
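To illustrate why seeding an LCG is cheap, here is a minimal pure-Python sketch of a 48-bit LCG using the multiplier and addend documented for `java.util.Random`. The class name `LcgBytesGenerator` and its methods are hypothetical and are not the cythonized implementation mentioned above. Pure Python will not reproduce the reported speedup; the sketch only shows that the whole state is one integer, so reseeding is a single XOR and mask.

```python
# Constants documented for java.util.Random (48-bit linear congruential generator).
MULTIPLIER = 0x5DEECE66D
ADDEND = 0xB
MASK = (1 << 48) - 1


class LcgBytesGenerator:
    """Hypothetical helper: a 48-bit LCG that emits pseudo-random bytes."""

    def __init__(self, seed: int) -> None:
        self.seed(seed)

    def seed(self, seed: int) -> None:
        # The entire generator state is one 48-bit integer, so seeding is O(1).
        self._state = (seed ^ MULTIPLIER) & MASK

    def _next_u32(self) -> int:
        # One LCG step, then keep the upper 32 of the 48 state bits.
        self._state = (self._state * MULTIPLIER + ADDEND) & MASK
        return self._state >> 16

    def rand_bytes(self, length: int) -> bytes:
        # Fill a preallocated buffer four bytes at a time (little-endian words).
        out = bytearray(length)
        i = 0
        while i < length:
            chunk = self._next_u32().to_bytes(4, "little")
            n = min(4, length - i)
            out[i:i + n] = chunk[:n]
            i += n
        return bytes(out)


if __name__ == "__main__":
    gen = LcgBytesGenerator(seed=12345)
    record = gen.rand_bytes(1024)
    print(len(record))
```

A compiled version (for example via Cython, as in the comparison above) would presumably keep the state in a fixed-width integer and write directly into a preallocated buffer, which avoids the per-call allocations that make the built-in path slow.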
### Issue Priority

Priority: 3 (nice-to-have improvement)

### Issue Components

- [X] Component: Python SDK
- [ ] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [ ] Component: IO connector
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Spark Runner
- [ ] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [ ] Component: Google Cloud Dataflow Runner

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
