Abacn opened a new issue, #25944:
URL: https://github.com/apache/beam/issues/25944

   ### What needs to happen?
   
   A Python IO performance test showed that generating the synthetic source 
costs about as much as writing to the sink 
(https://github.com/apache/beam/issues/19084#issuecomment-1343373709). This 
prevents the benchmark from reporting accurate performance data.
   
   I ran a pipeline with only the synthetic source. The cloud profile shows:
   
   - assigning the random seed alone costs 30% of total CPU time
   - generating bytes costs 20% of total CPU time
   
   <img width="1695" alt="image" 
src="https://user-images.githubusercontent.com/8010435/227224125-846dcd7b-7afa-4aa7-b6ff-689a4f3782ad.png">
    
   This is because Python's built-in random generator uses a Mersenne Twister 
with a fairly large state ([doc](https://docs.python.org/3/library/random.html)), 
so assigning a seed is slow. Generating bytes is also slow because it involves 
many memory allocations. In contrast, Java's built-in random generator (used by 
the Java SDK's synthetic source) uses a linear congruential generator (LCG) as 
described by Donald Knuth 
([doc](https://docs.oracle.com/javase/8/docs/api/java/util/Random.html)), 
which is much faster.
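
For illustration, a minimal pure-Python sketch of the kind of LCG described in the `java.util.Random` documentation. The multiplier, increment, and 48-bit mask are taken from that documentation; the class and method names are hypothetical, and this is not the cythonized implementation measured in this issue. Seeding only touches a single 48-bit integer, which is why it is cheap compared with reinitializing a Mersenne Twister state.

```python
# Constants as documented for java.util.Random.
MULTIPLIER = 0x5DEECE66D
INCREMENT = 0xB
MASK = (1 << 48) - 1  # the generator keeps 48 bits of state


class JavaStyleLcg:
    """Hypothetical helper: a 48-bit LCG with a one-integer state."""

    def __init__(self, seed):
        self.seed(seed)

    def seed(self, seed):
        # Seeding is a single XOR and mask, with no large state to fill.
        self._state = (seed ^ MULTIPLIER) & MASK

    def next_bits(self, bits):
        # Advance the state and return its top `bits` bits (bits <= 32).
        self._state = (self._state * MULTIPLIER + INCREMENT) & MASK
        return self._state >> (48 - bits)

    def next_bytes(self, size):
        # Fill one preallocated buffer, four bytes per 32-bit draw,
        # low-order byte first.
        out = bytearray(size)
        i = 0
        while i < size:
            word = self.next_bits(32)
            for _ in range(min(4, size - i)):
                out[i] = word & 0xFF
                word >>= 8
                i += 1
        return bytes(out)
```

A pure-Python loop like this would not by itself be fast; the gain reported in this issue comes from a cythonized version. The sketch is only meant to show how small the state and the per-record seeding step are.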
   
   I compared the performance of built-in random bytes generation with a 
cythonized LCG implementation, generating 1M random byte strings of 1024 bytes 
each. The latter shows a more than 10x performance gain (run time 10 s vs. 
< 1 s). This doubles the performance of the synthetic pipeline. We should be 
able to switch to the LCG.
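
As a rough way to reproduce the built-in side of such a comparison, the sketch below times per-record seeding and byte generation separately using the standard library `random.Random`. The function name and parameters are hypothetical, and the synthetic source may use a different generator API, so the absolute numbers are only indicative.

```python
import random
import time


def time_builtin_records(num_records, record_size=1024):
    """Hypothetical harness: returns (seed_seconds, bytes_seconds)."""
    rng = random.Random()
    seed_time = 0.0
    bytes_time = 0.0
    for index in range(num_records):
        t0 = time.perf_counter()
        rng.seed(index)  # reseed per record, as a synthetic source would
        t1 = time.perf_counter()
        # Draw record_size random bytes from the Mersenne Twister.
        payload = rng.getrandbits(8 * record_size).to_bytes(record_size, "little")
        t2 = time.perf_counter()
        assert len(payload) == record_size
        seed_time += t1 - t0
        bytes_time += t2 - t1
    return seed_time, bytes_time


if __name__ == "__main__":
    seed_s, bytes_s = time_builtin_records(100_000)
    print(f"seed: {seed_s:.3f} s, bytes: {bytes_s:.3f} s")
```

Timing the two steps separately makes it possible to see how much of the cost comes from seeding alone, which the profile above attributes 30% of CPU time to.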
   
   Once this is done, the Python synthetic pipeline incurs only the minimal 
cost of generating the bytes themselves and can then be used to benchmark the 
performance of SDF.
   
   
   
   
   
   ### Issue Priority
   
   Priority: 3 (nice-to-have improvement)
   
   ### Issue Components
   
   - [X] Component: Python SDK
   - [ ] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner

