Abacn commented on issue #19084: URL: https://github.com/apache/beam/issues/19084#issuecomment-1343373709
The performance of WindowInto may be worth investigating: I noticed that Python text IO write performs worse than the Java SDK, and the slowest DoFn is WindowInto(GlobalWindows()).

Java metrics: http://104.154.241.245/d/bnlHKP3Wz/java-io-it-tests-dataflow?orgId=1&viewPanel=4
Python metrics: http://104.154.241.245/d/gP7vMPqZz/python-io-it-tests-dataflow?orgId=1&viewPanel=5

Java Read ~20s; Java Write ~30s; Python Read ~100s; Python Write 270s

Two noticeable differences from the job graph:

- Record generation in Python is much slower than Java's GenerateSequence. apache_beam.testing.synthetic_pipeline.SyntheticSource is slow (wall time 14 min 10 sec). Nevertheless, this part does not count toward the write_time metric.
- apache_beam.transforms.core.WindowIntoFn is slow (wall time 14 min 49 sec).

The Java write pipeline graph looks like this:

<img width="248" alt="image" src="https://user-images.githubusercontent.com/8010435/206569366-7396666f-eab3-4b78-844b-9af5669e4e78.png">

The Python write pipeline graph looks like this:

<img width="244" alt="image" src="https://user-images.githubusercontent.com/8010435/206569230-f6a84976-257d-4e44-9aa4-c9a94ce30cb1.png">
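For a sense of scale, the read and write timings quoted above can be turned into rough slowdown factors. This is only a back-of-the-envelope sketch: the numbers are the approximate values from the comment, and the names `timings_s` and `slowdown` are hypothetical helpers introduced for illustration, not part of Beam.

```python
# Approximate timings quoted in the comment, in seconds.
timings_s = {
    "read": {"java": 20, "python": 100},
    "write": {"java": 30, "python": 270},
}


def slowdown(python_s, java_s):
    """Return how many times slower the Python run is than the Java run."""
    return python_s / java_s


# Slowdown factor per operation, computed from the quoted timings.
ratios = {op: slowdown(t["python"], t["java"]) for op, t in timings_s.items()}

for op, ratio in ratios.items():
    print(f"{op}: Python is about {ratio:.0f}x slower than Java")
```

By these figures the gap on the write path (about 9x) is noticeably larger than on the read path (about 5x), which is consistent with looking at the write-side transforms such as WindowInto first.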
