cozos commented on issue #23932: URL: https://github.com/apache/beam/issues/23932#issuecomment-1331361408
At least on the RDD-based Spark runner (i.e. not Dataset/Structured Streaming):

> What is the purpose of SDK service?

All Beam pipelines are converted into a Java Spark RDD pipeline. If you write your DoFns in Python, the Java RDDs cannot execute your Python code directly. The SDK harness contains your Python environment, and Spark executes your Python logic there.

> Does it mean that each Spark worker node should have its own Beam SDK service?

Spark workers communicate with SDK harnesses via the gRPC Fn API. It's better to deploy them on the same host as the Spark worker in order to minimize network IO, since data has to be sent back and forth between the worker and the SDK harness for processing. You can deploy them on the same node as a Docker container or as a process (see the `--environment_type` option). However, `--environment_type EXTERNAL` has its own advantages, as the SDK harness does not have to share resources (such as CPU and memory) with the Spark worker.
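
For reference, here is a minimal sketch (not from the original discussion) of how `--environment_type` is typically passed when submitting a Python pipeline to the portable Spark runner. The job server address and the worker-pool address below are placeholder assumptions; adjust them for your deployment.

```python
# Sketch: submitting a Python pipeline to the portable Spark runner.
# Assumes a Spark job server is reachable at localhost:8099 and, for
# EXTERNAL mode, a pre-started SDK worker pool at localhost:50000.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",         # Spark job server (assumed address)
    "--environment_type=EXTERNAL",           # SDK harness managed outside Beam
    "--environment_config=localhost:50000",  # worker-pool service address (assumed)
])

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(["hello", "beam"])
     | beam.Map(str.upper)                   # Python logic runs in the SDK harness
     | beam.Map(print))
```

With `--environment_type DOCKER` (the default), Beam starts the SDK harness container on the same node as the Spark worker instead, and `--environment_config` is not needed.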
