cozos commented on issue #23932:
URL: https://github.com/apache/beam/issues/23932#issuecomment-1331361408

   At least on the RDD-based Spark runner (i.e. not Dataset/Structured 
Streaming): 
   > What is the purpose of SDK service? 
   Every Beam pipeline is translated into a Java Spark RDD pipeline. If your 
DoFns are written in Python, the Java RDDs cannot execute that Python code 
directly, so the SDK Harness hosts your Python environment and Spark sends 
elements over to it so your Python logic runs there.
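   To make that concrete, here is a minimal sketch (pipeline and names are 
illustrative, not taken from this issue). On the portable Spark runner, the 
`process` method below executes inside the Python SDK Harness, while Spark 
itself only shuttles elements back and forth over the Fn API:

```python
import apache_beam as beam

class ExtractWords(beam.DoFn):
    def process(self, line):
        # Pure Python code: the Spark JVM cannot run this directly,
        # so the SDK Harness executes it.
        yield from line.split()

with beam.Pipeline() as p:
    (p
     | "Create" >> beam.Create(["hello beam on spark", "portable runner"])
     | "Split" >> beam.ParDo(ExtractWords())
     | "Print" >> beam.Map(print))
```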
   
   > does it mean that each spark worker node should have it's own beam sdk 
service?
   Spark workers communicate with SDK Harnesses via the gRPC Fn API. It is 
better to deploy the harnesses on the same host as the Spark worker to 
minimize network I/O, since data has to be sent back and forth between the 
worker and the SDK Harness for processing. You can deploy them on the same 
node either as a Docker container or as a local process (see the 
`--environment_type` option). However, `--environment_type EXTERNAL` has its 
own advantages: the SDK Harness does not have to share resources such as CPU 
and memory with the Spark worker. 
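   A hedged sketch of the two configurations follows; the job server endpoint 
and worker-pool address are placeholder assumptions for illustration, not 
values from this issue:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# SDK Harness started as a Docker container alongside each Spark worker.
docker_options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",   # Spark job server address (assumed)
    "--environment_type=DOCKER",
])

# SDK Harness as an externally managed worker pool; --environment_config
# points at a pool you deploy yourself, e.g. started with
#   python -m apache_beam.runners.worker.worker_pool_main --service_port=50000
# (the address below is an assumption).
external_options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    "--environment_type=EXTERNAL",
    "--environment_config=localhost:50000",
])
```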

