VictorPlusC commented on a change in pull request #16741:
URL: https://github.com/apache/beam/pull/16741#discussion_r807423696
##########
File path: sdks/python/apache_beam/runners/interactive/dataproc/dataproc_cluster_manager.py
##########
@@ -124,10 +131,23 @@ def create_flink_cluster(self) -> None:
         'config': {
             'software_config': {
                 'optional_components': ['DOCKER', 'FLINK']
+            },
+            'gce_cluster_config': {
+                'metadata': {
+                    'flink-start-yarn-session': 'true'
+                },
+                'service_account_scopes': [
+                    'https://www.googleapis.com/auth/cloud-platform'
+                ]
+            },
+            'endpoint_config': {
+                'enable_http_port_access': True
             }
         }
     }
     self.create_cluster(cluster)
+    self.master_url = self.get_master_url(
+        self.master_url_identifier, default=False)
Review comment:
Quite a few things have changed in this PR, so here's the updated answer:

The master_url and its identifier (which we now call 'cluster_metadata')
serve two use cases:
- At pipeline run time with the FlinkRunner, we detect some information: the
default cluster name from the Interactive Environment, and the project and
region from the pipeline options. We use this information to build the
identifier and check, via the bidirectional mapping, whether it corresponds
to an existing master_url. If it does, we skip the cluster creation stage and
start using that cluster.
- We now construct each DataprocClusterManager with the identifier. If the
cluster_metadata corresponds to an existing master_url, we assign that value
to the DataprocClusterManager at instantiation, without having to search for
it again. We have this in place because we want each pipeline mapped to a
separate instance of DataprocClusterManager.
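
To make the lookup above concrete, here is a rough sketch of the idea. It is
not the code in this PR: `ClusterMetadata`'s fields, `ClusterRegistry`,
`ClusterManagerSketch`, `ensure_cluster`, and the placeholder URL format are
all hypothetical names chosen for illustration, and no real Dataproc call is
made.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ClusterMetadata:
    """Hypothetical identifier built from project, region and cluster name."""
    project_id: str
    region: str
    cluster_name: str


class ClusterRegistry:
    """Hypothetical bidirectional mapping: identifier <-> master_url."""

    def __init__(self) -> None:
        self._url_by_metadata: Dict[ClusterMetadata, str] = {}
        self._metadata_by_url: Dict[str, ClusterMetadata] = {}

    def register(self, metadata: ClusterMetadata, master_url: str) -> None:
        # Keep both directions in sync.
        self._url_by_metadata[metadata] = master_url
        self._metadata_by_url[master_url] = metadata

    def master_url_for(self, metadata: ClusterMetadata) -> Optional[str]:
        return self._url_by_metadata.get(metadata)

    def metadata_for(self, master_url: str) -> Optional[ClusterMetadata]:
        return self._metadata_by_url.get(master_url)


class ClusterManagerSketch:
    """Stand-in for a per-pipeline manager; not the real class."""

    def __init__(
        self, metadata: ClusterMetadata, registry: ClusterRegistry) -> None:
        self.metadata = metadata
        self.registry = registry
        # Resolve the master_url once, at instantiation, if it is known.
        self.master_url = registry.master_url_for(metadata)

    def ensure_cluster(self) -> bool:
        """Returns True if a cluster was 'created', False if one was reused."""
        if self.master_url is not None:
            return False  # Known cluster: skip the creation stage.
        # Placeholder for real cluster creation; the URL format is made up.
        self.master_url = 'placeholder://' + self.metadata.cluster_name
        self.registry.register(self.metadata, self.master_url)
        return True
```

With this shape, a second manager built from an equal identifier picks up the
existing master_url in its constructor and `ensure_cluster()` returns False,
while each pipeline still gets its own manager instance.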