rohdesamuel commented on a change in pull request #16741:
URL: https://github.com/apache/beam/pull/16741#discussion_r807302390
##########
File path: sdks/python/apache_beam/runners/interactive/dataproc/dataproc_cluster_manager.py
##########
@@ -65,13 +66,18 @@ def __init__(
cluster_name)
self._cluster_name = cluster_name
else:
- self._cluster_name = self.DEFAULT_NAME
+ self._cluster_name = ie.current_env().clusters.default_cluster_name
Review comment:
I like how users can set the default cluster name, but the way it is used
here feels awkward, specifically because its primary use is deciding whether
or not to fail when creating a new cluster (i.e. default names can be reused,
custom names cannot). My advice is to parameterize this behavior instead
(preferably as a function parameter); see the sketch below. We also want to
minimize our use of global variables; otherwise, we can get into very weird
states.
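A minimal sketch of that suggestion, assuming a hypothetical reuse_existing
constructor parameter (the name and signature are illustrative, not taken
from the PR):

```python
class DataprocClusterManager:
  """Illustrative fragment: the caller states up front whether reuse is OK."""
  def __init__(
      self,
      project_id: str,
      region: str,
      cluster_name: str,
      reuse_existing: bool = False):
    # The reuse policy is passed in explicitly, so this class no longer has
    # to compare cluster_name against a globally configured default.
    self._cluster_name = cluster_name
    self._reuse_existing = reuse_existing
```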
##########
File path: sdks/python/apache_beam/runners/interactive/interactive_environment.py
##########
@@ -549,6 +589,50 @@ def set_cached_source_signature(self, pipeline, signature):
def get_cached_source_signature(self, pipeline):
return self._cached_source_signature.get(str(id(pipeline)), set())
+ def set_dataproc_cluster_manager(self, pipeline):
Review comment:
In setters it's best to pass in the object that should be set. Can you
please move the construction of the cluster manager out of this method? A
rough sketch of what I mean follows below.
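For illustration, one possible shape of the setter once construction moves to
the call site (a method fragment only, not the PR's code; it assumes the
caller already built the DataprocClusterManager from the pipeline's options):

```python
# Sketch of a method on the interactive environment (fragment only).
def set_dataproc_cluster_manager(self, pipeline, cluster_manager):
  """Stores an already-constructed DataprocClusterManager for the pipeline."""
  # Construction (reading project/region options, choosing a cluster name)
  # happens at the call site; the environment only records the association.
  self.clusters._dataproc_cluster_managers[str(id(pipeline))] = cluster_manager
```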
##########
File path: sdks/python/apache_beam/runners/interactive/interactive_environment.py
##########
@@ -549,6 +589,50 @@ def set_cached_source_signature(self, pipeline, signature):
def get_cached_source_signature(self, pipeline):
return self._cached_source_signature.get(str(id(pipeline)), set())
+ def set_dataproc_cluster_manager(self, pipeline):
+ """Sets the instance of DataprocClusterManager to be used by the
+ pipeline.
+ """
+ if self._is_in_ipython:
+ warnings.filterwarnings(
+ 'ignore',
+ 'options is deprecated since First stable release. References to '
+ '<pipeline>.options will not be supported',
+ category=DeprecationWarning)
+ project_id = (pipeline.options.view_as(GoogleCloudOptions).project)
+ region = (pipeline.options.view_as(GoogleCloudOptions).region)
+ cluster_name = self.clusters.default_cluster_name
+ cluster_manager = DataprocClusterManager(
+ project_id=project_id, region=region, cluster_name=cluster_name)
+ self.clusters._dataproc_cluster_managers[str(id(pipeline))] = cluster_manager
+
+ def get_dataproc_cluster_manager(self, pipeline):
+ """Gets the instance of DataprocClusterManager currently used by the
+ pipeline.
+ """
+ return self.clusters._dataproc_cluster_managers.get(str(id(pipeline)), None)
+
+ def evict_dataproc_cluster_manager(self, pipeline):
+ """Evicts and pops the instance of DataprocClusterManager that is currently
+ used by the pipeline. Noop if the given pipeline is absent from the
+ environment or if the DataprocClusterManager instance is being used by
+ another pipeline. If no pipeline is specified, evicts for all pipelines.
+ """
+ if pipeline:
+ cluster_manager = self.clusters._dataproc_cluster_managers.pop(
+ str(id(pipeline)), None)
+ if cluster_manager:
+ master_url = cluster_manager.master_url
+ if len(self.clusters.get_pipelines_using_master_url( \
+ master_url)) == 1:
+ del self.clusters._master_urls[master_url]
+ del self.clusters._master_urls_to_pipelines[master_url]
+ return
+ self.clusters._dataproc_cluster_managers.clear()
+ self.clusters._master_urls.clear()
+ self.clusters._master_urls.inverse.clear()
Review comment:
By clearing the inverse dict directly you're leaking the implementation. This
puts the responsibility of knowing how to properly clear the object on the
caller, which can easily lead to bugs. My suggestion is to override the
clear() method in the bidict so that it clears both maps; see the sketch
below.
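A rough sketch of that idea, using a minimal hand-rolled two-way map rather
than the bidict implementation actually used in the PR (names are
illustrative):

```python
class TwoWayDict(dict):
  """Minimal illustrative bidirectional map; not the PR's bidict."""
  def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    # Inverse view kept alongside the forward mapping.
    self.inverse = {value: key for key, value in self.items()}

  def __setitem__(self, key, value):
    super().__setitem__(key, value)
    self.inverse[value] = key

  def clear(self):
    # Clearing the forward map also clears the inverse, so callers never
    # need to reach into .inverse themselves.
    super().clear()
    self.inverse.clear()
```

With clear() overridden like this, the eviction path above would only need
self.clusters._master_urls.clear().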
##########
File path: sdks/python/apache_beam/runners/interactive/dataproc/dataproc_cluster_manager.py
##########
@@ -124,10 +131,23 @@ def create_flink_cluster(self) -> None:
'config': {
'software_config': {
'optional_components': ['DOCKER', 'FLINK']
+ },
+ 'gce_cluster_config': {
+ 'metadata': {
+ 'flink-start-yarn-session': 'true'
+ },
+ 'service_account_scopes': [
+ 'https://www.googleapis.com/auth/cloud-platform'
+ ]
+ },
+ 'endpoint_config': {
+ 'enable_http_port_access': True
}
}
}
self.create_cluster(cluster)
+ self.master_url = self.get_master_url(
+ self.master_url_identifier, default=False)
Review comment:
What are the master_url and the identifier used for? Right now they're used
for logging purposes; will they be used for something else in the future?
##########
File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
##########
@@ -329,6 +359,101 @@ def record(self, pipeline):
return recording_manager.record_pipeline()
+class Clusters():
+ """An interface for users to modify the pipelines that are being run by the
+ Interactive Environment.
+
+ Methods of the Interactive Beam Clusters class can be accessed via:
+ interactive_beam.clusters
+
+ Example of calling the Interactive Beam clusters describe method::
+ interactive_beam.clusters.describe()
+ """
+ def __init__(self) -> None:
+ """Instantiates default values for Dataproc cluster interactions.
+ """
+ self._default_cluster_name = 'interactive-beam-cluster'
+ self._master_urls = bidict()
+ self._dataproc_cluster_managers = {}
+ self._master_urls_to_pipelines = {}
+
+ def describe(self, pipeline: Optional[beam.Pipeline]=None) -> dict:
+ """Returns a description of the cluster associated to the given pipeline.
+
+ If no pipeline is given then this returns a dictionary of descriptions for
+ all pipelines.
+ """
+
+ description = ie.current_env().describe_all_clusters()
+ if pipeline:
+ return description.get(pipeline, None)
+ return description
+
+ @property
+ def default_cluster_name(self) -> str:
+ """The default name to be used when creating Dataproc clusters.
+
+ Defaults to 'interactive-beam-cluster'.
+ """
+ return self._default_cluster_name
+
+ @default_cluster_name.setter
+ def default_cluster_name(self, value: str) -> None:
+ """Sets the default name to be used when creating Dataproc clusters.
+
+ Defaults to 'interactive-beam-cluster'.
+
+ Example of assigning a default_cluster_name::
+ interactive_beam.clusters.default_cluster_name = 'my-beam-cluster'
+ """
+ self._default_cluster_name = value
+
+ def cleanup(self, pipeline: beam.Pipeline, forcefully=False) -> None:
Review comment:
s/forcefully/force. Generally, APIs use "force" to indicate this (see the
sketch below).
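For example, the signature would then read (sketch only, mirroring the diff
above):

```python
# Assumes `import apache_beam as beam`, as in the module above.
def cleanup(self, pipeline: beam.Pipeline, force: bool = False) -> None:
  ...
```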