[ 
https://issues.apache.org/jira/browse/BEAM-14332?focusedWorklogId=765729&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-765729
 ]

ASF GitHub Bot logged work on BEAM-14332:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 03/May/22 22:30
            Start Date: 03/May/22 22:30
    Worklog Time Spent: 10m 
      Work Description: rohdesamuel commented on code in PR #17402:
URL: https://github.com/apache/beam/pull/17402#discussion_r864307237


##########
sdks/python/apache_beam/runners/interactive/testing/mock_env.py:
##########
@@ -0,0 +1,90 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of mocks to isolate the test environment for each Interactive Beam
+test.
+"""
+
+import unittest
+import uuid
+from unittest.mock import patch
+
+from apache_beam.runners.interactive import interactive_environment as ie
+from apache_beam.runners.interactive.dataproc.dataproc_cluster_manager import DataprocClusterManager
+from apache_beam.runners.interactive.interactive_environment import InteractiveEnvironment
+from apache_beam.runners.interactive.testing.mock_ipython import mock_get_ipython
+
+
+def isolated_env(cls: unittest.TestCase):

Review Comment:
   Maybe instead of a decorator, have our tests subclass this
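The reviewer's subclass suggestion could look roughly like this (a sketch only; the base-class name and patch target are hypothetical stand-ins, not Beam's actual `mock_env` API, which would patch the interactive environment and IPython instead):

```python
import os
import unittest
from unittest.mock import patch


class IsolatedEnvTestCase(unittest.TestCase):
  """Hypothetical base class: each test method runs with its environment
  patched in setUp, and every patch is undone afterwards, so no state
  leaks between tests."""

  def setUp(self):
    # Illustrative patch target only; a real mock_env base class would
    # patch ie.current_env() / get_ipython() here instead.
    patcher = patch.dict(os.environ, {'ISOLATED_ENV': '1'})
    patcher.start()
    # addCleanup guarantees the patch is undone even if a subclass's
    # setUp or the test body raises.
    self.addCleanup(patcher.stop)


class ExampleTest(IsolatedEnvTestCase):
  def test_sees_isolated_env(self):
    self.assertEqual(os.environ['ISOLATED_ENV'], '1')
```

Tests then inherit the isolation by subclassing rather than by remembering to apply a decorator to each test class.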





Issue Time Tracking
-------------------

    Worklog Id:     (was: 765729)
    Time Spent: 50m  (was: 40m)

> Improve the workflow of cluster management for Flink on Dataproc
> ----------------------------------------------------------------
>
>                 Key: BEAM-14332
>                 URL: https://issues.apache.org/jira/browse/BEAM-14332
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-py-interactive
>            Reporter: Ning
>            Assignee: Ning
>            Priority: P2
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> Improve the workflow of cluster management.
> There is an option to configure a default [cluster name|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/interactive_beam.py#L366]. The existing user flows are:
>  # Use the default cluster name to create a new cluster if none is in use;
>  # Reuse a created cluster that has the default cluster name;
>  # If the default cluster name is configured to a new value, re-apply 1 and 2.
>  A better solution is to:
>  # Create a new cluster implicitly if there is none, or explicitly if the user wants one with specific provisioning;
>  # Always default to using the last created cluster.
>  The reasons are:
>  * A cluster name is meaningless to the user when a cluster is just a medium to run OSS runners (as applications) such as Flink or Spark. The cluster could also be running anywhere (on GCP), such as Dataproc, k8s, or even Dataflow itself.
>  * Clusters should be uniquely identified and thus should always have a distinct name. Clusters are managed (created/reused/deleted) behind the scenes by the notebook runtime when the user doesn't explicitly do so (the capability to explicitly manage clusters is still available). Reusing the same default cluster name is risky when a cluster is deleted by one notebook runtime while another cluster with the same name is created by a different notebook runtime.
>  * Provide the capability for the user to explicitly provision a cluster. The current implementation provisions each cluster at the location specified by GoogleCloudOptions using 3 worker nodes; there is no explicit API to configure the number or shape of workers.
> We could use WorkerOptions to let customers explicitly provision a cluster and expose an explicit API (with UX in the notebook extension) for customers to change the size of a cluster connected to their notebook (until we have an autoscaling solution with Dataproc for Flink).
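The two ideas above, distinct cluster names and WorkerOptions-driven provisioning, might be sketched as follows (a plain-Python illustration: the dataclass mirrors a couple of WorkerOptions fields, and the config shape is loosely Dataproc-like, not Beam's actual implementation):

```python
import uuid
from dataclasses import dataclass


@dataclass
class WorkerOptionsSketch:
  # Stand-ins for the WorkerOptions fields relevant to cluster sizing.
  num_workers: int = 3
  machine_type: str = 'n1-standard-4'


def unique_cluster_name(prefix: str = 'interactive-beam') -> str:
  """A distinct name per cluster, so two notebook runtimes never collide."""
  return f'{prefix}-{uuid.uuid4().hex[:8]}'


def cluster_config(opts: WorkerOptionsSketch) -> dict:
  """Map worker options onto a Dataproc-style cluster config (illustrative)."""
  return {
      'cluster_name': unique_cluster_name(),
      'worker_config': {
          # Clamp to a small floor; managed services typically require a
          # minimum worker count (an assumption here, not a checked limit).
          'num_instances': max(opts.num_workers, 2),
          'machine_type_uri': opts.machine_type,
      },
  }
```

With this shape, the runtime can create clusters implicitly with generated names while still letting an explicit API accept user-supplied worker options.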



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
