HyukjinKwon commented on a change in pull request #28085:
URL: https://github.com/apache/spark/pull/28085#discussion_r412644686
##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1135,6 +1136,27 @@ private[spark] class DAGScheduler(
}
}
+ /**
+ * `PythonRunner` needs to know what the pyspark memory and cores settings are for the profile
+ * being run. Pass them in the local properties of the task if it's set for the stage profile.
+ */
+ private def addPysparkConfigsToProperties(stage: Stage, properties: Properties): Unit = {
Review comment:
sorry, nit: `addPysparkConfigsToProperties` -> `addPySparkConfigsToProperties`
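For context, the values this helper injects travel with the task as local properties, which should also be visible from user code via `TaskContext.getLocalProperty`. A minimal sketch of reading such a property inside a task; the key name "resource.pyspark.memory" is a placeholder for illustration only, not necessarily the key this patch sets:

    from pyspark import SparkContext, TaskContext

    def show_pyspark_mem(it):
        # Placeholder key; the real property name set by the scheduler may differ.
        yield TaskContext.get().getLocalProperty("resource.pyspark.memory")

    sc = SparkContext.getOrCreate()
    print(sc.parallelize(range(2), 2).mapPartitions(show_pyspark_mem).collect())

getLocalProperty returns None when nothing was set for the stage, so without a stage-level profile this prints [None, None].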
##########
File path: python/pyspark/resource/executorrequests.py
##########
@@ -0,0 +1,169 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.resource.taskrequests import TaskResourceRequest
+from pyspark.util import _parse_memory
+
+
+class ExecutorResourceRequest(object):
+ """
+ .. note:: Evolving
+
+ An Executor resource request. This is used in conjunction with the ResourceProfile to
+ programmatically specify the resources needed for an RDD that will be applied at the
+ stage level.
+
+ This is used to specify what the resource requirements are for an Executor and how
+ Spark can find out specific details about those resources. Not all the parameters are
+ required for every resource type. Resources like GPUs are supported and have same limitations
+ as using the global spark configs spark.executor.resource.gpu.*. The amount, discoveryScript,
+ and vendor parameters for resources are all the same parameters a user would specify through the
+ configs: spark.executor.resource.{resourceName}.{amount, discoveryScript, vendor}.
+
+ For instance, a user wants to allocate an Executor with GPU resources on YARN. The user has
+ to specify the resource name (gpu), the amount or number of GPUs per Executor,
+ the discovery script would be specified so that when the Executor starts up it can
+ discovery what GPU addresses are available for it to use because YARN doesn't tell
+ Spark that, then vendor would not be used because its specific for Kubernetes.
+
+ See the configuration and cluster specific docs for more details.
+
+ Use `pyspark.ExecutorResourceRequests` class as a convenience API.
+
+ :param resourceName: Name of the resource
+ :param amount: Amount requesting
+ :param discoveryScript: Optional script used to discover the resources. This is required on some
+ cluster managers that don't tell Spark the addresses of the resources
+ allocated. The script runs on Executors startup to discover the addresses
+ of the resources available.
+ :param vendor: Vendor, required for some cluster managers
+
+ .. versionadded:: 3.1.0
+ """
+ def __init__(self, resourceName, amount, discoveryScript="", vendor=""):
+ self._name = resourceName
+ self._amount = amount
+ self._discoveryScript = discoveryScript
Review comment:
nit `_discoveryScript` -> `_discovery_script`
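For readers following the API, a minimal construction sketch matching the constructor in this hunk; it assumes the class is importable from pyspark.resource (otherwise use the executorrequests module path added by this PR), and the GPU values are illustrative only:

    from pyspark.resource import ExecutorResourceRequest

    # Ask for 2 GPUs per executor; the discovery script is needed on cluster
    # managers (e.g. YARN) that do not report the allocated GPU addresses.
    gpu = ExecutorResourceRequest(
        resourceName="gpu",
        amount=2,
        discoveryScript="/opt/spark/examples/src/main/scripts/getGpusResources.sh")

    print(gpu.resourceName, gpu.amount, gpu.discoveryScript)

This mirrors what the spark.executor.resource.gpu.{amount, discoveryScript} configs would set globally, but as a per-profile object.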
##########
File path: python/pyspark/rdd.py
##########
@@ -2587,6 +2616,7 @@ def pipeline_func(split, iterator):
self._prev_jrdd = prev._prev_jrdd # maintain the pipeline
self._prev_jrdd_deserializer = prev._prev_jrdd_deserializer
self.is_cached = False
+ self.has_resourceProfile = False
Review comment:
nit: `has_resourceProfile` -> `has_resource_profile`. Basically the logic is that we should keep everything `a_b_c` when it's not an API inspired by the Java (or Scala) API. IIRC, this logic complies with PEP 8.
##########
File path: python/pyspark/resource/executorrequests.py
##########
@@ -0,0 +1,169 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.resource.taskrequests import TaskResourceRequest
+from pyspark.util import _parse_memory
+
+
+class ExecutorResourceRequest(object):
+ """
+ .. note:: Evolving
+
+ An Executor resource request. This is used in conjunction with the ResourceProfile to
+ programmatically specify the resources needed for an RDD that will be applied at the
+ stage level.
+
+ This is used to specify what the resource requirements are for an Executor and how
+ Spark can find out specific details about those resources. Not all the parameters are
+ required for every resource type. Resources like GPUs are supported and have same limitations
+ as using the global spark configs spark.executor.resource.gpu.*. The amount, discoveryScript,
+ and vendor parameters for resources are all the same parameters a user would specify through the
+ configs: spark.executor.resource.{resourceName}.{amount, discoveryScript, vendor}.
+
+ For instance, a user wants to allocate an Executor with GPU resources on YARN. The user has
+ to specify the resource name (gpu), the amount or number of GPUs per Executor,
+ the discovery script would be specified so that when the Executor starts up it can
+ discovery what GPU addresses are available for it to use because YARN doesn't tell
+ Spark that, then vendor would not be used because its specific for Kubernetes.
+
+ See the configuration and cluster specific docs for more details.
+
+ Use `pyspark.ExecutorResourceRequests` class as a convenience API.
+
+ :param resourceName: Name of the resource
+ :param amount: Amount requesting
+ :param discoveryScript: Optional script used to discover the resources. This is required on some
+ cluster managers that don't tell Spark the addresses of the resources
+ allocated. The script runs on Executors startup to discover the addresses
+ of the resources available.
+ :param vendor: Vendor, required for some cluster managers
+
+ .. versionadded:: 3.1.0
+ """
+ def __init__(self, resourceName, amount, discoveryScript="", vendor=""):
+ self._name = resourceName
+ self._amount = amount
+ self._discoveryScript = discoveryScript
+ self._vendor = vendor
+
+ @property
+ def resourceName(self):
+ return self._name
+
+ @property
+ def amount(self):
+ return self._amount
+
+ @property
+ def discoveryScript(self):
+ return self._discoveryScript
+
+ @property
+ def vendor(self):
+ return self._vendor
+
+
+class ExecutorResourceRequests(object):
+
+ """
+ .. note:: Evolving
+
+ A set of Executor resource requests. This is used in conjunction with the
+ ResourceProfileBuilder to programmatically specify the resources needed for an RDD
Review comment:
nit: `` :class:`pyspark.resource.ResourceProfileBuilder` ``
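To make the builder relationship concrete, a hedged end-to-end sketch of the stage-level usage this class is aimed at; the fluent method names (cores, memory, require), the build property, and RDD.withResources follow the API this PR is building toward and are assumptions here rather than things shown in the hunk:

    from pyspark import SparkContext
    from pyspark.resource import ExecutorResourceRequests, ResourceProfileBuilder

    # Executors for this stage should have 4 cores and 8g of heap each.
    reqs = ExecutorResourceRequests().cores(4).memory("8g")
    profile = ResourceProfileBuilder().require(reqs).build

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize(range(100)).withResources(profile)
    print(rdd.getResourceProfile())

Actually acquiring those executors only happens when the job runs on a cluster manager with dynamic allocation enabled; building and attaching the profile itself is cheap.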
##########
File path: python/pyspark/resource/executorrequests.py
##########
@@ -0,0 +1,169 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.resource.taskrequests import TaskResourceRequest
+from pyspark.util import _parse_memory
+
+
+class ExecutorResourceRequest(object):
+ """
+ .. note:: Evolving
+
+ An Executor resource request. This is used in conjunction with the ResourceProfile to
+ programmatically specify the resources needed for an RDD that will be applied at the
+ stage level.
+
+ This is used to specify what the resource requirements are for an Executor and how
+ Spark can find out specific details about those resources. Not all the parameters are
+ required for every resource type. Resources like GPUs are supported and have same limitations
+ as using the global spark configs spark.executor.resource.gpu.*. The amount, discoveryScript,
+ and vendor parameters for resources are all the same parameters a user would specify through the
+ configs: spark.executor.resource.{resourceName}.{amount, discoveryScript, vendor}.
+
+ For instance, a user wants to allocate an Executor with GPU resources on YARN. The user has
+ to specify the resource name (gpu), the amount or number of GPUs per Executor,
+ the discovery script would be specified so that when the Executor starts up it can
+ discovery what GPU addresses are available for it to use because YARN doesn't tell
+ Spark that, then vendor would not be used because its specific for Kubernetes.
+
+ See the configuration and cluster specific docs for more details.
+
+ Use `pyspark.ExecutorResourceRequests` class as a convenience API.
+
+ :param resourceName: Name of the resource
+ :param amount: Amount requesting
+ :param discoveryScript: Optional script used to discover the resources. This is required on some
+ cluster managers that don't tell Spark the addresses of the resources
+ allocated. The script runs on Executors startup to discover the addresses
+ of the resources available.
+ :param vendor: Vendor, required for some cluster managers
+
+ .. versionadded:: 3.1.0
+ """
+ def __init__(self, resourceName, amount, discoveryScript="", vendor=""):
+ self._name = resourceName
+ self._amount = amount
+ self._discoveryScript = discoveryScript
+ self._vendor = vendor
+
+ @property
+ def resourceName(self):
+ return self._name
+
+ @property
+ def amount(self):
+ return self._amount
+
+ @property
+ def discoveryScript(self):
+ return self._discoveryScript
+
+ @property
+ def vendor(self):
+ return self._vendor
+
+
+class ExecutorResourceRequests(object):
+
+ """
+ .. note:: Evolving
+
+ A set of Executor resource requests. This is used in conjunction with the
+ ResourceProfileBuilder to programmatically specify the resources needed for an RDD
+ that will be applied at the stage level.
+
+ .. versionadded:: 3.1.0
+ """
+ _CORES = "cores"
+ _MEMORY = "memory"
+ _OVERHEAD_MEM = "memoryOverhead"
+ _PYSPARK_MEM = "pyspark.memory"
Review comment:
maybe `pyspark.memory` -> `pysparkMemory`. Was it a mistake, or was there a reason for this naming?
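For reference, the memory-flavored entries here are specified as JVM-style size strings and normalized to MiB via the _parse_memory helper imported at the top of this file; a quick sketch (the exact return values assume the helper behaves like the long-standing one in pyspark.rdd):

    from pyspark.util import _parse_memory

    # Size strings must carry a k/m/g/t suffix and come back as MiB integers.
    print(_parse_memory("2g"))    # 2048
    print(_parse_memory("512m"))  # 512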
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]