HyukjinKwon commented on a change in pull request #28085:
URL: https://github.com/apache/spark/pull/28085#discussion_r412644686
##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1135,6 +1136,27 @@ private[spark] class DAGScheduler(
}
}
+ /**
+ * `PythonRunner` needs to know what the pyspark memory and cores settings are for the profile
+ * being run. Pass them in the local properties of the task if it's set for the stage profile.
+ */
+ private def addPysparkConfigsToProperties(stage: Stage, properties: Properties): Unit = {
Review comment:
sorry, nit: `addPysparkConfigsToProperties` -> `addPySparkConfigsToProperties`
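For context, the values this helper injects travel with the task as local properties, which should also be visible from user code via `TaskContext.getLocalProperty`. A minimal sketch of reading such a property inside a task; the key name "resource.pyspark.memory" is a placeholder for illustration only, not necessarily the key this patch sets:

    from pyspark import SparkContext, TaskContext

    def show_pyspark_mem(it):
        # Placeholder key; the real property name set by the scheduler may differ.
        yield TaskContext.get().getLocalProperty("resource.pyspark.memory")

    sc = SparkContext.getOrCreate()
    print(sc.parallelize(range(2), 2).mapPartitions(show_pyspark_mem).collect())

getLocalProperty returns None when nothing was set for the stage, so without a stage-level profile this prints [None, None].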
##########
File path: python/pyspark/resource/executorrequests.py
##########
@@ -0,0 +1,169 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.resource.taskrequests import TaskResourceRequest
+from pyspark.util import _parse_memory
+
+
+class ExecutorResourceRequest(object):
+ """
+ .. note:: Evolving
+
+ An Executor resource request. This is used in conjunction with the ResourceProfile to
+ programmatically specify the resources needed for an RDD that will be applied at the
+ stage level.
+
+ This is used to specify what the resource requirements are for an Executor and how
+ Spark can find out specific details about those resources. Not all the parameters are
+ required for every resource type. Resources like GPUs are supported and have same limitations
+ as using the global spark configs spark.executor.resource.gpu.*. The amount, discoveryScript,
+ and vendor parameters for resources are all the same parameters a user would specify through the
+ configs: spark.executor.resource.{resourceName}.{amount, discoveryScript, vendor}.
+
+ For instance, a user wants to allocate an Executor with GPU resources on YARN. The user has
+ to specify the resource name (gpu), the amount or number of GPUs per Executor,
+ the discovery script would be specified so that when the Executor starts up it can
+ discovery what GPU addresses are available for it to use because YARN doesn't tell
+ Spark that, then vendor would not be used because its specific for Kubernetes.
+
+ See the configuration and cluster specific docs for more details.
+
+ Use `pyspark.ExecutorResourceRequests` class as a convenience API.
+
+ :param resourceName: Name of the resource
+ :param amount: Amount requesting
+ :param discoveryScript: Optional script used to discover the resources. This is required on some
+ cluster managers that don't tell Spark the addresses of the resources
+ allocated. The script runs on Executors startup to discover the addresses
+ of the resources available.
+ :param vendor: Vendor, required for some cluster managers
+
+ .. versionadded:: 3.1.0
+ """
+ def __init__(self, resourceName, amount, discoveryScript="", vendor=""):
+ self._name = resourceName
+ self._amount = amount
+ self._discoveryScript = discoveryScript
Review comment:
nit `_discoveryScript` -> `_discovery_script`
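For readers following the API, a minimal construction sketch matching the constructor in this hunk; it assumes the class is importable from pyspark.resource (otherwise use the executorrequests module path added by this PR), and the GPU values are illustrative only:

    from pyspark.resource import ExecutorResourceRequest

    # Ask for 2 GPUs per executor; the discovery script is needed on cluster
    # managers (e.g. YARN) that do not report the allocated GPU addresses.
    gpu = ExecutorResourceRequest(
        resourceName="gpu",
        amount=2,
        discoveryScript="/opt/spark/examples/src/main/scripts/getGpusResources.sh")

    print(gpu.resourceName, gpu.amount, gpu.discoveryScript)

This mirrors what the spark.executor.resource.gpu.{amount, discoveryScript} configs would set globally, but as a per-profile object.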
##########
File path: python/pyspark/rdd.py
##########
@@ -2587,6 +2616,7 @@ def pipeline_func(split, iterator):
self._prev_jrdd = prev._prev_jrdd # maintain the pipeline
self._prev_jrdd_deserializer = prev._prev_jrdd_deserializer
self.is_cached = False
+ self.has_resourceProfile = False
Review comment:
nit: `has_resourceProfile` -> `has_resource_profile`. Basically the logic is that we should keep everything `a_b_c` when it's not an API inspired by the Java (or Scala) API. IIRC, this logic complies with PEP 8.
##########
File path: python/pyspark/resource/executorrequests.py
##########
@@ -0,0 +1,169 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.resource.taskrequests import TaskResourceRequest
+from pyspark.util import _parse_memory
+
+
+class ExecutorResourceRequest(object):
+ """
+ .. note:: Evolving
+
+ An Executor resource request. This is used in conjunction with the ResourceProfile to
+ programmatically specify the resources needed for an RDD that will be applied at the
+ stage level.
+
+ This is used to specify what the resource requirements are for an Executor and how
+ Spark can find out specific details about those resources. Not all the parameters are
+ required for every resource type. Resources like GPUs are supported and have same limitations
+ as using the global spark configs spark.executor.resource.gpu.*. The amount, discoveryScript,
+ and vendor parameters for resources are all the same parameters a user would specify through the
+ configs: spark.executor.resource.{resourceName}.{amount, discoveryScript, vendor}.
+
+ For instance, a user wants to allocate an Executor with GPU resources on YARN. The user has
+ to specify the resource name (gpu), the amount or number of GPUs per Executor,
+ the discovery script would be specified so that when the Executor starts up it can
+ discovery what GPU addresses are available for it to use because YARN doesn't tell
+ Spark that, then vendor would not be used because its specific for Kubernetes.
+
+ See the configuration and cluster specific docs for more details.
+
+ Use `pyspark.ExecutorResourceRequests` class as a convenience API.
+
+ :param resourceName: Name of the resource
+ :param amount: Amount requesting
+ :param discoveryScript: Optional script used to discover the resources. This is required on some
+ cluster managers that don't tell Spark the addresses of the resources
+ allocated. The script runs on Executors startup to discover the addresses
+ of the resources available.
+ :param vendor: Vendor, required for some cluster managers
+
+ .. versionadded:: 3.1.0
+ """
+ def __init__(self, resourceName, amount, discoveryScript="", vendor=""):
+ self._name = resourceName
+ self._amount = amount
+ self._discoveryScript = discoveryScript
+ self._vendor = vendor
+
+ @property
+ def resourceName(self):
+ return self._name
+
+ @property
+ def amount(self):
+ return self._amount
+
+ @property
+ def discoveryScript(self):
+ return self._discoveryScript
+
+ @property
+ def vendor(self):
+ return self._vendor
+
+
+class ExecutorResourceRequests(object):
+
+ """
+ .. note:: Evolving
+
+ A set of Executor resource requests. This is used in conjunction with the
+ ResourceProfileBuilder to programmatically specify the resources needed for an RDD
Review comment:
nit: `` :class:`pyspark.resource.ResourceProfileBuilder` ``
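To make the builder relationship concrete, a hedged end-to-end sketch of the stage-level usage this class is aimed at; the fluent method names (cores, memory, require), the build property, and RDD.withResources follow the API this PR is building toward and are assumptions here rather than things shown in the hunk:

    from pyspark import SparkContext
    from pyspark.resource import ExecutorResourceRequests, ResourceProfileBuilder

    # Executors for this stage should have 4 cores and 8g of heap each.
    reqs = ExecutorResourceRequests().cores(4).memory("8g")
    profile = ResourceProfileBuilder().require(reqs).build

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize(range(100)).withResources(profile)
    print(rdd.getResourceProfile())

Actually acquiring those executors only happens when the job runs on a cluster manager with dynamic allocation enabled; building and attaching the profile itself is cheap.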
##########
File path: python/pyspark/resource/executorrequests.py
##########
@@ -0,0 +1,169 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.resource.taskrequests import TaskResourceRequest
+from pyspark.util import _parse_memory
+
+
+class ExecutorResourceRequest(object):
+ """
+ .. note:: Evolving
+
+ An Executor resource request. This is used in conjunction with the ResourceProfile to
+ programmatically specify the resources needed for an RDD that will be applied at the
+ stage level.
+
+ This is used to specify what the resource requirements are for an Executor and how
+ Spark can find out specific details about those resources. Not all the parameters are
+ required for every resource type. Resources like GPUs are supported and have same limitations
+ as using the global spark configs spark.executor.resource.gpu.*. The amount, discoveryScript,
+ and vendor parameters for resources are all the same parameters a user would specify through the
+ configs: spark.executor.resource.{resourceName}.{amount, discoveryScript, vendor}.
+
+ For instance, a user wants to allocate an Executor with GPU resources on YARN. The user has
+ to specify the resource name (gpu), the amount or number of GPUs per Executor,
+ the discovery script would be specified so that when the Executor starts up it can
+ discovery what GPU addresses are available for it to use because YARN doesn't tell
+ Spark that, then vendor would not be used because its specific for Kubernetes.
+
+ See the configuration and cluster specific docs for more details.
+
+ Use `pyspark.ExecutorResourceRequests` class as a convenience API.
+
+ :param resourceName: Name of the resource
+ :param amount: Amount requesting
+ :param discoveryScript: Optional script used to discover the resources. This is required on some
+ cluster managers that don't tell Spark the addresses of the resources
+ allocated. The script runs on Executors startup to discover the addresses
+ of the resources available.
+ :param vendor: Vendor, required for some cluster managers
+
+ .. versionadded:: 3.1.0
+ """
+ def __init__(self, resourceName, amount, discoveryScript="", vendor=""):
+ self._name = resourceName
+ self._amount = amount
+ self._discoveryScript = discoveryScript
+ self._vendor = vendor
+
+ @property
+ def resourceName(self):
+ return self._name
+
+ @property
+ def amount(self):
+ return self._amount
+
+ @property
+ def discoveryScript(self):
+ return self._discoveryScript
+
+ @property
+ def vendor(self):
+ return self._vendor
+
+
+class ExecutorResourceRequests(object):
+
+ """
+ .. note:: Evolving
+
+ A set of Executor resource requests. This is used in conjunction with the
+ ResourceProfileBuilder to programmatically specify the resources needed for an RDD
+ that will be applied at the stage level.
+
+ .. versionadded:: 3.1.0
+ """
+ _CORES = "cores"
+ _MEMORY = "memory"
+ _OVERHEAD_MEM = "memoryOverhead"
+ _PYSPARK_MEM = "pyspark.memory"
Review comment:
maybe `pyspark.memory` -> `pysparkMemory`. Was it a mistake, or was there a reason for this naming?
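For reference, the memory-flavored entries here are specified as JVM-style size strings and normalized to MiB via the _parse_memory helper imported at the top of this file; a quick sketch (the exact return values assume the helper behaves like the long-standing one in pyspark.rdd):

    from pyspark.util import _parse_memory

    # Size strings must carry a k/m/g/t suffix and come back as MiB integers.
    print(_parse_memory("2g"))    # 2048
    print(_parse_memory("512m"))  # 512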
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]