[spark] branch master updated (78ed4cc -> 7630787)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 78ed4cc [SPARK-38575][INFRA] Deduplicate branch specification in GitHub Actions workflow add 7630787 [SPARK-38575][INFRA][FOLLOW-UP] Fix ** to '**' in ansi_sql_mode_test.yml No new revisions were added by this update. Summary of changes: .github/workflows/ansi_sql_mode_test.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (7d1ff01 -> 78ed4cc)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 7d1ff01 [SPARK-38556][PYTHON] Disable Pandas usage logging for method calls inside @contextmanager functions add 78ed4cc [SPARK-38575][INFRA] Deduplicate branch specification in GitHub Actions workflow No new revisions were added by this update. Summary of changes: .github/workflows/ansi_sql_mode_test.yml | 2 +- .github/workflows/build_and_test.yml | 21 +++-- 2 files changed, 12 insertions(+), 11 deletions(-)
[spark] branch branch-3.3 updated: [SPARK-38556][PYTHON] Disable Pandas usage logging for method calls inside @contextmanager functions
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.3 by this push: new c284faa [SPARK-38556][PYTHON] Disable Pandas usage logging for method calls inside @contextmanager functions c284faa is described below commit c284faad2d7d3b813c1c94c612b814c129b6dad3 Author: Yihong He AuthorDate: Thu Mar 17 10:03:42 2022 +0900 [SPARK-38556][PYTHON] Disable Pandas usage logging for method calls inside @contextmanager functions ### What changes were proposed in this pull request? Wrap the AbstractContextManager returned by @contextmanager-decorated functions when they are called. The comment in the code change explains why it uses a wrapper class instead of wrapping the functions of AbstractContextManager directly. ### Why are the changes needed? Currently, method calls inside contextmanager functions are treated as external for **with** statements. For example, the code below records config.set_option calls inside ps.option_context(...) ```python with ps.option_context("compute.ops_on_diff_frames", True): pass ``` We should disable usage logging for calls inside contextmanager functions to improve the accuracy of the usage data. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Existing tests - Manual test by running `./bin/pyspark` and verified the output: ``` >>> sc.setLogLevel("info") >>> import pyspark.pandas as ps 22/03/15 17:10:50 INFO Log4jUsageLogger: pandasOnSparkImported=1.0, tags=List(), blob= >>> with ps.option_context("compute.ops_on_diff_frames", True): ... pass ... 
22/03/15 17:11:17 INFO Log4jUsageLogger: pandasOnSparkFunctionCalled=1.0, tags=List(pandasOnSparkFunction=option_context(*args: Any) -> Iterator[NoneType], className=config, status=success), blob={"duration": 0.161525994123} 22/03/15 17:11:18 INFO Log4jUsageLogger: initialConfigLogging=1.0, tags=List(sparkApplicationId=local-1647360645198, sparkExecutionId=null, sparkJobGroupId=null), blob={"spark.sql.warehouse.dir":"file:/Users/yihong.he/spark/spark-warehouse","spark.executor.extraJavaOptions":"-XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL [...] 22/03/15 17:11:19 INFO Log4jUsageLogger: pandasOnSparkFunctionCalled=1.0, tags=List(pandasOnSparkFunction=option_context.__enter__(), className=config, status=success), blob={"duration": 1594.156939978} 22/03/15 17:11:19 INFO Log4jUsageLogger: pandasOnSparkFunctionCalled=1.0, tags=List(pandasOnSparkFunction=option_context.__exit__(type, value, traceback), className=config, status=success), blob={"duration": 12.61017002086} ``` Closes #35861 from heyihong/SPARK-38556. 
Authored-by: Yihong He Signed-off-by: Hyukjin Kwon (cherry picked from commit 7d1ff01299c88a1aadfac032ea0b3ef87f4ae50d) Signed-off-by: Hyukjin Kwon --- python/pyspark/instrumentation_utils.py | 30 ++ 1 file changed, 30 insertions(+) diff --git a/python/pyspark/instrumentation_utils.py b/python/pyspark/instrumentation_utils.py index 908f5cb..b9aacf6 100644 --- a/python/pyspark/instrumentation_utils.py +++ b/python/pyspark/instrumentation_utils.py @@ -21,6 +21,7 @@ import inspect import threading import importlib import time +from contextlib import AbstractContextManager from types import ModuleType from typing import Tuple, Union, List, Callable, Any, Type @@ -30,6 +31,24 @@ __all__: List[str] = [] _local = threading.local() +class _WrappedAbstractContextManager(AbstractContextManager): +def __init__( +self, acm: AbstractContextManager, class_name: str, function_name: str, logger: Any +): +self._enter_func = _wrap_function( +class_name, "{}.__enter__".format(function_name), acm.__enter__, logger +) +self._exit_func = _wrap_function( +class_name, "{}.__exit__".format(function_name), acm.__exit__, logger +) + +def __enter__(self): # type: ignore[no-untyped-def] +return self._enter_func() + +def __exit__(self, exc_type, exc_val, exc_tb): # type: ignore[no-untyped-def] +return self._exit_func(exc_type, exc_val, exc_tb) + + def _wrap_function(class_name: str, function_name: str, func: Callable, logger: Any) -> Callable: signature = inspect.signature(func) @@ -44,6 +63,17 @@ def _wrap_function(class_name: str, function_name: str, func: Callable, logger: start = time.perf_counter() try: res = func(*args, **kwargs) +if isinst
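The wrapper class added by this patch can be illustrated with a self-contained sketch. Here `_wrap_function` is a toy stand-in for the real helper in `pyspark/instrumentation_utils.py` (which reports to the usage logger); the point is that only `__enter__` and `__exit__` are instrumented, so calls made inside the `@contextmanager` body no longer show up as if they were user calls:

```python
import time
from contextlib import AbstractContextManager, contextmanager


def _wrap_function(class_name, function_name, func, logger):
    # Toy stand-in for pyspark.instrumentation_utils._wrap_function:
    # times the call and records (class, function, duration) in `logger`.
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        res = func(*args, **kwargs)
        logger.append((class_name, function_name, time.perf_counter() - start))
        return res
    return wrapper


class _WrappedAbstractContextManager(AbstractContextManager):
    # Wraps an existing context manager so that only its __enter__ and
    # __exit__ are logged, not the work done inside the @contextmanager body.
    def __init__(self, acm, class_name, function_name, logger):
        self._enter_func = _wrap_function(
            class_name, f"{function_name}.__enter__", acm.__enter__, logger)
        self._exit_func = _wrap_function(
            class_name, f"{function_name}.__exit__", acm.__exit__, logger)

    def __enter__(self):
        return self._enter_func()

    def __exit__(self, exc_type, exc_val, exc_tb):
        return self._exit_func(exc_type, exc_val, exc_tb)


@contextmanager
def option_context():
    # Hypothetical stand-in for ps.option_context: set_option calls made
    # here would previously have been logged as external calls.
    yield


log = []
with _WrappedAbstractContextManager(option_context(), "config", "option_context", log):
    pass
print([entry[1] for entry in log])
```

This mirrors the log lines shown in the manual test above, where only `option_context.__enter__()` and `option_context.__exit__(...)` are recorded.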
[spark] branch master updated (b348acd -> 7d1ff01)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from b348acd [SPARK-38441][PYTHON] Support string and bool `regex` in `Series.replace` add 7d1ff01 [SPARK-38556][PYTHON] Disable Pandas usage logging for method calls inside @contextmanager functions No new revisions were added by this update. Summary of changes: python/pyspark/instrumentation_utils.py | 30 ++ 1 file changed, 30 insertions(+)
[spark] branch master updated (b16a9e9 -> b348acd)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from b16a9e9 [SPARK-38572][BUILD] Setting version to 3.4.0-SNAPSHOT add b348acd [SPARK-38441][PYTHON] Support string and bool `regex` in `Series.replace` No new revisions were added by this update. Summary of changes: python/pyspark/pandas/series.py| 72 ++ python/pyspark/pandas/tests/test_series.py | 29 ++-- 2 files changed, 89 insertions(+), 12 deletions(-)
[spark] branch master updated (6d3e8eb -> b16a9e9)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 6d3e8eb [SPARK-38555][NETWORK][SHUFFLE] Avoid contention and get or create clientPools quickly in the TransportClientFactory add b16a9e9 [SPARK-38572][BUILD] Setting version to 3.4.0-SNAPSHOT No new revisions were added by this update. Summary of changes: R/pkg/DESCRIPTION | 2 +- assembly/pom.xml | 2 +- common/kvstore/pom.xml | 2 +- common/network-common/pom.xml | 2 +- common/network-shuffle/pom.xml | 2 +- common/network-yarn/pom.xml| 2 +- common/sketch/pom.xml | 2 +- common/tags/pom.xml| 2 +- common/unsafe/pom.xml | 2 +- core/pom.xml | 2 +- docs/_config.yml | 4 ++-- examples/pom.xml | 2 +- external/avro/pom.xml | 2 +- external/docker-integration-tests/pom.xml | 2 +- external/kafka-0-10-assembly/pom.xml | 2 +- external/kafka-0-10-sql/pom.xml| 2 +- external/kafka-0-10-token-provider/pom.xml | 2 +- external/kafka-0-10/pom.xml| 2 +- external/kinesis-asl-assembly/pom.xml | 2 +- external/kinesis-asl/pom.xml | 2 +- external/spark-ganglia-lgpl/pom.xml| 2 +- graphx/pom.xml | 2 +- hadoop-cloud/pom.xml | 2 +- launcher/pom.xml | 2 +- mllib-local/pom.xml| 2 +- mllib/pom.xml | 2 +- pom.xml| 2 +- project/MimaExcludes.scala | 5 + python/pyspark/version.py | 2 +- repl/pom.xml | 2 +- resource-managers/kubernetes/core/pom.xml | 2 +- resource-managers/kubernetes/integration-tests/pom.xml | 2 +- resource-managers/mesos/pom.xml| 2 +- resource-managers/yarn/pom.xml | 2 +- sql/catalyst/pom.xml | 2 +- sql/core/pom.xml | 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml | 2 +- streaming/pom.xml | 2 +- tools/pom.xml | 2 +- 40 files changed, 45 insertions(+), 40 deletions(-)
[spark] branch master updated: [SPARK-38555][NETWORK][SHUFFLE] Avoid contention and get or create clientPools quickly in the TransportClientFactory
This is an automated email from the ASF dual-hosted git repository. mridulm80 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 6d3e8eb [SPARK-38555][NETWORK][SHUFFLE] Avoid contention and get or create clientPools quickly in the TransportClientFactory 6d3e8eb is described below commit 6d3e8eba055bb2809f17d74aa3442b18bf7beb16 Author: weixiuli AuthorDate: Wed Mar 16 17:01:33 2022 -0500 [SPARK-38555][NETWORK][SHUFFLE] Avoid contention and get or create clientPools quickly in the TransportClientFactory ### What changes were proposed in this pull request? Avoid contention and get or create clientPools quickly in the TransportClientFactory. ### Why are the changes needed? Avoid contention for getting or creating clientPools, and clean up the code. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing unittests. Closes #35860 from weixiuli/SPARK-38555-NETWORK. Authored-by: weixiuli Signed-off-by: Mridul Muralidharan --- .../org/apache/spark/network/client/TransportClientFactory.java | 8 ++-- 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java b/common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java index 43408d4..6fb9923 100644 --- a/common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java +++ b/common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java @@ -155,12 +155,8 @@ public class TransportClientFactory implements Closeable { InetSocketAddress.createUnresolved(remoteHost, remotePort); // Create the ClientPool if we don't have it yet. 
-ClientPool clientPool = connectionPool.get(unresolvedAddress); -if (clientPool == null) { - connectionPool.putIfAbsent(unresolvedAddress, new ClientPool(numConnectionsPerPeer)); - clientPool = connectionPool.get(unresolvedAddress); -} - +ClientPool clientPool = connectionPool.computeIfAbsent(unresolvedAddress, +key -> new ClientPool(numConnectionsPerPeer)); int clientIndex = rand.nextInt(numConnectionsPerPeer); TransportClient cachedClient = clientPool.clients[clientIndex];
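The diff above replaces a check-then-act sequence (get, putIfAbsent, get again) with a single map operation. A hypothetical Python stand-in for the two shapes — the real code uses Java's `ConcurrentHashMap`, whose `computeIfAbsent` locks only the affected bin rather than a global lock as this sketch does:

```python
import threading


class ClientPool:
    def __init__(self, num_connections):
        self.clients = [None] * num_connections


class ConnectionPool:
    # Hypothetical model of the factory's connectionPool map.
    def __init__(self):
        self._pools = {}
        self._lock = threading.Lock()

    def get_or_create_old(self, addr, n):
        # Old shape: correct, but touches the map up to three times and may
        # construct a ClientPool that is immediately thrown away when another
        # thread won the putIfAbsent race.
        pool = self._pools.get(addr)
        if pool is None:
            self._pools.setdefault(addr, ClientPool(n))  # putIfAbsent analogue
            pool = self._pools.get(addr)
        return pool

    def get_or_create(self, addr, n):
        # New shape: one computeIfAbsent-style operation; the pool is only
        # constructed when the key is genuinely missing.
        with self._lock:
            if addr not in self._pools:
                self._pools[addr] = ClientPool(n)
            return self._pools[addr]


pool = ConnectionPool()
a = pool.get_or_create(("shuffle-host", 7337), 2)
b = pool.get_or_create(("shuffle-host", 7337), 2)
assert a is b  # exactly one pool per remote address
```

Either shape returns the same pool for repeated lookups; the win is fewer map operations and no wasted `ClientPool` construction on the hot path.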
[spark] branch master updated (4ff40c1 -> 5967f29)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 4ff40c1 [SPARK-38561][K8S][DOCS] Add doc for `Customized Kubernetes Schedulers` add 5967f29 [SPARK-38545][BUILD] Upgrade scala-maven-plugin from 4.4.0 to 4.5.6 No new revisions were added by this update. Summary of changes: pom.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
[spark] branch master updated: [SPARK-38561][K8S][DOCS] Add doc for `Customized Kubernetes Schedulers`
This is an automated email from the ASF dual-hosted git repository. holden pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 4ff40c1 [SPARK-38561][K8S][DOCS] Add doc for `Customized Kubernetes Schedulers` 4ff40c1 is described below commit 4ff40c10f02f6e0735ce6554f7338489d8555bce Author: Yikun Jiang AuthorDate: Wed Mar 16 11:12:54 2022 -0700 [SPARK-38561][K8S][DOCS] Add doc for `Customized Kubernetes Schedulers` ### What changes were proposed in this pull request? This PR adds documentation for the basic framework capability for customized Kubernetes schedulers. ### Why are the changes needed? Guide users on how to use a custom scheduler with Spark on Kubernetes. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Closes #35869 from Yikun/SPARK-38561. Authored-by: Yikun Jiang Signed-off-by: Holden Karau --- docs/running-on-kubernetes.md | 19 +++ 1 file changed, 19 insertions(+) diff --git a/docs/running-on-kubernetes.md b/docs/running-on-kubernetes.md index de37e22..d1b2fcd 100644 --- a/docs/running-on-kubernetes.md +++ b/docs/running-on-kubernetes.md @@ -1713,6 +1713,25 @@ spec: image: will-be-overwritten ``` + Customized Kubernetes Schedulers for Spark on Kubernetes + +Spark allows users to specify a custom Kubernetes scheduler. + +1. Specify scheduler name. + + Users can specify a custom scheduler using spark.kubernetes.scheduler.name or + spark.kubernetes.{driver/executor}.scheduler.name configuration. + +2. Specify scheduler related configurations. + + To configure the custom scheduler the user can use [Pod templates](#pod-template), add labels (spark.kubernetes.{driver,executor}.label.*) and/or annotations (spark.kubernetes.{driver/executor}.annotation.*). + +3. Specify scheduler feature step. 
+ + Users may also consider using spark.kubernetes.{driver/executor}.pod.featureSteps to support more complex requirements, including but not limited to: + - Create additional Kubernetes custom resources for driver/executor scheduling. + - Set scheduler hints according to configuration or existing Pod info dynamically. + ### Stage Level Scheduling Overview Stage level scheduling is supported on Kubernetes when dynamic allocation is enabled. This also requires spark.dynamicAllocation.shuffleTracking.enabled to be enabled since Kubernetes doesn't support an external shuffle service at this time. The order in which containers for different profiles are requested from Kubernetes is not guaranteed. Note that since dynamic allocation on Kubernetes requires the shuffle tracking feature, this means that executors from previous stages t [...]
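The three documented steps amount to a handful of `spark.kubernetes.*` settings. A sketch of assembling them into a `spark-submit` invocation — the scheduler name `volcano`, the label/annotation names, and the feature-step class are illustrative placeholders, not values from this commit:

```python
# Placeholder values throughout: swap in your scheduler's actual name,
# labels/annotations, and feature-step class.
conf = {
    # 1. Scheduler name (also settable per role via
    #    spark.kubernetes.{driver,executor}.scheduler.name).
    "spark.kubernetes.scheduler.name": "volcano",
    # 2. Scheduler-related labels and/or annotations.
    "spark.kubernetes.driver.label.queue": "batch",
    "spark.kubernetes.driver.annotation.example.com/priority": "high",
    # 3. Feature steps for more complex requirements (custom resources,
    #    dynamic scheduler hints, ...).
    "spark.kubernetes.driver.pod.featureSteps": "org.example.MySchedulerFeatureStep",
}

# Render the settings as --conf flags for spark-submit.
cmd = ["spark-submit"] + [
    arg for key, value in sorted(conf.items())
    for arg in ("--conf", f"{key}={value}")
]
print(" ".join(cmd))
```

Pod templates remain an alternative for scheduler-specific pod fields that have no dedicated Spark configuration key.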
[spark] branch branch-3.3 updated: [SPARK-38194][YARN][MESOS][K8S] Make memory overhead factor configurable
This is an automated email from the ASF dual-hosted git repository. tgraves pushed a commit to branch branch-3.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.3 by this push: new 8405ec3 [SPARK-38194][YARN][MESOS][K8S] Make memory overhead factor configurable 8405ec3 is described below commit 8405ec352dbed6a3199fc2af3c60fae7186d15b5 Author: Adam Binford AuthorDate: Wed Mar 16 10:54:18 2022 -0500 [SPARK-38194][YARN][MESOS][K8S] Make memory overhead factor configurable ### What changes were proposed in this pull request? Add a new config to set the memory overhead factor for drivers and executors. Currently the memory overhead is hard coded to 10% (except in Kubernetes), and the only way to set it higher is to set it to a specific memory amount. ### Why are the changes needed? In dynamic environments where different people or use cases need different memory requirements, it would be helpful to set a higher memory overhead factor instead of having to set a higher specific memory overhead value. The kubernetes resource manager already makes this configurable. This makes it configurable across the board. ### Does this PR introduce _any_ user-facing change? No change to default behavior, just adds a new config users can change. ### How was this patch tested? New UT to check the memory calculation. Closes #35504 from Kimahriman/yarn-configurable-memory-overhead-factor. 
Authored-by: Adam Binford Signed-off-by: Thomas Graves (cherry picked from commit 71e2110b799220adc107c9ac5ce737281f2b65cc) Signed-off-by: Thomas Graves --- .../main/scala/org/apache/spark/SparkConf.scala| 4 +- .../org/apache/spark/internal/config/package.scala | 28 ++ docs/configuration.md | 30 ++- docs/running-on-kubernetes.md | 9 .../k8s/features/BasicDriverFeatureStep.scala | 13 +++-- .../k8s/features/BasicExecutorFeatureStep.scala| 7 ++- .../k8s/features/BasicDriverFeatureStepSuite.scala | 63 -- .../features/BasicExecutorFeatureStepSuite.scala | 54 +++ .../spark/deploy/rest/mesos/MesosRestServer.scala | 5 +- .../cluster/mesos/MesosSchedulerUtils.scala| 9 ++-- .../deploy/rest/mesos/MesosRestServerSuite.scala | 8 ++- .../org/apache/spark/deploy/yarn/Client.scala | 14 +++-- .../apache/spark/deploy/yarn/YarnAllocator.scala | 5 +- .../spark/deploy/yarn/YarnSparkHadoopUtil.scala| 5 +- .../spark/deploy/yarn/YarnAllocatorSuite.scala | 29 ++ 15 files changed, 248 insertions(+), 35 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/SparkConf.scala b/core/src/main/scala/org/apache/spark/SparkConf.scala index 5f37a1a..cf12174 100644 --- a/core/src/main/scala/org/apache/spark/SparkConf.scala +++ b/core/src/main/scala/org/apache/spark/SparkConf.scala @@ -636,7 +636,9 @@ private[spark] object SparkConf extends Logging { DeprecatedConfig("spark.blacklist.killBlacklistedExecutors", "3.1.0", "Please use spark.excludeOnFailure.killExcludedExecutors"), DeprecatedConfig("spark.yarn.blacklist.executor.launch.blacklisting.enabled", "3.1.0", -"Please use spark.yarn.executor.launch.excludeOnFailure.enabled") +"Please use spark.yarn.executor.launch.excludeOnFailure.enabled"), + DeprecatedConfig("spark.kubernetes.memoryOverheadFactor", "3.3.0", +"Please use spark.driver.memoryOverheadFactor and spark.executor.memoryOverheadFactor") ) Map(configs.map { cfg => (cfg.key -> cfg) } : _*) diff --git a/core/src/main/scala/org/apache/spark/internal/config/package.scala 
b/core/src/main/scala/org/apache/spark/internal/config/package.scala index dbec61a..ffe4501 100644 --- a/core/src/main/scala/org/apache/spark/internal/config/package.scala +++ b/core/src/main/scala/org/apache/spark/internal/config/package.scala @@ -105,6 +105,22 @@ package object config { .bytesConf(ByteUnit.MiB) .createOptional + private[spark] val DRIVER_MEMORY_OVERHEAD_FACTOR = +ConfigBuilder("spark.driver.memoryOverheadFactor") + .doc("Fraction of driver memory to be allocated as additional non-heap memory per driver " + +"process in cluster mode. This is memory that accounts for things like VM overheads, " + +"interned strings, other native overheads, etc. This tends to grow with the container " + +"size. This value defaults to 0.10 except for Kubernetes non-JVM jobs, which defaults to " + +"0.40. This is done as non-JVM tasks need more non-JVM heap space and such tasks " + +"commonly fail with \"Memory Overhead Exceeded\" errors. This preempts this error " + +"with a higher default. This value is ignored if spark.driver.memoryOverhead is set " + +"directly.") + .ve
[spark] branch master updated: [SPARK-38194][YARN][MESOS][K8S] Make memory overhead factor configurable
This is an automated email from the ASF dual-hosted git repository. tgraves pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 71e2110 [SPARK-38194][YARN][MESOS][K8S] Make memory overhead factor configurable 71e2110 is described below commit 71e2110b799220adc107c9ac5ce737281f2b65cc Author: Adam Binford AuthorDate: Wed Mar 16 10:54:18 2022 -0500 [SPARK-38194][YARN][MESOS][K8S] Make memory overhead factor configurable ### What changes were proposed in this pull request? Add a new config to set the memory overhead factor for drivers and executors. Currently the memory overhead is hard coded to 10% (except in Kubernetes), and the only way to set it higher is to set it to a specific memory amount. ### Why are the changes needed? In dynamic environments where different people or use cases need different memory requirements, it would be helpful to set a higher memory overhead factor instead of having to set a higher specific memory overhead value. The kubernetes resource manager already makes this configurable. This makes it configurable across the board. ### Does this PR introduce _any_ user-facing change? No change to default behavior, just adds a new config users can change. ### How was this patch tested? New UT to check the memory calculation. Closes #35504 from Kimahriman/yarn-configurable-memory-overhead-factor. 
Authored-by: Adam Binford Signed-off-by: Thomas Graves --- .../main/scala/org/apache/spark/SparkConf.scala| 4 +- .../org/apache/spark/internal/config/package.scala | 28 ++ docs/configuration.md | 30 ++- docs/running-on-kubernetes.md | 9 .../k8s/features/BasicDriverFeatureStep.scala | 13 +++-- .../k8s/features/BasicExecutorFeatureStep.scala| 7 ++- .../k8s/features/BasicDriverFeatureStepSuite.scala | 63 -- .../features/BasicExecutorFeatureStepSuite.scala | 54 +++ .../spark/deploy/rest/mesos/MesosRestServer.scala | 5 +- .../cluster/mesos/MesosSchedulerUtils.scala| 9 ++-- .../deploy/rest/mesos/MesosRestServerSuite.scala | 8 ++- .../org/apache/spark/deploy/yarn/Client.scala | 14 +++-- .../apache/spark/deploy/yarn/YarnAllocator.scala | 5 +- .../spark/deploy/yarn/YarnSparkHadoopUtil.scala| 5 +- .../spark/deploy/yarn/YarnAllocatorSuite.scala | 29 ++ 15 files changed, 248 insertions(+), 35 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/SparkConf.scala b/core/src/main/scala/org/apache/spark/SparkConf.scala index 5f37a1a..cf12174 100644 --- a/core/src/main/scala/org/apache/spark/SparkConf.scala +++ b/core/src/main/scala/org/apache/spark/SparkConf.scala @@ -636,7 +636,9 @@ private[spark] object SparkConf extends Logging { DeprecatedConfig("spark.blacklist.killBlacklistedExecutors", "3.1.0", "Please use spark.excludeOnFailure.killExcludedExecutors"), DeprecatedConfig("spark.yarn.blacklist.executor.launch.blacklisting.enabled", "3.1.0", -"Please use spark.yarn.executor.launch.excludeOnFailure.enabled") +"Please use spark.yarn.executor.launch.excludeOnFailure.enabled"), + DeprecatedConfig("spark.kubernetes.memoryOverheadFactor", "3.3.0", +"Please use spark.driver.memoryOverheadFactor and spark.executor.memoryOverheadFactor") ) Map(configs.map { cfg => (cfg.key -> cfg) } : _*) diff --git a/core/src/main/scala/org/apache/spark/internal/config/package.scala b/core/src/main/scala/org/apache/spark/internal/config/package.scala index dbec61a..ffe4501 100644 --- 
a/core/src/main/scala/org/apache/spark/internal/config/package.scala +++ b/core/src/main/scala/org/apache/spark/internal/config/package.scala @@ -105,6 +105,22 @@ package object config { .bytesConf(ByteUnit.MiB) .createOptional + private[spark] val DRIVER_MEMORY_OVERHEAD_FACTOR = +ConfigBuilder("spark.driver.memoryOverheadFactor") + .doc("Fraction of driver memory to be allocated as additional non-heap memory per driver " + +"process in cluster mode. This is memory that accounts for things like VM overheads, " + +"interned strings, other native overheads, etc. This tends to grow with the container " + +"size. This value defaults to 0.10 except for Kubernetes non-JVM jobs, which defaults to " + +"0.40. This is done as non-JVM tasks need more non-JVM heap space and such tasks " + +"commonly fail with \"Memory Overhead Exceeded\" errors. This preempts this error " + +"with a higher default. This value is ignored if spark.driver.memoryOverhead is set " + +"directly.") + .version("3.3.0") + .doubleConf + .checkValue(factor => factor > 0, +"Ensure that memory overhead is
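The interplay between the explicit overhead setting and the new factor can be sketched as simple sizing arithmetic. This sketch assumes Spark's long-standing 384 MiB minimum overhead and truncating multiplication; the exact code paths differ slightly per resource manager, so treat it as an approximation rather than the implementation:

```python
MIN_MEMORY_OVERHEAD_MIB = 384  # Spark's floor for computed overhead (assumption)


def container_memory_mib(memory_mib, overhead_factor=0.10, memory_overhead_mib=None):
    """Approximate requested container memory for a driver/executor.

    If spark.{driver,executor}.memoryOverhead is set it wins outright;
    otherwise the overhead is max(factor * memory, 384 MiB)."""
    if memory_overhead_mib is None:
        memory_overhead_mib = max(
            int(memory_mib * overhead_factor), MIN_MEMORY_OVERHEAD_MIB)
    return memory_mib + memory_overhead_mib


print(container_memory_mib(4096))        # 4 GiB heap + default 10% factor
print(container_memory_mib(4096, 0.40))  # the non-JVM-on-K8s default factor
print(container_memory_mib(1024))        # small heap: the 384 MiB floor applies
```

This also shows why the deprecated `spark.kubernetes.memoryOverheadFactor` was split into per-role `spark.driver.memoryOverheadFactor` and `spark.executor.memoryOverheadFactor` settings: the factor applies in the same formula for both roles.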
[spark] branch branch-3.3 updated: [SPARK-38567][INFRA][3.3] Enable GitHub Action build_and_test on branch-3.3
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.3 by this push: new 1ec220f [SPARK-38567][INFRA][3.3] Enable GitHub Action build_and_test on branch-3.3 1ec220f is described below commit 1ec220f029f90a6ab109ef87f7c17337038d91d3 Author: Max Gekk AuthorDate: Wed Mar 16 20:50:14 2022 +0900 [SPARK-38567][INFRA][3.3] Enable GitHub Action build_and_test on branch-3.3 ### What changes were proposed in this pull request? Like branch-3.2, this PR aims to update GitHub Action `build_and_test` in branch-3.3. ### Why are the changes needed? Currently, GitHub Action on branch-3.3 is not working. - https://github.com/apache/spark/commits/branch-3.3 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A Closes #35876 from MaxGekk/fix-github-actions-3.3. Authored-by: Max Gekk Signed-off-by: Hyukjin Kwon --- .github/workflows/ansi_sql_mode_test.yml | 2 +- .github/workflows/build_and_test.yml | 32 +--- 2 files changed, 10 insertions(+), 24 deletions(-) diff --git a/.github/workflows/ansi_sql_mode_test.yml b/.github/workflows/ansi_sql_mode_test.yml index e68b04b..cc4ac57 100644 --- a/.github/workflows/ansi_sql_mode_test.yml +++ b/.github/workflows/ansi_sql_mode_test.yml @@ -22,7 +22,7 @@ name: ANSI SQL mode test on: push: branches: - - master + - branch-3.3 jobs: ansi_sql_test: diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index ebe17b5..7baabc7 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -23,20 +23,6 @@ on: push: branches: - '**' -- '!branch-*.*' - schedule: -# master, Hadoop 2 -- cron: '0 1 * * *' -# master -- cron: '0 4 * * *' -# branch-3.2 -- cron: '0 7 * * *' -# PySpark coverage for master branch -- cron: '0 10 * * *' -# Java 11 -- cron: '0 13 * * *' -# Java 17 -- 
cron: '0 16 * * *' workflow_call: inputs: ansi_enabled: @@ -96,7 +82,7 @@ jobs: echo '::set-output name=hadoop::hadoop3' else echo '::set-output name=java::8' - echo '::set-output name=branch::master' # Default branch to run on. CHANGE here when a branch is cut out. + echo '::set-output name=branch::branch-3.3' # Default branch to run on. CHANGE here when a branch is cut out. echo '::set-output name=type::regular' echo '::set-output name=envs::{"SPARK_ANSI_SQL_MODE": "${{ inputs.ansi_enabled }}"}' echo '::set-output name=hadoop::hadoop3' @@ -115,7 +101,7 @@ jobs: with: fetch-depth: 0 repository: apache/spark -ref: master +ref: branch-3.3 - name: Sync the current branch with the latest in Apache Spark if: github.repository != 'apache/spark' run: | @@ -325,7 +311,7 @@ jobs: with: fetch-depth: 0 repository: apache/spark -ref: master +ref: branch-3.3 - name: Sync the current branch with the latest in Apache Spark if: github.repository != 'apache/spark' run: | @@ -413,7 +399,7 @@ jobs: with: fetch-depth: 0 repository: apache/spark -ref: master +ref: branch-3.3 - name: Sync the current branch with the latest in Apache Spark if: github.repository != 'apache/spark' run: | @@ -477,7 +463,7 @@ jobs: with: fetch-depth: 0 repository: apache/spark -ref: master +ref: branch-3.3 - name: Sync the current branch with the latest in Apache Spark if: github.repository != 'apache/spark' run: | @@ -590,7 +576,7 @@ jobs: with: fetch-depth: 0 repository: apache/spark -ref: master +ref: branch-3.3 - name: Sync the current branch with the latest in Apache Spark if: github.repository != 'apache/spark' run: | @@ -639,7 +625,7 @@ jobs: with: fetch-depth: 0 repository: apache/spark -ref: master +ref: branch-3.3 - name: Sync the current branch with the latest in Apache Spark if: github.repository != 'apache/spark' run: | @@ -687,7 +673,7 @@ jobs: with: fetch-depth: 0 repository: apache/spark -ref: master +ref: branch-3.3 - name: Sync the current branch with the latest in Apache Spark if: 
github.repository != 'apache/spark' run: | @@ -786,7 +772,7 @@ jobs: with: fetch-depth: 0 repository: apache/spark -ref: master +ref: branch-3.3 - name: Sync the current branch with the latest in Ap
[spark] branch master updated (8193b40 -> 1b41416)
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 8193b40 [SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.4 add 1b41416 [SPARK-38106][SQL] Use error classes in the parsing errors of functions No new revisions were added by this update. Summary of changes: .../spark/sql/errors/QueryParsingErrors.scala | 27 ++-- .../spark/sql/errors/QueryParsingErrorsSuite.scala | 172 + .../spark/sql/execution/command/DDLSuite.scala | 51 -- 3 files changed, 189 insertions(+), 61 deletions(-)
[spark] branch branch-3.2 updated: [SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.4
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.2 by this push: new 6990320 [SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.4 6990320 is described below commit 69903200845b68a0474ecb0a3317dc744490c521 Author: Hyukjin Kwon AuthorDate: Wed Mar 16 18:20:50 2022 +0900 [SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.4 This PR upgrades to Py4J 0.10.9.4, with relevant documentation changes. Py4J 0.10.9.3 has a resource leak issue when pinned thread mode is enabled - it's enabled by default in PySpark at https://github.com/apache/spark/commit/41af409b7bcfe1b3960274c0b3085bcc1f9d1c98. We worked around this by requiring users to use `InheritableThread` or `inheritable_thread_target` as a workaround. After upgrading, we don't need to require this anymore because Py4J cleans up automatically, see also https://github.com/py4j/py4j/pull/471 Yes, users no longer have to use `InheritableThread` or `inheritable_thread_target` to avoid the resource leak problem. CI in this PR should test it out. Closes #35871 from HyukjinKwon/SPARK-38563. 
Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon (cherry picked from commit 8193b405f02f867439dd2d2017bf7b3c814b5cc8) Signed-off-by: Hyukjin Kwon --- bin/pyspark| 2 +- bin/pyspark2.cmd | 2 +- core/pom.xml | 2 +- .../org/apache/spark/api/python/PythonUtils.scala | 2 +- dev/deps/spark-deps-hadoop-2.7-hive-2.3| 2 +- dev/deps/spark-deps-hadoop-3.2-hive-2.3| 2 +- docs/job-scheduling.md | 2 +- python/docs/Makefile | 2 +- python/docs/make2.bat | 2 +- python/docs/source/getting_started/install.rst | 2 +- python/lib/py4j-0.10.9.3-src.zip | Bin 42021 -> 0 bytes python/lib/py4j-0.10.9.4-src.zip | Bin 0 -> 42404 bytes python/pyspark/context.py | 6 ++-- python/pyspark/util.py | 33 - python/setup.py| 2 +- sbin/spark-config.sh | 2 +- 16 files changed, 20 insertions(+), 43 deletions(-) diff --git a/bin/pyspark b/bin/pyspark index 4840589..1e16c56 100755 --- a/bin/pyspark +++ b/bin/pyspark @@ -50,7 +50,7 @@ export PYSPARK_DRIVER_PYTHON_OPTS # Add the PySpark classes to the Python path: export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH" -export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.3-src.zip:$PYTHONPATH" +export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.4-src.zip:$PYTHONPATH" # Load the PySpark shell.py script when ./pyspark is used interactively: export OLD_PYTHONSTARTUP="$PYTHONSTARTUP" diff --git a/bin/pyspark2.cmd b/bin/pyspark2.cmd index a19627a..f20c320 100644 --- a/bin/pyspark2.cmd +++ b/bin/pyspark2.cmd @@ -30,7 +30,7 @@ if "x%PYSPARK_DRIVER_PYTHON%"=="x" ( ) set PYTHONPATH=%SPARK_HOME%\python;%PYTHONPATH% -set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.9.3-src.zip;%PYTHONPATH% +set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.9.4-src.zip;%PYTHONPATH% set OLD_PYTHONSTARTUP=%PYTHONSTARTUP% set PYTHONSTARTUP=%SPARK_HOME%\python\pyspark\shell.py diff --git a/core/pom.xml b/core/pom.xml index 3833794..94b3e58 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -433,7 +433,7 @@ net.sf.py4j py4j - 0.10.9.3 + 0.10.9.4 org.apache.spark diff --git 
a/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala b/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala index 8daba86..a9c35369 100644 --- a/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala +++ b/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala @@ -27,7 +27,7 @@ import org.apache.spark.SparkContext import org.apache.spark.api.java.{JavaRDD, JavaSparkContext} private[spark] object PythonUtils { - val PY4J_ZIP_NAME = "py4j-0.10.9.3-src.zip" + val PY4J_ZIP_NAME = "py4j-0.10.9.4-src.zip" /** Get the PYTHONPATH for PySpark, either from SPARK_HOME, if it is set, or from our JAR */ def sparkPythonPath: String = { diff --git a/dev/deps/spark-deps-hadoop-2.7-hive-2.3 b/dev/deps/spark-deps-hadoop-2.7-hive-2.3 index c2882bd..742710e 100644 --- a/dev/deps/spark-deps-hadoop-2.7-hive-2.3 +++ b/dev/deps/spark-deps-hadoop-2.7-hive-2.3 @@ -208,7 +208,7 @@ parquet-format-structures/1.12.2//parquet-format-structures-1.12.2.jar parquet-hadoop/1.12.2//parquet-hadoop-1.12.2.jar parquet-jackson/1.12.2//parquet-jackson-1.12.2.jar protobuf-java/2.5.0//protobuf-java-2.5.0.jar -py4j/0.10.9.3//py4j-0.10.9.3.j
[spark] branch master updated (8476c8b -> 8193b40)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 8476c8b [SPARK-38542][SQL] UnsafeHashedRelation should serialize numKeys out add 8193b40 [SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.4 No new revisions were added by this update. Summary of changes: bin/pyspark| 2 +- bin/pyspark2.cmd | 2 +- core/pom.xml | 2 +- .../org/apache/spark/api/python/PythonUtils.scala | 2 +- dev/deps/spark-deps-hadoop-2-hive-2.3 | 2 +- dev/deps/spark-deps-hadoop-3-hive-2.3 | 2 +- docs/job-scheduling.md | 2 +- python/docs/Makefile | 2 +- python/docs/make2.bat | 2 +- python/docs/source/getting_started/install.rst | 2 +- python/lib/py4j-0.10.9.3-src.zip | Bin 42021 -> 0 bytes python/lib/py4j-0.10.9.4-src.zip | Bin 0 -> 42404 bytes python/pyspark/context.py | 6 ++-- python/pyspark/util.py | 35 +++-- python/setup.py| 2 +- sbin/spark-config.sh | 2 +- 16 files changed, 20 insertions(+), 45 deletions(-) delete mode 100644 python/lib/py4j-0.10.9.3-src.zip create mode 100644 python/lib/py4j-0.10.9.4-src.zip - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
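Several of the files in the summary above (bin/pyspark, spark-config.sh) only bump the bundled Py4J source zip on PYTHONPATH. A minimal sketch of that wiring, with SPARK_HOME set to a placeholder path for illustration:

```shell
# Illustrative only: /opt/spark is a placeholder install location.
SPARK_HOME="/opt/spark"
# Same two lines bin/pyspark exports, with the zip name this commit updates:
export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH"
export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.4-src.zip:$PYTHONPATH"
echo "$PYTHONPATH"
```

A stale py4j-0.10.9.3 zip left on PYTHONPATH would shadow the upgrade, which is why every launcher script in the diffstat is touched.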
[spark] branch branch-3.3 updated: [SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.4
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.3 by this push: new 3bbf346 [SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.4 3bbf346 is described below commit 3bbf346d9ca984faa0c3e67cd1387a13b2bd1e37 Author: Hyukjin Kwon AuthorDate: Wed Mar 16 18:20:50 2022 +0900 [SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.4 ### What changes were proposed in this pull request? This PR upgrades to Py4J 0.10.9.4, with relevant documentation changes. ### Why are the changes needed? Py4J 0.10.9.3 has a resource leak issue when pinned thread mode is enabled - it's enabled by default in PySpark at https://github.com/apache/spark/commit/41af409b7bcfe1b3960274c0b3085bcc1f9d1c98. We worked around this by requiring users to use `InheritableThread` or `inheritable_thread_target`. After upgrading, users no longer need to do so because Py4J automatically cleans up; see also https://github.com/py4j/py4j/pull/471 ### Does this PR introduce _any_ user-facing change? Yes, users no longer have to use `InheritableThread` or `inheritable_thread_target` to avoid the resource leak. ### How was this patch tested? CI in this PR should test it out. Closes #35871 from HyukjinKwon/SPARK-38563. 
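The workaround the commit message refers to wraps a thread target so the child thread inherits state from the parent. The snippet below is a conceptual sketch of that idea in pure Python, not PySpark's implementation: the `_local` store, `inheritable_thread_target_sketch`, and the `spark.scheduler.pool` value are illustrative assumptions standing in for the real `pyspark.inheritable_thread_target`, which propagates the parent's local job properties to the spawned thread.

```python
import threading

_local = threading.local()  # stand-in for per-thread job properties

def inheritable_thread_target_sketch(f):
    # Capture the parent thread's properties when the wrapper is made ...
    parent_props = dict(getattr(_local, "props", {}))

    def wrapped(*args, **kwargs):
        # ... and install a copy in the child thread before running f.
        _local.props = dict(parent_props)
        return f(*args, **kwargs)

    return wrapped

# Parent (main) thread sets a property, then spawns a worker.
_local.props = {"spark.scheduler.pool": "etl"}
result = {}

def job():
    # The child sees the parent's property via the installed copy.
    result["pool"] = _local.props.get("spark.scheduler.pool")

t = threading.Thread(target=inheritable_thread_target_sketch(job))
t.start()
t.join()
print(result["pool"])  # etl
```

After this upgrade the wrapper is no longer required just to avoid the connection leak; it remains useful when a child thread should inherit the parent's Spark job properties.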
Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon (cherry picked from commit 8193b405f02f867439dd2d2017bf7b3c814b5cc8) Signed-off-by: Hyukjin Kwon --- bin/pyspark| 2 +- bin/pyspark2.cmd | 2 +- core/pom.xml | 2 +- .../org/apache/spark/api/python/PythonUtils.scala | 2 +- dev/deps/spark-deps-hadoop-2-hive-2.3 | 2 +- dev/deps/spark-deps-hadoop-3-hive-2.3 | 2 +- docs/job-scheduling.md | 2 +- python/docs/Makefile | 2 +- python/docs/make2.bat | 2 +- python/docs/source/getting_started/install.rst | 2 +- python/lib/py4j-0.10.9.3-src.zip | Bin 42021 -> 0 bytes python/lib/py4j-0.10.9.4-src.zip | Bin 0 -> 42404 bytes python/pyspark/context.py | 6 ++-- python/pyspark/util.py | 35 +++-- python/setup.py| 2 +- sbin/spark-config.sh | 2 +- 16 files changed, 20 insertions(+), 45 deletions(-) diff --git a/bin/pyspark b/bin/pyspark index 4840589..1e16c56 100755 --- a/bin/pyspark +++ b/bin/pyspark @@ -50,7 +50,7 @@ export PYSPARK_DRIVER_PYTHON_OPTS # Add the PySpark classes to the Python path: export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH" -export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.3-src.zip:$PYTHONPATH" +export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.4-src.zip:$PYTHONPATH" # Load the PySpark shell.py script when ./pyspark is used interactively: export OLD_PYTHONSTARTUP="$PYTHONSTARTUP" diff --git a/bin/pyspark2.cmd b/bin/pyspark2.cmd index a19627a..f20c320 100644 --- a/bin/pyspark2.cmd +++ b/bin/pyspark2.cmd @@ -30,7 +30,7 @@ if "x%PYSPARK_DRIVER_PYTHON%"=="x" ( ) set PYTHONPATH=%SPARK_HOME%\python;%PYTHONPATH% -set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.9.3-src.zip;%PYTHONPATH% +set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.9.4-src.zip;%PYTHONPATH% set OLD_PYTHONSTARTUP=%PYTHONSTARTUP% set PYTHONSTARTUP=%SPARK_HOME%\python\pyspark\shell.py diff --git a/core/pom.xml b/core/pom.xml index 9d3b170..953c76b 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -423,7 +423,7 @@ net.sf.py4j py4j - 0.10.9.3 + 0.10.9.4 org.apache.spark diff --git 
a/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala b/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala index 8daba86..a9c35369 100644 --- a/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala +++ b/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala @@ -27,7 +27,7 @@ import org.apache.spark.SparkContext import org.apache.spark.api.java.{JavaRDD, JavaSparkContext} private[spark] object PythonUtils { - val PY4J_ZIP_NAME = "py4j-0.10.9.3-src.zip" + val PY4J_ZIP_NAME = "py4j-0.10.9.4-src.zip" /** Get the PYTHONPATH for PySpark, either from SPARK_HOME, if it is set, or from our JAR */ def sparkPythonPath: String = { diff --git a/dev/deps/spark-deps-hadoop-2-hive-2.3 b/dev/deps/spark-deps-hadoop-2-hive-2.3 index bcbf8b9..f2db663 100644 --- a/dev/deps/spark-deps-hadoop-2-hive-2.3 +++ b/dev/deps/spark-deps-hadoop-2-hive-2.3 @@ -233,7 +233,7 @@ parquet-hadoop/1.12.2//parquet-hadoop-1.12.2.jar parq