Norbert Schultz created SPARK-34115:
---------------------------------------
Summary: Long runtime on many environment variables
Key: SPARK-34115
URL: https://issues.apache.org/jira/browse/SPARK-34115
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.4.0
Environment: Spark 2.4.0 local[2] on a Kubernetes Pod
Reporter: Norbert Schultz
I am not sure if this is a bug report or a feature request. The code is the
same in current versions of Spark, and maybe this ticket saves someone some
debugging time.
We migrated some older code to Spark 2.4.0, and suddenly the unit tests on our
build machine were much slower than expected.
On local machines they ran perfectly.
In the end it turned out that Spark was wasting CPU cycles during DataFrame
analysis in the following functions:
* AnalysisHelper.assertNotAnalysisRule, which calls
* Utils.isTesting
Utils.isTesting traverses all environment variables on every call.
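In Scala, sys.env is not a cheap view of the process environment: every access rebuilds an immutable Map from System.getenv(). A paraphrased sketch of the hot path (not the literal Spark source, but close to what Spark 2.4 does):

{code:scala}
// Paraphrased sketch of Utils.isTesting, not the literal Spark source.
// scala.sys.env is defined in the standard library roughly as
//   def env: Map[String, String] = Map(System.getenv().asScala.toSeq: _*)
// i.e. every access copies ALL environment variables into a fresh map.
def isTesting: Boolean = {
  // With 3000+ environment variables, every call pays for a full
  // O(n) copy of the environment before the contains() lookup.
  sys.env.contains("SPARK_TESTING") || sys.props.contains("spark.testing")
}
{code}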
The offending build machine was a Kubernetes Pod which automatically exposed
all services as environment variables, so it had more than 3000 environment
variables.
This is made worse by the fact that Utils.isTesting is called very often
through AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown
and transformUp).
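The per-call cost is easy to reproduce outside of Spark with a minimal standalone benchmark (the iteration count is chosen arbitrarily); on a machine with thousands of environment variables the slowdown is clearly visible:

{code:scala}
// Standalone micro-benchmark: the time per lookup grows with the
// number of environment variables, because each sys.env access
// rebuilds a map of the entire environment.
object EnvLookupBench {
  def main(args: Array[String]): Unit = {
    val iterations = 100000
    var hits = 0
    val start = System.nanoTime()
    var i = 0
    while (i < iterations) {
      if (sys.env.contains("SPARK_TESTING")) hits += 1
      i += 1
    }
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(s"$iterations sys.env lookups took $elapsedMs ms (hits=$hits)")
  }
}
{code}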
Of course we will restrict the number of environment variables on our side; on
the other hand, Utils.isTesting could also cache the result of
sys.env.contains("SPARK_TESTING") in a lazy val so that the check does not
stay that expensive (see the sketch below).