[spark] branch master updated: [SPARK-37319][K8S][FOLLOWUP] Set JAVA_HOME for Java 17 installed by apt-get

2021-11-28, by dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a3886ba  [SPARK-37319][K8S][FOLLOWUP] Set JAVA_HOME for Java 17 
installed by apt-get
a3886ba is described below

commit a3886ba976469bef0dfafc3da8686a53c5a59d95
Author: Kousuke Saruta 
AuthorDate: Sun Nov 28 21:44:42 2021 -0800

[SPARK-37319][K8S][FOLLOWUP] Set JAVA_HOME for Java 17 installed by apt-get

### What changes were proposed in this pull request?

This PR adds a configuration to `Dockerfile.java17` to set the environment 
variable `JAVA_HOME` for Java 17 installed by apt-get.

### Why are the changes needed?

In `entrypoint.sh`, `${JAVA_HOME}/bin/java` is used, but the container image built
from `Dockerfile.java17` does not set that environment variable.
As a result, executors can't launch.
```
+ CMD=(${JAVA_HOME}/bin/java "${SPARK_EXECUTOR_JAVA_OPTS[@]}" 
-Xms$SPARK_EXECUTOR_MEMORY -Xmx$SPARK_EXECUTOR_MEMORY -cp 
"$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH" 
org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBackend --driver-url 
$SPARK_DRIVER_URL --executor-id $SPARK_EXECUTOR_ID --cores 
$SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname 
$SPARK_EXECUTOR_POD_IP --resourceProfileId $SPARK_RESOURCE_PROFILE_ID --podName 
$SPARK_EXECUTOR_POD_NAME)
+ exec /usr/bin/tini -s -- /bin/java -XX:+IgnoreUnrecognizedVMOptions 
--add-opens=java.base/java.lang=ALL-UNNAMED 
--add-opens=java.base/java.lang.invoke=ALL-UNNAMED 
--add-opens=java.base/java.lang.reflect=ALL-UNNAMED 
--add-opens=java.base/java.io=ALL-UNNAMED 
--add-opens=java.base/java.net=ALL-UNNAMED 
--add-opens=java.base/java.nio=ALL-UNNAMED 
--add-opens=java.base/java.util=ALL-UNNAMED 
--add-opens=java.base/java.util.concurrent=ALL-UNNAMED 
--add-opens=java.base/java.util.concurrent.at [...]
[FATAL tini (15)] exec /bin/java failed: No such file or directory
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Confirmed that the following simple job runs successfully with a container image
built from the modified `Dockerfile.java17`.
```
$ bin/spark-shell --master k8s://https://<host>:<port> --conf spark.kubernetes.container.image=spark:<tag>
scala> spark.range(10).show
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+
```

Closes #34722 from sarutak/java17-home-kube.

Authored-by: Kousuke Saruta 
Signed-off-by: Dongjoon Hyun 
---
 .../kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17   | 1 +
 1 file changed, 1 insertion(+)

diff --git 
a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17
 
b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17
index f9ab64e..96dd6c9 100644
--- 
a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17
+++ 
b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17
@@ -51,6 +51,7 @@ COPY kubernetes/tests /opt/spark/tests
 COPY data /opt/spark/data
 
 ENV SPARK_HOME /opt/spark
+ENV JAVA_HOME /usr/lib/jvm/java-17-openjdk-amd64/
 
 WORKDIR /opt/spark/work-dir
 RUN chmod g+w /opt/spark/work-dir




[spark] branch master updated (db9a982 -> e91ef19)

2021-11-28, by gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from db9a982  [SPARK-37461][YARN] YARN-CLIENT mode client.appId is always 
null
 add e91ef19  [SPARK-37443][PYTHON] Provide a profiler for Python/Pandas 
UDFs

No new revisions were added by this update.

Summary of changes:
 dev/sparktestsupport/modules.py|  1 +
 python/docs/source/development/debugging.rst   | 56 -
 python/pyspark/context.py  | 10 ++-
 python/pyspark/context.pyi |  1 +
 python/pyspark/profiler.py | 45 +--
 python/pyspark/profiler.pyi| 17 +++-
 .../tests/test_udf_profiler.py}| 91 +++---
 python/pyspark/sql/udf.py  | 32 ++--
 .../sql/catalyst/expressions/Expression.scala  | 11 +++
 .../spark/sql/catalyst/expressions/PythonUDF.scala | 20 -
 .../catalyst/expressions/namedExpressions.scala| 10 ---
 .../apache/spark/sql/catalyst/util/package.scala   |  1 +
 .../apache/spark/sql/IntegratedUDFTestUtils.scala  | 35 -
 13 files changed, 251 insertions(+), 79 deletions(-)
 copy python/pyspark/{tests/test_profiler.py => sql/tests/test_udf_profiler.py} 
(52%)




[spark] branch master updated: [SPARK-37461][YARN] YARN-CLIENT mode client.appId is always null

2021-11-28, by srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new db9a982  [SPARK-37461][YARN] YARN-CLIENT mode client.appId is always 
null
db9a982 is described below

commit db9a982a1441810314be07e2c3b7cc77d1f1
Author: Angerszh 
AuthorDate: Sun Nov 28 08:53:25 2021 -0600

[SPARK-37461][YARN] YARN-CLIENT mode client.appId is always null

### What changes were proposed in this pull request?
In yarn-client mode, the `Client.appId` variable is never assigned, so it is always
`null`; in cluster mode, it is assigned the real value. With this patch, we assign
the real application id to `appId` in client mode as well.

### Why are the changes needed?

1. Refactors the code so that each method no longer defines its own local id; they
can all use this single variable (see the sketch after this list).
2. In client mode, users can read this value to get the application id.
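
For illustration only, here is a minimal, self-contained sketch of the pattern this
change moves to. `ClientSketch`, the fake application id, and the proxy URL are
invented for the example; this is not Spark's actual `Client` code.
```
// Sketch: one `appId` member, assigned exactly once in submitApplication()
// and reused by every later step, instead of each method keeping a local id.
class ClientSketch {
  // Populated by submitApplication(); None until the app has been submitted.
  private var appId: Option[String] = None

  def submitApplication(): String = {
    // Hypothetical stand-in for the id returned by the YARN ResourceManager.
    val newId = s"application_${System.currentTimeMillis()}_0001"
    appId = Some(newId)   // corresponds to `this.appId = ...` in the patch
    newId
  }

  def run(): Unit = {
    submitApplication()   // assigns appId as a side effect
    // Later steps read the same member, e.g. to build a tracking URL.
    println(s"http://internal-proxy-server/proxy?applicationId=${appId.get}")
  }
}

// Usage: new ClientSketch().run()
```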

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually tested.

We have an internal proxy server that replaces the YARN tracking URL and uses
`appId`; with this patch it is no longer null.

```
21/11/26 12:38:44 INFO Client:
 client token: N/A
 diagnostics: AM container is launched, waiting for AM container to 
Register with RM
 ApplicationMaster host: N/A
 ApplicationMaster RPC port: -1
 queue: user_queue
 start time: 1637901520956
 final status: UNDEFINED
 tracking URL: 
http://internal-proxy-server/proxy?applicationId=application_1635856758535_4209064
 user: user_name
```

Closes #34710 from AngersZh/SPARK-37461.

Authored-by: Angerszh 
Signed-off-by: Sean Owen 
---
 .../main/scala/org/apache/spark/deploy/yarn/Client.scala| 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git 
a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
 
b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
index 7787e2f..e6136fc 100644
--- 
a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
+++ 
b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
@@ -169,7 +169,6 @@ private[spark] class Client(
   def submitApplication(): ApplicationId = {
 ResourceRequestHelper.validateResources(sparkConf)
 
-var appId: ApplicationId = null
 try {
   launcherBackend.connect()
   yarnClient.init(hadoopConf)
@@ -181,7 +180,7 @@ private[spark] class Client(
   // Get a new application from our RM
   val newApp = yarnClient.createApplication()
   val newAppResponse = newApp.getNewApplicationResponse()
-  appId = newAppResponse.getApplicationId()
+  this.appId = newAppResponse.getApplicationId()
 
   // The app staging dir based on the STAGING_DIR configuration if 
configured
   // otherwise based on the users home directory.
@@ -207,8 +206,7 @@ private[spark] class Client(
   yarnClient.submitApplication(appContext)
   launcherBackend.setAppId(appId.toString)
   reportLauncherState(SparkAppHandle.State.SUBMITTED)
-
-  appId
+  this.appId
 } catch {
   case e: Throwable =>
 if (stagingDirPath != null) {
@@ -915,7 +913,6 @@ private[spark] class Client(
   private def createContainerLaunchContext(newAppResponse: 
GetNewApplicationResponse)
 : ContainerLaunchContext = {
 logInfo("Setting up container launch context for our AM")
-val appId = newAppResponse.getApplicationId
 val pySparkArchives =
   if (sparkConf.get(IS_PYTHON_APP)) {
 findPySparkArchives()
@@ -971,7 +968,7 @@ private[spark] class Client(
 if (isClusterMode) {
   sparkConf.get(DRIVER_JAVA_OPTIONS).foreach { opts =>
 javaOpts ++= Utils.splitCommandString(opts)
-  .map(Utils.substituteAppId(_, appId.toString))
+  .map(Utils.substituteAppId(_, this.appId.toString))
   .map(YarnSparkHadoopUtil.escapeForShell)
   }
   val libraryPaths = Seq(sparkConf.get(DRIVER_LIBRARY_PATH),
@@ -996,7 +993,7 @@ private[spark] class Client(
   throw new SparkException(msg)
 }
 javaOpts ++= Utils.splitCommandString(opts)
-  .map(Utils.substituteAppId(_, appId.toString))
+  .map(Utils.substituteAppId(_, this.appId.toString))
   .map(YarnSparkHadoopUtil.escapeForShell)
   }
   sparkConf.get(AM_LIBRARY_PATH).foreach { paths =>
@@ -1269,7 +1266,7 @@ private[spark] class Client(
* throw an appropriate SparkException.
*/
   def run(): Unit = {
-this.appId = submitApplication()
+submitApplication()
 if (!launcherBackend.isConnected() && fireAndForget) {
   val report = getApplicationReport(appId)
   val state =