HyukjinKwon commented on code in PR #46298:
URL: https://github.com/apache/spark/pull/46298#discussion_r1584288878


##########
.github/workflows/build_python_connect35.yml:
##########
@@ -0,0 +1,135 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+
+name: Build / Spark Connect Python-only (master-server, 35-client, Python 3.11)
+
+on:
+  push:
+    branches:
+    - '**'
+
+jobs:
+  # Build: build Spark and run the tests for specified modules using SBT
+  build:
+    name: "Build modules: pyspark-connect"
+    runs-on: ubuntu-latest
+    timeout-minutes: 300
+    steps:
+      - name: Checkout Spark repository
+        uses: actions/checkout@v4
+      - name: Cache SBT and Maven
+        uses: actions/cache@v4
+        with:
+          path: |
+            build/apache-maven-*
+            build/*.jar
+            ~/.sbt
+          key: build-spark-connect-python-only-${{ hashFiles('**/pom.xml', 'project/build.properties', 'build/mvn', 'build/sbt', 'build/sbt-launch-lib.bash', 'build/spark-build-info') }}
+          restore-keys: |
+            build-spark-connect-python-only-
+      - name: Cache Coursier local repository
+        uses: actions/cache@v4
+        with:
+          path: ~/.cache/coursier
+          key: coursier-build-spark-connect-python-only-${{ hashFiles('**/pom.xml') }}
+          restore-keys: |
+            coursier-build-spark-connect-python-only-
+      - name: Install Java 17
+        uses: actions/setup-java@v4
+        with:
+          distribution: zulu
+          java-version: 17
+      - name: Install Python 3.11
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.11'
+          architecture: x64
+      - name: Build Spark
+        run: |
+          ./build/sbt -Phive Test/package
+      - name: Install Python dependencies
+        run: |
+          pip install -r dev/requirements.txt
+      - name: Run tests
+        env:
+          SPARK_TESTING: 1
+          SPARK_CONNECT_TESTING_REMOTE: sc://localhost
+        run: |
+          # Make less noisy
+          cp conf/log4j2.properties.template conf/log4j2.properties
+          sed -i 's/rootLogger.level = info/rootLogger.level = warn/g' conf/log4j2.properties
+
+          # Start a Spark Connect server locally
+          PYTHONPATH="python/lib/pyspark.zip:python/lib/py4j-0.10.9.7-src.zip:$PYTHONPATH" ./sbin/start-connect-server.sh \
+            --driver-java-options "-Dlog4j.configurationFile=file:$GITHUB_WORKSPACE/conf/log4j2.properties" \
+            --jars "`find connector/connect/server/target -name spark-connect-*SNAPSHOT.jar`,`find connector/protobuf/target -name spark-protobuf-*SNAPSHOT.jar`,`find connector/avro/target -name spark-avro*SNAPSHOT.jar`"
+
+          PYTHONPATH="python/lib/pyspark.zip:python/lib/py4j-0.10.9.7-src.zip:$PYTHONPATH" python -c "from pyspark.sql import SparkSession; _ = SparkSession.builder.remote('sc://localhost').getOrCreate().range(100).repartition(100).mapInPandas(lambda x: x, 'id INT').collect()"
+
+          # Check out branch-3.5 to run the tests from branch-3.5.
+          git checkout branch-3.5
+
+          # Move the zipped Py4J and PySpark libraries aside to make sure the Spark 3.5 ones from the source are used.
+          mv python/lib lib.back
+          mv python/pyspark pyspark.back
+
+          # Several tests related to the catalog require running sequentially, e.g., writing a table in a listener.
+          # Run branch-3.5 tests
+          ./python/run-tests --parallelism=1 --python-executables=python3 --modules pyspark-connect,pyspark-ml-connect
+          # None of the Pandas API on Spark tests depend on each other, so run them in parallel
+          ./python/run-tests --parallelism=4 --python-executables=python3 --modules pyspark-pandas-connect-part0,pyspark-pandas-connect-part1,pyspark-pandas-connect-part2,pyspark-pandas-connect-part3

Review Comment:
   Yeah, it's a good point. For now I did that by moving `py4j` out of the Python path (it is very hacky and hard to follow), but it would be great to have a better way.


