jbampton commented on code in PR #1639:
URL: https://github.com/apache/sedona/pull/1639#discussion_r1799804104
##########
python/sedona/sql/dataframe_api.py:
##########
@@ -24,8 +24,23 @@
from pyspark.sql import Column, SparkSession
from pyspark.sql import functions as f
-ColumnOrName = Union[Column, str]
-ColumnOrNameOrNumber = Union[Column, str, float, int]
+try:
+ from pyspark.sql.connect.column import Column as ConnectColumn
+ from pyspark.sql.utils import is_remote
+except ImportError:
+ # be backwards compatible with spark < 3.4
Review Comment:
```suggestion
# be backwards compatible with Spark < 3.4
```
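For context, the backwards-compatibility pattern under discussion can be sketched as below. This is a sketch, not the PR's exact code: the outer try/except around the pyspark import and the `is_remote` fallback body are assumptions added so the snippet runs with or without PySpark installed.

```python
from typing import Union

try:
    from pyspark.sql import Column
except ImportError:  # assumption: stand-in so this sketch runs without pyspark
    class Column: ...

try:
    # available from Spark 3.4 onward (Spark Connect)
    from pyspark.sql.connect.column import Column as ConnectColumn
    from pyspark.sql.utils import is_remote
except ImportError:
    # be backwards compatible with Spark < 3.4: alias the classic Column
    # so isinstance checks keep working, and report a non-remote session
    ConnectColumn = Column

    def is_remote() -> bool:
        return False

ColumnOrName = Union[Column, str]
ColumnOrNameOrNumber = Union[Column, str, float, int]
```

Aliasing `ConnectColumn = Column` in the fallback keeps later `isinstance(x, (Column, ConnectColumn))` checks valid on older Spark versions without scattering version checks through the call sites.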
##########
python/sedona/spark/SedonaContext.py:
##########
@@ -34,8 +41,11 @@ def create(cls, spark: SparkSession) -> SparkSession:
:return: SedonaContext which is an instance of SparkSession
"""
spark.sql("SELECT 1 as geom").count()
- PackageImporter.import_jvm_lib(spark._jvm)
- spark._jvm.SedonaContext.create(spark._jsparkSession, "python")
+
+ # with spark connect there is no local jvm
Review Comment:
```suggestion
# with Spark Connect there is no local JVM
```
##########
python/sedona/sql/dataframe_api.py:
##########
@@ -86,6 +103,10 @@ def _get_type_list(annotated_type: Type) -> Tuple[Type, ...]:
else:
valid_types = (annotated_type,)
+ # functions accepting a Column should also accept the spark connect sort of Column
Review Comment:
```suggestion
# functions accepting a Column should also accept the Spark Connect sort of Column
```
##########
.github/workflows/python.yml:
##########
@@ -153,3 +153,20 @@ jobs:
SPARK_VERSION: ${{ matrix.spark }}
HADOOP_VERSION: ${{ matrix.hadoop }}
run: (export SPARK_HOME=$PWD/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION};export PYTHONPATH=$SPARK_HOME/python;cd python;pipenv run pytest tests)
+ - env:
+ SPARK_VERSION: ${{ matrix.spark }}
+ HADOOP_VERSION: ${{ matrix.hadoop }}
+ run: |
if [ ! -f "spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}/sbin/start-connect-server.sh" ]
+ then
+ echo "Skipping connect tests for spark $SPARK_VERSION"
Review Comment:
```suggestion
echo "Skipping connect tests for Spark $SPARK_VERSION"
```
##########
python/sedona/sql/dataframe_api.py:
##########
@@ -49,13 +64,15 @@ def call_sedona_function(
)
# apparently a Column is an Iterable so we need to check for it explicitly
- if (
- (not isinstance(args, Iterable))
- or isinstance(args, str)
- or isinstance(args, Column)
+ if (not isinstance(args, Iterable)) or isinstance(
+ args, (str, Column, ConnectColumn)
):
args = [args]
+ # in spark-connect environments use connect api
Review Comment:
```suggestion
# in spark-connect environments use connect API
```
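The dispatch being reviewed here, using the connect API when attached to a Spark Connect server and the local JVM otherwise, can be sketched as below. The dispatcher and its return values are hypothetical, and `is_remote` is a rough stand-in for `pyspark.sql.utils.is_remote`, which keys off the `SPARK_REMOTE` environment variable.

```python
import os

def is_remote() -> bool:
    # rough stand-in for pyspark.sql.utils.is_remote: a Spark Connect
    # session is signalled by the SPARK_REMOTE environment variable
    return "SPARK_REMOTE" in os.environ

def call_sedona_function(function_name: str, *args):
    # hypothetical dispatcher: in spark-connect environments use the
    # connect API, since there is no local JVM to reach through py4j
    if is_remote():
        return ("connect", function_name, args)
    return ("jvm", function_name, args)
```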
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]