[GitHub] [spark] JoshRosen commented on a diff in pull request #36683: [SPARK-39301][SQL][PYTHON] Leverage LocalRelation in createDataFrame with Arrow optimization

GitBox Thu, 26 May 2022 15:04:39 -0700


JoshRosen commented on code in PR #36683:
URL: https://github.com/apache/spark/pull/36683#discussion_r883115780



##########
python/pyspark/sql/pandas/conversion.py:
##########
@@ -613,16 +613,16 @@ def _create_from_pandas_with_arrow(
 
         @no_type_check
         def reader_func(temp_filename):
-            return 
self._jvm.PythonSQLUtils.readArrowStreamFromFile(jsparkSession, temp_filename)
+            return 
self._jvm.PythonSQLUtils.readArrowStreamFromFile(temp_filename)
 
         @no_type_check
         def create_RDD_server():
-            return self._jvm.ArrowRDDServer(jsparkSession)
+            return self._jvm.ArrowIteratorServer()
 
         # Create Spark DataFrame from Arrow stream file, using one batch per 
partition
-        jrdd = self._sc._serialize_to_jvm(arrow_data, ser, reader_func, 
create_RDD_server)
+        jiter = self._sc._serialize_to_jvm(arrow_data, ser, reader_func, 
create_RDD_server)

Review Comment:
   Here it looks like we have changed the type of `reader_func`: before it was 
a function from filename to `JavaRDD[Array[Byte]]`, whereas now it's a function 
from filename to `Iterator[Array[Byte]]`.
   
   I think this breaks the current interface contract of `_serialize_to_jvm` in 
a way that will cause problems if IO encryption is enabled. According to [that 
method's 
docs](https://github.com/apache/spark/blob/dbde77856d2e51ff502a7fc1dba7f10316c2211b/python/pyspark/context.py#L677),
 `reader_func` is only used when IO encryption is disabled (which is the 
default configuration). If IO encryption is enabled then this codepath will use 
the `PythonParallelizeServer` codepath which will still produce a 
`JavaRDD[Array[Byte]]`.
   
   To fix this, we might need to change the other pieces of this interface so 
that they all pass around `Iterator[Array[Byte]]` (in other words, modify the 
regular Python parallelize codepath so that its implementation is closer to 
what you have here).
   
   Good catch about the performance risks of passing `Array[Array[Byte]]` back 
to Python via Py4J, BTW. Do you know if `ArrayBuffer` has the same issue? I'm 
wondering whether we could pass `ArrayBuffer[Array[Byte]]` if we want to ensure 
that the returned object is still considered `Iterable` from Python.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] JoshRosen commented on a diff in pull request #36683: [SPARK-39301][SQL][PYTHON] Leverage LocalRelation in createDataFrame with Arrow optimization

Reply via email to