Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/21546#discussion_r213818193
--- Diff: python/pyspark/context.py ---
@@ -494,10 +494,14 @@ def f(split, iterator):
             c = list(c)    # Make it a list so we can compute its length
         batchSize = max(1, min(len(c) // numSlices, self._batchSize or 1024))
         serializer = BatchedSerializer(self._unbatched_serializer, batchSize)
-        jrdd = self._serialize_to_jvm(c, numSlices, serializer)
+
+        def reader_func(temp_filename):
+            return self._jvm.PythonRDD.readRDDFromFile(self._jsc, temp_filename, numSlices)
+
+        jrdd = self._serialize_to_jvm(c, serializer, reader_func)
         return RDD(jrdd, self, serializer)

-    def _serialize_to_jvm(self, data, parallelism, serializer):
+    def _serialize_to_jvm(self, data, serializer, reader_func):
--- End diff ---
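For anyone following along, here is a rough sketch of the shape the refactor above takes: the helper serializes the data to a temp file and then hands the filename to whatever `reader_func` the caller supplied, instead of hard-coding how the JVM loads it. The body below is an illustrative approximation, not the exact patch:

```python
import os
from tempfile import NamedTemporaryFile

def _serialize_to_jvm(self, data, serializer, reader_func):
    # Sketch of the refactored SparkContext helper: dump the data to a temp
    # file with the given serializer, then let the caller-supplied reader_func
    # decide how the JVM reads it back (e.g. PythonRDD.readRDDFromFile).
    temp_file = NamedTemporaryFile(delete=False, dir=self._temp_dir)
    try:
        serializer.dump_stream(data, temp_file)
        temp_file.close()
        return reader_func(temp_file.name)
    finally:
        os.unlink(temp_file.name)
```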
Hey @squito, yes, that's correct: this is in the path that the `ArrowTests` for
`createDataFrame` exercise. Those tests are skipped if pyarrow is not installed,
but in our Jenkins builds pyarrow is installed under the Python 3.5 env, so they
do get run there.
It's a little subtle to confirm that they ran, since the test output only reports
tests that were skipped. You can see that `ArrowTests` shows up as skipped for
Python 2.7 but not for 3.5.
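For anyone wondering how that conditional skipping works, the pattern is roughly the following; names like `have_pyarrow` and the skip message are illustrative of the pattern, not the exact test code:

```python
import unittest

# Probe for the optional dependency at import time.
try:
    import pyarrow  # noqa: F401
    have_pyarrow = True
except ImportError:
    have_pyarrow = False

@unittest.skipIf(not have_pyarrow, "pyarrow not installed, skipping Arrow tests")
class ArrowTests(unittest.TestCase):
    def test_create_data_frame(self):
        # createDataFrame-with-Arrow assertions would go here
        pass
```

When pyarrow is missing (as in the Python 2.7 env) the whole class is reported as skipped; when it is present (the 3.5 env) the tests just run without any extra notice in the output.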
---