Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/21546#discussion_r213818193
--- Diff: python/pyspark/context.py ---
@@ -494,10 +494,14 @@ def f(split, iterator):
             c = list(c)    # Make it a list so we can compute its length
         batchSize = max(1, min(len(c) // numSlices, self._batchSize or 1024))
         serializer = BatchedSerializer(self._unbatched_serializer, batchSize)
-        jrdd = self._serialize_to_jvm(c, numSlices, serializer)
+
+        def reader_func(temp_filename):
+            return self._jvm.PythonRDD.readRDDFromFile(self._jsc, temp_filename, numSlices)
+
+        jrdd = self._serialize_to_jvm(c, serializer, reader_func)
         return RDD(jrdd, self, serializer)

-    def _serialize_to_jvm(self, data, parallelism, serializer):
+    def _serialize_to_jvm(self, data, serializer, reader_func):
--- End diff ---
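For anyone following along, here is a rough sketch of the shape the refactor above takes: the helper serializes the data to a temp file and then hands the filename to whatever `reader_func` the caller supplied, instead of hard-coding how the JVM loads it. The body below is an illustrative approximation, not the exact patch:

```python
import os
from tempfile import NamedTemporaryFile

def _serialize_to_jvm(self, data, serializer, reader_func):
    # Sketch of the refactored SparkContext helper: dump the data to a temp
    # file with the given serializer, then let the caller-supplied reader_func
    # decide how the JVM reads it back (e.g. PythonRDD.readRDDFromFile).
    temp_file = NamedTemporaryFile(delete=False, dir=self._temp_dir)
    try:
        serializer.dump_stream(data, temp_file)
        temp_file.close()
        return reader_func(temp_file.name)
    finally:
        os.unlink(temp_file.name)
```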
Hey @squito, yes, that's correct: this is in the path that the `ArrowTests` for
`createDataFrame` exercise. Those tests are skipped if pyarrow is not installed,
but in our Jenkins builds pyarrow is installed under the Python 3.5 env, so they
do get run there.
It's a little subtle to confirm that they ran, since the test output only reports
tests that were skipped. You can see that `ArrowTests` shows up as skipped for
Python 2.7 but not for 3.5.
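For anyone wondering how that conditional skipping works, the pattern is roughly the following; names like `have_pyarrow` and the skip message are illustrative of the pattern, not the exact test code:

```python
import unittest

# Probe for the optional dependency at import time.
try:
    import pyarrow  # noqa: F401
    have_pyarrow = True
except ImportError:
    have_pyarrow = False

@unittest.skipIf(not have_pyarrow, "pyarrow not installed, skipping Arrow tests")
class ArrowTests(unittest.TestCase):
    def test_create_data_frame(self):
        # createDataFrame-with-Arrow assertions would go here
        pass
```

When pyarrow is missing (as in the Python 2.7 env) the whole class is reported as skipped; when it is present (the 3.5 env) the tests just run without any extra notice in the output.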
---