[GitHub] [spark] WeichenXu123 commented on a change in pull request #29284: [SPARK-32479][PYSPARK] Fix the slicing logic in createDataFrame when converting pandas dataframe to arrow table

GitBox Thu, 30 Jul 2020 00:27:26 -0700


WeichenXu123 commented on a change in pull request #29284:
URL: https://github.com/apache/spark/pull/29284#discussion_r462672906




##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -404,8 +404,10 @@ def _create_from_pandas_with_arrow(self, pdf, schema, timezone):
                            for t in pdf.dtypes]
 
         # Slice the DataFrame to be batched
-        step = -(-len(pdf) // self.sparkContext.defaultParallelism)  # round int up
-        pdf_slices = (pdf.iloc[start:start + step] for start in range(0, len(pdf), step))
+        length = len(pdf)
+        num_slices = self.sparkContext.defaultParallelism
+        pdf_slices = (pdf.iloc[i * length // num_slices: (i + 1) * length // num_slices]
+                      for i in range(0, num_slices))
 

Review comment:
       Let's use
   ```
   start_end_pairs = ((i * length // num_slices, (i + 1) * length // num_slices)
                      for i in range(0, num_slices))
   pdf_slices = (pdf.iloc[start:end] for start, end in start_end_pairs if start < end)
   ```
   to filter out empty slices (these occur when len(pdf) < num_slices and may lead to an error).
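
To illustrate the boundary arithmetic in the suggestion, here is a minimal sketch in plain Python. It works on integer bounds only and does not touch pandas or Spark. The helper name `slice_bounds` is hypothetical and is not part of the PR.

```python
def slice_bounds(length, num_slices):
    """Return the non-empty (start, end) pairs that split `length` rows
    into at most `num_slices` contiguous slices.

    Hypothetical helper for illustration; it mirrors the expression
    i * length // num_slices from the review comment.
    """
    pairs = ((i * length // num_slices, (i + 1) * length // num_slices)
             for i in range(num_slices))
    # Drop pairs where start == end; these are the empty slices that
    # appear when length < num_slices.
    return [(start, end) for start, end in pairs if start < end]


if __name__ == "__main__":
    print(slice_bounds(10, 4))  # 10 rows, 4 slices
    print(slice_bounds(3, 8))   # fewer rows than slices
```

With 3 rows and 8 slices, five of the eight candidate pairs have `start == end`, so the `start < end` filter leaves three single-row slices.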






