Liang Zhang created SPARK-32479:
-----------------------------------
Summary: Fix the slicing logic in createDataFrame when converting pandas dataframe to arrow table
Key: SPARK-32479
URL: https://issues.apache.org/jira/browse/SPARK-32479
Project: Spark
Issue Type: Story
Components: PySpark
Affects Versions: 3.1.0
Reporter: Liang Zhang
h1. Problem:
In [https://github.com/databricks/runtime/blob/84a952313ae73e3df32f065eb00cc0bcb024af14/python/pyspark/sql/pandas/conversion.py#L418], the slicing logic may produce fewer partitions than the requested number of slices.
h1. Example:
Assume:
{code}
length = 100 -> [0, 1, ..., 99]
num_slices = 99 = self.sparkContext.defaultParallelism
{code}
Old method: step = math.ceil(length / num_slices) = 2; start = i * step, end = (i + 1) * step.
Output: [0,1] [2,3] [4,5] ... [98,99] -> 50 slices != num_slices
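A minimal standalone sketch (illustrative names, not the actual PySpark code) that reproduces the old arithmetic and shows the slice count falling short:
{code:python}
import math

length, num_slices = 100, 99
step = math.ceil(length / num_slices)           # ceil(100 / 99) = 2
slices = [list(range(start, min(start + step, length)))
          for start in range(0, length, step)]

print(len(slices))            # 50, not the requested 99
print(slices[0], slices[-1])  # [0, 1] [98, 99]
{code}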
h1. Solution:
We can use logic similar to that in [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala#L125]:
{code:python}
# replace conversion.py#L418
pdf_slices = (pdf.iloc[i * length // num_slices: (i + 1) * length // num_slices]
              for i in range(num_slices))
{code}
New method: start = i * length // num_slices, end = (i + 1) * length // num_slices.
Output: [0] [1] [2] ... [98,99] -> 99 slices == num_slices
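A standalone sketch (illustrative, not the actual PySpark code) verifying that the floor-division arithmetic yields exactly num_slices slices and covers every row exactly once:
{code:python}
length, num_slices = 100, 99
slices = [list(range(i * length // num_slices, (i + 1) * length // num_slices))
          for i in range(num_slices)]

print(len(slices))                       # 99 == num_slices
print(slices[0], slices[1], slices[-1])  # [0] [1] [98, 99]
assert sum(slices, []) == list(range(length))  # no gaps, no overlaps
{code}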