[GitHub] [spark] HyukjinKwon opened a new pull request #28928: [SPARK-32098][PYTHON] Use iloc for positional slicing instead of direct slicing in createDataFrame with Arrow

GitBox Thu, 25 Jun 2020 04:16:26 -0700


HyukjinKwon opened a new pull request #28928:
URL: https://github.com/apache/spark/pull/28928



   ### What changes were proposed in this pull request?
   
   When a pandas DataFrame has a float index, creating a Spark DataFrame from 
it produces wrong results when Arrow is enabled, as shown below:
   
   ```bash
   ./bin/pyspark --conf spark.sql.execution.arrow.pyspark.enabled=true
   ```
   
   ```python
   >>> import pandas as pd
   >>> spark.createDataFrame(pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])).show()
   +---+
   |  a|
   +---+
   |  1|
   |  1|
   |  2|
   +---+
   ```
   
   This is because direct slicing treats the slice bounds as index labels, not 
positions, when the index contains floats:
   
   ```python
   >>> pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])[2:]
        a
   2.0  1
   3.0  2
   4.0  3
   >>> pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.]).iloc[2:]
        a
   4.0  3
   >>> pd.DataFrame({'a': [1,2,3]}, index=[2, 3, 4])[2:]
      a
   4  3
   ```
   
   This PR proposes to explicitly use `iloc` to slice positionally when we 
create a DataFrame from a pandas DataFrame with Arrow enabled.
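
As a rough illustration of the idea, here is a minimal sketch of positional chunking with `iloc`. The helper name `slice_positionally` and the `step` parameter are hypothetical and chosen only for this example; this is not the code changed in this PR.

```python
import pandas as pd


def slice_positionally(pdf, step):
    """Yield consecutive row chunks of `pdf`, each at most `step` rows long."""
    # Hypothetical helper for illustration only; not Spark's actual code.
    for start in range(0, len(pdf), step):
        # .iloc always slices by position, regardless of the index dtype.
        yield pdf.iloc[start:start + step]


# Same float-indexed frame as in the example above.
pdf = pd.DataFrame({'a': [1, 2, 3]}, index=[2., 3., 4.])
chunks = list(slice_positionally(pdf, 1))
values = [chunk['a'].tolist() for chunk in chunks]
print(values)  # [[1], [2], [3]]
```

Because every chunk is taken by position, each row appears exactly once no matter what labels the index holds.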
   
   FWIW, I was trying to investigate why direct slicing sometimes refers to the 
index value and sometimes to the positional index, but I stopped investigating 
further after reading this: 
https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#selection
   
   > While standard Python / Numpy expressions for selecting and setting are 
intuitive and come in handy for interactive work, for production code, we 
recommend the optimized pandas data access methods, `.at`, `.iat`, `.loc` and 
`.iloc`.
   
   ### Why are the changes needed?
   
   To create the correct Spark DataFrame from a pandas DataFrame without data 
loss.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, it is a bug fix. 
   
   ```bash
   ./bin/pyspark --conf spark.sql.execution.arrow.pyspark.enabled=true
   ```
   ```python
   import pandas as pd
   spark.createDataFrame(pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])).show()
   ```
   
   Before:
   
   ```python
   +---+
   |  a|
   +---+
   |  1|
   |  1|
   |  2|
   +---+
   ```
   
   After:
   
   ```python
   +---+
   |  a|
   +---+
   |  1|
   |  2|
   |  3|
   +---+
   ```
   
   ### How was this patch tested?
   
   Manually tested, and unit tests were added.


