[GitHub] [spark] itholic opened a new pull request #32853: [SPARK-35683][PYTHON] Fix Index.difference to avoid collect 'other' to driver side

GitBox Wed, 09 Jun 2021 22:08:45 -0700


itholic opened a new pull request #32853:
URL: https://github.com/apache/spark/pull/32853



   ### What changes were proposed in this pull request?
   
   This PR fixes the incorrect behavior of `Index.difference` in pandas APIs on 
Spark, based on the comments 
https://github.com/databricks/koalas/pull/1325#discussion_r647889901 and 
https://github.com/databricks/koalas/pull/1325#discussion_r647890007:
   - It did not properly handle the case where `self` is an `Index` and 
`other` is a `MultiIndex`, or vice versa.
   ```python
   >>> import pyspark.pandas as ps
   >>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)])
   >>> idx1 = ps.Index([1, 2, 3])
   >>> midx1.difference(idx1)
   pyspark.pandas.exceptions.PandasNotImplementedError: The method `pd.Index.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
   ```
   - It collected all of the data to the driver side when `other` was a 
list-like object. This is especially dangerous when `other` is a distributed 
object such as a Series.
   
   This PR also adds the related test cases.
   
   ### Why are the changes needed?
   
   To correct behavior that is incompatible with pandas, and to prevent a case 
that can easily cause an OOM. With this fix, the following example returns a 
result instead of raising:
   
   ```python
   >>> import pyspark.pandas as ps
   >>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)])
   >>> idx1 = ps.Index([1, 2, 3])
   >>> midx1.difference(idx1)
   MultiIndex([('a', 'x', 1),
               ('b', 'z', 2),
               ('k', 'z', 3)],
              )
   ```
   
   A Python for loop is now used only when `other` is a `list`, `set`, or 
`dict`.
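
The type check described above could be sketched roughly as follows. This is only an illustration and not the actual `pyspark.pandas` code: `DistributedIndex` and `keys_from_other` are hypothetical names, and raising `TypeError` for other inputs is a placeholder for whatever distributed code path the real implementation takes.

```python
# Hypothetical sketch; not the actual pyspark.pandas implementation.
from typing import Any, Hashable, List


class DistributedIndex:
    """Stand-in for a distributed object such as ps.Index or ps.Series."""

    def __init__(self, values):
        self._values = list(values)

    def __iter__(self):
        # Iterating a distributed object would pull its data to the driver.
        raise NotImplementedError("iteration would collect data to the driver")


def keys_from_other(other: Any) -> List[Hashable]:
    """Loop over `other` only when it is a local list, set, or dict."""
    if isinstance(other, (list, set, dict)):
        # Iterating a dict yields its keys; this loop touches local data only.
        return [key for key in other]
    # Placeholder (assumption): anything else must be handled without a loop.
    raise TypeError(
        f"{type(other).__name__} must be handled without a local loop"
    )
```

Because the `isinstance` check runs before any iteration, a distributed input is never iterated and therefore never collected by this helper.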
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, the previous bug is fixed as described in the above code examples.
   
   
   ### How was this patch tested?
   
   Manually tested with the linter and unit tests locally; it is expected to 
pass on CI.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
