itholic opened a new pull request #33634:
URL: https://github.com/apache/spark/pull/33634


   ### What changes were proposed in this pull request?
   
   This PR proposes fixing `Index.union` to follow the behavior of pandas 
1.3.
   
   **NOTE**: We had better NOT follow every behavior of pandas 1.3, because 
counting all of the duplicated values is too expensive for the large datasets 
that pandas-on-Spark deals with.
   
   For example, if there are duplicate values in index, pandas chooses the 
number of duplicates from `self` or `other` with more duplicate values.
   
   ```python
   >>> pidx1 = pd.Index([1, 1, 1, 1, 1, 2, 2])
   >>> pidx2 = pd.Index([1, 1, 2, 2, 2, 2, 2])
   >>> pidx1.union(pidx2)
   Int64Index([1, 1, 1, 1, 1, 2, 2, 2, 2, 2], dtype='int64')
   ```
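
The pandas result above matches a multiset union, where each value appears as many times as in whichever index holds more copies of it. A minimal pure-Python sketch of that rule, assuming sorted output; `multiset_union` is a hypothetical helper for illustration, not part of pandas or pandas-on-Spark:

```python
from collections import Counter


def multiset_union(left, right):
    """Return values so each appears max(count in left, count in right) times."""
    # Counter's | operator keeps the larger count for every key.
    merged = Counter(left) | Counter(right)
    return sorted(merged.elements())


print(multiset_union([1, 1, 1, 1, 1, 2, 2], [1, 1, 2, 2, 2, 2, 2]))
# [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
```

Note that this rule needs the per-value counts from both sides before it can produce any output.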
   
   Whereas pandas-on-Spark always chooses the number of duplicates from `self`.
   
   ```python
   >>> ps_idx1 = ps.Index([1, 1, 1, 1, 1, 2, 2])
   >>> ps_idx2 = ps.Index([1, 1, 2, 2, 2, 2, 2])
   >>> ps_idx1.union(ps_idx2)
   Int64Index([1, 1, 1, 1, 1, 2, 2], dtype='int64')
   ```
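
The rule described here can be sketched in plain Python as follows. This only illustrates the semantics stated in this description and is not the Spark implementation. `union_keep_self_duplicates` is a hypothetical name, and the handling of values that appear only in `other` (added once each) is an assumption that this description does not confirm:

```python
def union_keep_self_duplicates(left, right):
    """Keep every occurrence from `left`; add values missing from `left`."""
    seen = set(left)
    # Assumption: a value found only in `right` is added once, without duplicates.
    extra = set(right) - seen
    return sorted(list(left) + list(extra))


print(union_keep_self_duplicates([1, 1, 1, 1, 1, 2, 2], [1, 1, 2, 2, 2, 2, 2]))
# [1, 1, 1, 1, 1, 2, 2]
```

No per-value counting is needed here: only membership of each value in `left` has to be known.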
   
   So, if `self` has at least as many duplicates of each value as `other`, it 
behaves the same as pandas, as below.
   
   ```python
   >>> # pandas-on-Spark
   >>> ps_idx1 = ps.Index([1, 1, 1, 1, 2, 3, 3])
   >>> ps_idx2 = ps.Index([1, 1, 1, 2, 3, 3])
   >>> ps_idx1.union(ps_idx2)
   Int64Index([1, 1, 1, 1, 2, 3, 3], dtype='int64')
   >>> # pandas
   >>> ps_idx1.to_pandas().union(ps_idx2.to_pandas())
   Int64Index([1, 1, 1, 1, 2, 3, 3], dtype='int64')
   ```
   
   To follow pandas 100%, we would need to count and compare the duplicates 
of every value in `self` and `other`, which is extremely expensive (it also 
requires the `has_duplicates` operation, which internally performs the count 
operation twice in every case).
   
   
   ### Why are the changes needed?
   
   We should follow the behavior of pandas as much as possible.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, the result for some cases will change.
   
   ### How was this patch tested?
   
   Fixed the unit tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


