itholic opened a new pull request #33634: URL: https://github.com/apache/spark/pull/33634
### What changes were proposed in this pull request?

This PR proposes fixing `Index.union` to follow the behavior of pandas 1.3.

**NOTE**: We'd better NOT follow one part of the pandas 1.3 behavior: counting all the duplicated values is too expensive, because pandas-on-Spark deals with large datasets.

For example, if there are duplicate values in the index, pandas takes the number of duplicates of each value from whichever of `self` or `other` has more of them.

```python
>>> pidx1 = pd.Index([1, 1, 1, 1, 1, 2, 2])
>>> pidx2 = pd.Index([1, 1, 2, 2, 2, 2, 2])
>>> pidx1.union(pidx2)
Int64Index([1, 1, 1, 1, 1, 2, 2, 2, 2, 2], dtype='int64')
```

Whereas pandas-on-Spark always takes the duplicates from `self`.

```python
>>> ps_idx1 = ps.Index([1, 1, 1, 1, 1, 2, 2])
>>> ps_idx2 = ps.Index([1, 1, 2, 2, 2, 2, 2])
>>> ps_idx1.union(ps_idx2)
Int64Index([1, 1, 1, 1, 1, 2, 2], dtype='int64')
```

So, if `self` has at least as many duplicates of every value as `other`, it behaves the same as pandas, as below.

```python
>>> # pandas-on-Spark
>>> ps_idx1 = ps.Index([1, 1, 1, 1, 2, 3, 3])
>>> ps_idx2 = ps.Index([1, 1, 1, 2, 3, 3])
>>> ps_idx1.union(ps_idx2)
Int64Index([1, 1, 1, 1, 2, 3, 3], dtype='int64')
>>> # pandas
>>> ps_idx1.to_pandas().union(ps_idx2.to_pandas())
Int64Index([1, 1, 1, 1, 2, 3, 3], dtype='int64')
```

If we wanted to follow pandas 100%, we would need to count and compare all duplicate values in `self` and `other`, which is extremely expensive (it also requires the `has_duplicates` operation, which internally performs the count operation twice in every case).

### Why are the changes needed?

We should follow the behavior of pandas as much as possible.

### Does this PR introduce _any_ user-facing change?

Yes, the result for some cases will change.

### How was this patch tested?

Fixed the unit tests.
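To make the difference between the two duplicate-handling rules concrete, here is a minimal pure-Python sketch. It is not the implementation in this PR and does not use Spark. The helper names `union_pandas_style` and `union_self_style` are hypothetical, and the handling of values that appear only in `other` (added once) is an assumption, since the description above does not cover that case.

```python
from collections import Counter


def union_pandas_style(self_values, other_values):
    """Keep each value as many times as the side with more duplicates has it."""
    # Counter's `|` operator takes the element-wise maximum of the counts,
    # which requires counting every value on both sides.
    merged = Counter(self_values) | Counter(other_values)
    return sorted(merged.elements())


def union_self_style(self_values, other_values):
    """Keep every value of `self` as-is; duplicate counts in `other` are ignored."""
    # Assumption: a value found only in `other` is added exactly once.
    only_in_other = set(other_values) - set(self_values)
    return sorted(list(self_values) + list(only_in_other))


left = [1, 1, 1, 1, 1, 2, 2]
right = [1, 1, 2, 2, 2, 2, 2]
print(union_pandas_style(left, right))  # [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
print(union_self_style(left, right))    # [1, 1, 1, 1, 1, 2, 2]
```

The pandas-style rule needs a full per-value count of both inputs, which is the cost the note above wants to avoid on large datasets. The `self`-style rule only needs to know which values of `other` are missing from `self`.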
