itholic commented on a change in pull request #33634:
URL: https://github.com/apache/spark/pull/33634#discussion_r683033391



##########
File path: python/pyspark/pandas/indexes/base.py
##########
@@ -2235,6 +2235,24 @@ def union(
         """
         Form the union of two Index objects.
 
+        .. note:: For duplicated values, pandas chooses the number of duplicates of self or other
+            with more duplicates. But counting all duplicates is very expensive for large data,
+            so pandas-on-Spark always chooses the number of duplicates in self.

Review comment:
       The pandas developers listed this as a bug fix in the 1.3.0 release notes at https://pandas.pydata.org/pandas-docs/dev/whatsnew/v1.3.0.html#indexing.
   
   Let me try the solution @ueshin suggested in https://github.com/apache/spark/pull/33634#discussion_r682859453
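
   For reference, here is a rough sketch of the two duplicate-handling semantics described in the note, written in plain Python with `collections.Counter` rather than the real pandas or pandas-on-Spark code. The function names `union_max_duplicates` and `union_self_duplicates` are hypothetical. The handling of values that appear only in `right` (kept once) is an assumption, since the note does not say.

```python
from collections import Counter


def union_max_duplicates(left, right):
    """Keep each value max(count in left, count in right) times.

    Models the behaviour the note attributes to pandas.
    """
    # Counter's | operator keeps the larger count for every key.
    merged = Counter(left) | Counter(right)
    return sorted(merged.elements())


def union_self_duplicates(left, right):
    """Keep left's duplicates unchanged and add values missing from left.

    Models the behaviour the note attributes to pandas-on-Spark.
    Assumption: a value found only in right is added once.
    """
    seen = set(left)
    # dict.fromkeys de-duplicates right while preserving its order.
    extra = [value for value in dict.fromkeys(right) if value not in seen]
    return sorted(list(left) + extra)


left, right = [1, 1, 2], [1, 1, 1, 3]
print(union_max_duplicates(left, right))   # [1, 1, 1, 2, 3]
print(union_self_duplicates(left, right))  # [1, 1, 2, 3]
```

   The second variant never has to count duplicates in `right`, which is the cost the note is trying to avoid on large data.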




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


