xinrong-databricks commented on a change in pull request #33998:
URL: https://github.com/apache/spark/pull/33998#discussion_r712536379



##########
File path: python/pyspark/pandas/frame.py
##########
@@ -9974,13 +9974,25 @@ def filter(
                 raise ValueError("items should be a list-like object.")
             if axis == 0:
                 if len(index_scols) == 1:
-                    col = None
-                    for item in items:
-                        if col is None:
-                            col = index_scols[0] == SF.lit(item)
-                        else:
-                            col = col | (index_scols[0] == SF.lit(item))
-                elif len(index_scols) > 1:
+                    if len(items) <= ps.get_option("compute.isin_limit"):
+                        col = index_scols[0].isin([SF.lit(item) for item in items])
+                        return DataFrame(self._internal.with_filter(col))
+                    else:
+                        item_sdf_col = verify_temp_column_name(

Review comment:
       The point is to use a `semi` join, and we try broadcasting in case `item_sdf` is small, for efficiency.
   
   As we discussed in https://github.com/apache/spark/pull/33964#issuecomment-921943161,
   
   a broadcast DataFrame join performs best, compared to both `isin` and the long OR-chain selection (the original logic), once the number of items exceeds `compute.isin_limit`.
   
   That's why we want to switch to a broadcast join.
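   To make the trade-off concrete, here is a minimal sketch of the two strategies being compared. It is illustrative only, not the PR's actual code: the local Spark session, the `ISIN_LIMIT` constant (standing in for `ps.get_option("compute.isin_limit")`), and the column names are all assumptions.

   ```python
   from pyspark.sql import SparkSession, functions as F

   spark = SparkSession.builder.master("local[1]").appName("filter-demo").getOrCreate()

   df = spark.createDataFrame([(i, f"row{i}") for i in range(10)], ["id", "val"])
   items = [2, 5, 7]
   ISIN_LIMIT = 80  # stand-in for ps.get_option("compute.isin_limit")

   if len(items) <= ISIN_LIMIT:
       # Small item list: a plain isin predicate is cheap and simple.
       filtered = df.filter(F.col("id").isin(items))
   else:
       # Large item list: materialize the items as a DataFrame and use a
       # broadcast left-semi join, which keeps only matching rows of `df`
       # without duplicating them or adding columns.
       item_df = spark.createDataFrame([(i,) for i in items], ["id"])
       filtered = df.join(F.broadcast(item_df), on="id", how="left_semi")

   print(sorted(r.id for r in filtered.collect()))  # [2, 5, 7]
   ```

   The `left_semi` join acts as a pure filter (no columns from `item_df` leak into the result), and the `broadcast` hint avoids a shuffle when the item list is small enough to ship to every executor.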
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
