xinrong-databricks opened a new pull request #33982: URL: https://github.com/apache/spark/pull/33982
### What changes were proposed in this pull request?
Introduce the `compute.isin_limit` option, with a default value of 80.

### Why are the changes needed?
`Column.isin(list)` does not perform well when the given `list` is large, as reported in https://issues.apache.org/jira/browse/SPARK-33383. Thus, `compute.isin_limit` is introduced to constrain the usage of `Column.isin(list)` in the code base. If the length of the `list` is above `compute.isin_limit`, a broadcast join is used instead for better performance (see the illustrative sketch after this description).

#### Why is the default value 80?
After reproducing the benchmark mentioned in https://issues.apache.org/jira/browse/SPARK-33383:

| length of filtering list | isin time / ms | broadcast DF time / ms |
| :---: | :---: | :---: |
| 200 | 69411 | 39296 |
| 100 | 43074 | 40087 |
| 80 | 35592 | 40350 |
| 50 | 28134 | 37847 |

When the length of the filtering list is <= 80, the `isin` approach performs better than the `broadcast DF` approach.

### Does this PR introduce _any_ user-facing change?
Users may read/write the value of `compute.isin_limit` as follows:

```py
>>> ps.get_option('compute.isin_limit')
80
>>> ps.set_option('compute.isin_limit', 10)
>>> ps.get_option('compute.isin_limit')
10
>>> ps.set_option('compute.isin_limit', -1)
...
ValueError: 'compute.isin_limit' should be greater than or equal to 0.
>>> ps.reset_option('compute.isin_limit')
>>> ps.get_option('compute.isin_limit')
80
```

### How was this patch tested?
Manual test.
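For illustration only, below is a minimal sketch of how such a threshold-based strategy switch could look. The helper `filter_by_values` is hypothetical and is not the code added by this PR; it only shows the pattern of choosing between `Column.isin(list)` and a broadcast join based on the `compute.isin_limit` option.

```py
# Hypothetical sketch; names and structure are illustrative, not the PR's implementation.
from pyspark import pandas as ps
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def filter_by_values(sdf, column, values):
    """Filter `sdf` to rows whose `column` is in `values`, switching strategy
    based on the 'compute.isin_limit' option."""
    limit = ps.get_option("compute.isin_limit")
    if len(values) <= limit:
        # Small list: a plain Column.isin is cheaper.
        return sdf.where(F.col(column).isin(list(values)))
    # Large list: build a one-column DataFrame of the values and
    # broadcast-join it instead, which scales better.
    values_sdf = spark.createDataFrame([(v,) for v in values], [column])
    return sdf.join(F.broadcast(values_sdf), on=column, how="inner")
```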
