HyukjinKwon opened a new pull request, #49297:
URL: https://github.com/apache/spark/pull/49297
### What changes were proposed in this pull request?

This PR proposes a Pythonic way of getting and setting Spark SQL configurations, as demonstrated below. A minimal sketch of the access pattern appears at the end of this description.

**Get/set configurations**

```python
>>> spark.conf["spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled"] = "false"
>>> spark.conf["spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled"]
'false'
>>> spark.conf.spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled = "true"
>>> spark.conf.spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled
'true'
```

**List sub-configurations**

```python
>>> dir(spark.conf["spark.sql.optimizer"])
['avoidCollapseUDFWithExpensiveExpr', 'collapseProjectAlwaysInline', 'dynamicPartitionPruning.enabled', 'enableCsvExpressionOptimization', 'enableJsonExpressionOptimization', 'excludedRules', 'ptimizer.excludedRules', 'runtime.bloomFilter.applicationSideScanSizeThreshold', 'runtime.bloomFilter.creationSideThreshold', 'runtime.bloomFilter.enabled', 'runtime.bloomFilter.expectedNumItems', 'runtime.bloomFilter.maxNumBits', 'runtime.bloomFilter.maxNumItems', 'runtime.bloomFilter.numBits', 'runtime.rowLevelOperationGroupFilter.enabled', 'runtimeFilter.number.threshold']
>>> dir(spark.conf.spark.sql.optimizer)
['avoidCollapseUDFWithExpensiveExpr', 'collapseProjectAlwaysInline', 'dynamicPartitionPruning.enabled', 'enableCsvExpressionOptimization', 'enableJsonExpressionOptimization', 'excludedRules', 'ptimizer.excludedRules', 'runtime.bloomFilter.applicationSideScanSizeThreshold', 'runtime.bloomFilter.creationSideThreshold', 'runtime.bloomFilter.enabled', 'runtime.bloomFilter.expectedNumItems', 'runtime.bloomFilter.maxNumBits', 'runtime.bloomFilter.maxNumItems', 'runtime.bloomFilter.numBits', 'runtime.rowLevelOperationGroupFilter.enabled', 'runtimeFilter.number.threshold']
```

**Get the documentation of a configuration**

```python
>>> spark.conf["spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled"].desc()
"Enables runtime group filtering for group-based row-level operations. Data sources that replace groups of data (e.g. files, partitions) may prune entire groups using provided data source filters when planning a row-level operation scan. However, such filtering is limited as not all expressions can be converted into data source filters and some expressions can only be evaluated by Spark (e.g. subqueries). Since rewriting groups is expensive, Spark can execute a query at runtime to find what records match the condition of the row-level operation. The information about matching records will be passed back to the row-level operation scan, allowing data sources to discard groups that don't have to be rewritten."
```

```python
>>> spark.conf.spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled.version()
'3.4.0'
```

### Why are the changes needed?

To provide a Pythonic way of setting options. For reference, pandas supports a similar interface (https://pandas.pydata.org/docs/user_guide/options.html).

### Does this PR introduce _any_ user-facing change?

Yes, it gives users a more Pythonic way of setting SQL configurations, as demonstrated above.

### How was this patch tested?

TBD

### Was this patch authored or co-authored using generative AI tooling?

No.
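As a rough illustration of the access pattern above, here is a minimal, self-contained sketch. It assumes a plain dict as the backing store instead of a live SparkSession, and the `ConfProxy` name and its helpers are hypothetical, not the PR's actual implementation. A real implementation would also need leaf lookups to return a string-like wrapper so that methods such as `desc()` and `version()` are available; that part is omitted here for brevity.

```python
# Hypothetical sketch of dotted attribute/item access over a flat
# "dotted.key" -> value mapping; not the actual PR implementation.

class ConfProxy:
    def __init__(self, store, prefix=""):
        # Bypass our own __setattr__ when storing internal state.
        object.__setattr__(self, "_store", store)
        object.__setattr__(self, "_prefix", prefix)

    def _key(self, name):
        return f"{self._prefix}.{name}" if self._prefix else name

    def __getitem__(self, name):
        key = self._key(name)
        if key in self._store:
            return self._store[key]  # leaf: return the value itself
        # Not a leaf: return a nested proxy so chained access keeps working.
        return ConfProxy(self._store, key)

    __getattr__ = __getitem__  # conf.a.b.c behaves like conf["a"]["b"]["c"]

    def __setitem__(self, name, value):
        self._store[self._key(name)] = value

    def __setattr__(self, name, value):
        self[name] = value

    def __dir__(self):
        # List sub-configurations under the current prefix, as in the demo.
        prefix = self._prefix + "." if self._prefix else ""
        return sorted(k[len(prefix):] for k in self._store if k.startswith(prefix))


conf = ConfProxy({"spark.sql.shuffle.partitions": "200"})
conf.spark.sql.shuffle.partitions = "8"      # attribute-style set
print(conf["spark.sql.shuffle.partitions"])  # item-style get -> '8'
print(dir(conf.spark.sql))                   # -> ['shuffle.partitions']
```

Routing `__getattr__` through `__getitem__` keeps the two access styles consistent, and defining `__dir__` is what lets `dir()` (and tab completion) list the sub-configurations under a prefix.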
