HyukjinKwon opened a new pull request, #49297:
URL: https://github.com/apache/spark/pull/49297
### What changes were proposed in this pull request?

This PR proposes a Pythonic way of getting and setting Spark SQL configurations, as demonstrated below. A minimal sketch of the access pattern appears at the end of this description.

**Get/set configurations**

```python
>>> spark.conf["spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled"] = "false"
>>> spark.conf["spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled"]
'false'
>>> spark.conf.spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled = "true"
>>> spark.conf.spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled
'true'
```

**List sub-configurations**

```python
>>> dir(spark.conf["spark.sql.optimizer"])
['avoidCollapseUDFWithExpensiveExpr', 'collapseProjectAlwaysInline', 'dynamicPartitionPruning.enabled', 'enableCsvExpressionOptimization', 'enableJsonExpressionOptimization', 'excludedRules', 'ptimizer.excludedRules', 'runtime.bloomFilter.applicationSideScanSizeThreshold', 'runtime.bloomFilter.creationSideThreshold', 'runtime.bloomFilter.enabled', 'runtime.bloomFilter.expectedNumItems', 'runtime.bloomFilter.maxNumBits', 'runtime.bloomFilter.maxNumItems', 'runtime.bloomFilter.numBits', 'runtime.rowLevelOperationGroupFilter.enabled', 'runtimeFilter.number.threshold']
>>> dir(spark.conf.spark.sql.optimizer)
['avoidCollapseUDFWithExpensiveExpr', 'collapseProjectAlwaysInline', 'dynamicPartitionPruning.enabled', 'enableCsvExpressionOptimization', 'enableJsonExpressionOptimization', 'excludedRules', 'ptimizer.excludedRules', 'runtime.bloomFilter.applicationSideScanSizeThreshold', 'runtime.bloomFilter.creationSideThreshold', 'runtime.bloomFilter.enabled', 'runtime.bloomFilter.expectedNumItems', 'runtime.bloomFilter.maxNumBits', 'runtime.bloomFilter.maxNumItems', 'runtime.bloomFilter.numBits', 'runtime.rowLevelOperationGroupFilter.enabled', 'runtimeFilter.number.threshold']
```

**Get the documentation of a configuration**

```python
>>> spark.conf["spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled"].desc()
"Enables runtime group filtering for group-based row-level operations. Data sources that replace groups of data (e.g. files, partitions) may prune entire groups using provided data source filters when planning a row-level operation scan. However, such filtering is limited as not all expressions can be converted into data source filters and some expressions can only be evaluated by Spark (e.g. subqueries). Since rewriting groups is expensive, Spark can execute a query at runtime to find what records match the condition of the row-level operation. The information about matching records will be passed back to the row-level operation scan, allowing data sources to discard groups that don't have to be rewritten."
```

```python
>>> spark.conf.spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled.version()
'3.4.0'
```

### Why are the changes needed?

To provide a Pythonic way of setting options. For reference, pandas supports a similar interface (https://pandas.pydata.org/docs/user_guide/options.html).

### Does this PR introduce _any_ user-facing change?

Yes, it gives users a more Pythonic way of setting SQL configurations, as demonstrated above.

### How was this patch tested?

TBD

### Was this patch authored or co-authored using generative AI tooling?

No.
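As a rough illustration of the access pattern above, here is a minimal, self-contained sketch. It assumes a plain dict as the backing store instead of a live SparkSession, and the `ConfProxy` name and its helpers are hypothetical, not the PR's actual implementation. A real implementation would also need leaf lookups to return a string-like wrapper so that methods such as `desc()` and `version()` are available; that part is omitted here for brevity.

```python
# Hypothetical sketch of dotted attribute/item access over a flat
# "dotted.key" -> value mapping; not the actual PR implementation.

class ConfProxy:
    def __init__(self, store, prefix=""):
        # Bypass our own __setattr__ when storing internal state.
        object.__setattr__(self, "_store", store)
        object.__setattr__(self, "_prefix", prefix)

    def _key(self, name):
        return f"{self._prefix}.{name}" if self._prefix else name

    def __getitem__(self, name):
        key = self._key(name)
        if key in self._store:
            return self._store[key]  # leaf: return the value itself
        # Not a leaf: return a nested proxy so chained access keeps working.
        return ConfProxy(self._store, key)

    __getattr__ = __getitem__  # conf.a.b.c behaves like conf["a"]["b"]["c"]

    def __setitem__(self, name, value):
        self._store[self._key(name)] = value

    def __setattr__(self, name, value):
        self[name] = value

    def __dir__(self):
        # List sub-configurations under the current prefix, as in the demo.
        prefix = self._prefix + "." if self._prefix else ""
        return sorted(k[len(prefix):] for k in self._store if k.startswith(prefix))


conf = ConfProxy({"spark.sql.shuffle.partitions": "200"})
conf.spark.sql.shuffle.partitions = "8"      # attribute-style set
print(conf["spark.sql.shuffle.partitions"])  # item-style get -> '8'
print(dir(conf.spark.sql))                   # -> ['shuffle.partitions']
```

Routing `__getattr__` through `__getitem__` keeps the two access styles consistent, and defining `__dir__` is what lets `dir()` (and tab completion) list the sub-configurations under a prefix.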
