Mihailo Timotic created SPARK-55928:
---------------------------------------

             Summary: New linter for config effectiveness in views, UDFs and 
procedures
                 Key: SPARK-55928
                 URL: https://issues.apache.org/jira/browse/SPARK-55928
             Project: Spark
          Issue Type: Documentation
          Components: SQL
    Affects Versions: 4.2.0
            Reporter: Mihailo Timotic


Summary
 
Introduce a ConfigBindingPolicy framework in Apache Spark that enforces all 
newly added configurations to explicitly declare how their values are bound 
when used within SQL views, UDFs, or procedures. This replaces the manually 
maintained hardcoded RETAINED_ANALYSIS_FLAGS allowlist in the Analyzer with a 
dynamic, policy-driven approach.
 
 
Background: Conf + views mechanics
 
There are 3 ways Spark configs can interact with views:
 
1. The conf value is stored with the view/UDF/procedure on creation and 
applied on read; the session value is deprioritized. Examples: the ANSI conf, 
timezone.
 
2. The conf is not stored with a view, but its value is propagated through the 
view from the active session. Examples: kill-switches, feature flags.
 
3. The conf is neither stored with a view, nor propagated through a view. This 
is the historical default in Spark.
 
The confusion arises for configurations that are not captured on 
view/UDF/procedure creation, but still need to be used when querying them. The 
common assumption is that if a conf is not preserved upon creation, its value 
inside the view/UDF/procedure will be whatever the value is in the currently 
active session. This is NOT true.
 
If a conf is not preserved on creation, its value when querying the 
view/UDF/procedure will be:
- The value from the currently active session, only if the conf is in a 
hardcoded allowlist (RETAINED_ANALYSIS_FLAGS in Analyzer.scala).
- The Spark default otherwise.
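The resolution rule above can be sketched as follows. This is a minimal 
illustration, not Spark's actual implementation: `retainedFlags` stands in for 
RETAINED_ANALYSIS_FLAGS, and the other names are hypothetical.

```scala
// Hypothetical sketch of how a non-persisted conf resolves inside a view.
// `retainedFlags` models RETAINED_ANALYSIS_FLAGS; other names are illustrative.
object ViewConfResolution {
  def effectiveValue(
      key: String,
      sessionConfs: Map[String, String],
      retainedFlags: Set[String],
      defaults: Map[String, String]): String = {
    if (retainedFlags.contains(key)) {
      // Conf is in the allowlist: propagate the active session's value.
      sessionConfs.getOrElse(key, defaults(key))
    } else {
      // Not in the allowlist: the view silently sees the Spark default.
      defaults(key)
    }
  }
}
```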
 
This allowlist is extremely non-obvious and easy to forget about, and it has 
caused regressions in the past: new configs affecting query semantics were not 
added to the allowlist, causing views and UDFs to silently use Spark defaults 
instead of session values.
 
 
Problem
 
The Analyzer.RETAINED_ANALYSIS_FLAGS list is a manually maintained hardcoded 
allowlist of configs that should propagate from the active session when 
resolving views and SQL UDFs. This approach is error-prone: developers adding 
new configs that affect query semantics can easily forget to add them to this 
list, causing subtle bugs where views and UDFs silently use Spark defaults 
instead of session values. There is no automated enforcement to catch missing 
entries. Even within analysis, Spark can trigger a Spark job recursively which 
would potentially reference any config (for example, this is needed for schema 
inference), so the scope of affected configs is broader than it appears.
 
 
Proposed Solution
 
Introduce a ConfigBindingPolicy enum and require all newly added configs to 
explicitly declare a binding policy. This forces developers to think about how 
their config interacts with views, UDFs, and procedures at definition time.
 
The enum has three values:
 
- SESSION: The config value propagates from the active session to 
views/UDFs/procedures. This is the most common policy. Use it for feature flags 
or bugfix kill-switches where uniform behavior across the entire query is 
desired. Think about it this way: if you make a behavior change, enable it by 
default, then discover a bug and need to revert it via the conf, that revert 
only reaches existing views if the policy is SESSION -- otherwise they keep the 
baked-in behavior. Examples: plan change logging 
(spark.sql.planChangeLog.level), bugfixes 
(spark.sql.analyzer.preferColumnOverLcaInArrayIndex).
 
- PERSISTED: The config uses the value saved at view/UDF/procedure creation 
time, or the Spark default if none was saved. Use for configs that carry view 
semantic meaning that should be consistent regardless of session changes. A 
good example is ANSI mode -- views created with ANSI off should always have 
ANSI off, regardless of the session value.
 
- NOT_APPLICABLE: The config does not interact with view/UDF/procedure 
resolution at all. Only choose this if you are confident that is the case. If 
accessed at runtime, it behaves the same as SESSION. Examples: UI confs, 
server confs.
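The proposed declaration at conf definition time could look like the sketch 
below. This is a simplified stand-in: Spark's real ConfigBuilder/ConfigEntry 
API differs, and only the names ConfigBindingPolicy and withBindingPolicy come 
from the proposal itself.

```scala
// Hypothetical sketch of declaring a binding policy when a conf is defined.
// ConfigEntry here is a simplified stand-in for Spark's internal ConfigEntry.
object BindingPolicySketch {
  object ConfigBindingPolicy extends Enumeration {
    val SESSION, PERSISTED, NOT_APPLICABLE = Value
  }

  final case class ConfigEntry(
      key: String,
      bindingPolicy: Option[ConfigBindingPolicy.Value] = None) {
    // Forces the developer to state how the conf binds in views/UDFs/procedures.
    def withBindingPolicy(p: ConfigBindingPolicy.Value): ConfigEntry =
      copy(bindingPolicy = Some(p))
  }
}
```

For example, a kill-switch conf would be declared with 
`.withBindingPolicy(ConfigBindingPolicy.SESSION)` so that a session-level 
revert reaches existing views.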
 
The hardcoded RETAINED_ANALYSIS_FLAGS list is replaced with a dynamic lookup 
that retains all configs with SESSION or NOT_APPLICABLE binding policy when 
resolving views and SQL UDFs. Configs that were previously in the hardcoded 
list are annotated with withBindingPolicy(SESSION).
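The dynamic lookup can be sketched as below. The types are self-contained 
stand-ins for Spark's ConfigEntry and the proposed ConfigBindingPolicy; only 
the retention rule (SESSION or NOT_APPLICABLE) comes from the proposal.

```scala
// Self-contained sketch of the dynamic replacement for RETAINED_ANALYSIS_FLAGS.
object RetainedFlags {
  sealed trait Policy
  case object SESSION extends Policy
  case object PERSISTED extends Policy
  case object NOT_APPLICABLE extends Policy

  final case class Entry(key: String, policy: Option[Policy])

  // Retain every conf declared SESSION or NOT_APPLICABLE when resolving
  // views and SQL UDFs, instead of consulting a hardcoded list.
  def retainedAnalysisFlags(all: Seq[Entry]): Set[String] =
    all.collect {
      case Entry(k, Some(SESSION))        => k
      case Entry(k, Some(NOT_APPLICABLE)) => k
    }.toSet
}
```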
 
A new enforcement test fails if any newly added config does not declare a 
bindingPolicy. Existing configs without a binding policy have been 
grandfathered into an exceptions allowlist. The long-term goal is to have all 
configs declare a binding policy and remove the exceptions allowlist entirely.
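The enforcement check could be sketched as follows. The names here are 
illustrative, not Spark's actual test code; the logic shown is just the rule 
described above: a conf without a binding policy is a violation unless it is 
in the grandfathered exceptions allowlist.

```scala
// Sketch of the enforcement test: flag every conf that neither declares a
// binding policy nor appears in the grandfathered exceptions allowlist.
object BindingPolicyLinter {
  final case class Entry(key: String, hasBindingPolicy: Boolean)

  def violations(all: Seq[Entry], grandfathered: Set[String]): Seq[String] =
    all.collect {
      case Entry(k, false) if !grandfathered.contains(k) => k
    }
}
```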
 
Why are all confs affected by the linter? Even within analysis, Spark can 
trigger a Spark job recursively which would potentially reference any conf (for 
example, this is needed for schema inference). The linter is active for all 
newly added confs regardless of whether they directly interact with view 
analysis.
 
Why not fix all existing confs? Currently there are over a thousand distinct 
configs in Spark. Fixing every single conf would introduce behavior changes. 
The linter only enforces the policy on new additions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
