Mihailo Timotic created SPARK-55928:
---------------------------------------
Summary: New linter for config effectiveness in views, UDFs and
procedures
Key: SPARK-55928
URL: https://issues.apache.org/jira/browse/SPARK-55928
Project: Spark
Issue Type: Documentation
Components: SQL
Affects Versions: 4.2.0
Reporter: Mihailo Timotic
Summary
Introduce a ConfigBindingPolicy framework in Apache Spark that requires every
newly added configuration to explicitly declare how its value is bound when
used within SQL views, UDFs, or procedures. This replaces the manually
maintained, hardcoded RETAINED_ANALYSIS_FLAGS allowlist in the Analyzer with a
dynamic, policy-driven approach.
Background: Conf + views mechanics
There are three ways Spark configs can interact with views:
1. The conf value is stored with the view/UDF/procedure at creation time and is
applied on read; the session value is deprioritized. Examples: the ANSI conf,
timezone.
2. The conf is not stored with the view, but its value propagates through the
view from the active session. Examples: kill-switches, feature flags.
3. The conf is neither stored with the view nor propagated through it. This is
the historical default in Spark.
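The three interaction modes above can be pictured with a small Python sketch.
This is illustrative only: the names (effective_value, SPARK_DEFAULTS, the
example confs) are hypothetical stand-ins, not Spark APIs; it only mirrors the
semantics described.

```python
# Illustrative model of the three conf/view interaction modes.
# None of these names exist in Spark; the conf names are stand-ins.

SPARK_DEFAULTS = {"ansi": False, "killSwitch": True, "uiPort": 4040}

def effective_value(conf, stored_with_view, propagated, session):
    """Value a conf takes when a view created earlier is queried now."""
    if conf in stored_with_view:          # mode 1: captured at creation
        return stored_with_view[conf]
    if conf in propagated:                # mode 2: flows from the session
        return session.get(conf, SPARK_DEFAULTS[conf])
    return SPARK_DEFAULTS[conf]           # mode 3: Spark default wins

# A view created with ANSI on, later queried in a session with ANSI off:
stored = {"ansi": True}
propagated = {"killSwitch"}
session = {"ansi": False, "killSwitch": False}

print(effective_value("ansi", stored, propagated, session))        # True (mode 1)
print(effective_value("killSwitch", stored, propagated, session))  # False (mode 2)
print(effective_value("uiPort", stored, propagated, session))      # 4040 (mode 3)
```

Mode 3 is the subtle one: the session's value is simply ignored.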
The confusion arises for configurations that are not captured on
view/UDF/procedure creation, but still need to be used when querying them. The
common assumption is that if a conf is not preserved upon creation, its value
inside the view/UDF/procedure will be whatever the value is in the currently
active session. This is NOT true.
If a conf is not preserved on creation, its value when querying the
view/UDF/procedure will be:
- The value from the currently active session, only if the conf is in a
hardcoded allowlist (RETAINED_ANALYSIS_FLAGS in Analyzer.scala).
- The Spark default otherwise.
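Concretely, today's resolution rule amounts to the following check. This is a
sketch, not Spark's code: RETAINED_ANALYSIS_FLAGS here is a stand-in for the
real list in Analyzer.scala, and the conf names and defaults are illustrative.

```python
# Sketch of how a non-persisted conf is resolved today when a view is read.
RETAINED_ANALYSIS_FLAGS = {"spark.sql.planChangeLog.level"}  # stand-in entries
SPARK_DEFAULTS = {
    "spark.sql.planChangeLog.level": "trace",
    "spark.sql.someNewFlag": False,  # hypothetical conf missing from the list
}

def value_inside_view(conf, session):
    # Session value only if allowlisted; otherwise the Spark default,
    # even when the session set something else -- the surprising part.
    if conf in RETAINED_ANALYSIS_FLAGS:
        return session.get(conf, SPARK_DEFAULTS[conf])
    return SPARK_DEFAULTS[conf]

session = {"spark.sql.someNewFlag": True}
print(value_inside_view("spark.sql.someNewFlag", session))  # False, not True
```

A developer who sets spark.sql.someNewFlag in their session would reasonably
expect True inside the view; because the conf was never added to the allowlist,
they silently get the default instead.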
This allowlist is extremely non-obvious and easy to forget about, and it has
caused regressions in the past: new configs affecting query semantics were not
added to the allowlist, so views and UDFs silently used Spark defaults instead
of session values.
Problem
The Analyzer.RETAINED_ANALYSIS_FLAGS list is a manually maintained hardcoded
allowlist of configs that should propagate from the active session when
resolving views and SQL UDFs. This approach is error-prone: developers adding
new configs that affect query semantics can easily forget to add them to this
list, causing subtle bugs where views and UDFs silently use Spark defaults
instead of session values. There is no automated enforcement to catch missing
entries. Even within analysis, Spark can recursively trigger a Spark job that
could reference any config (for example, schema inference requires this), so
the scope of affected configs is broader than it appears.
Proposed Solution
Introduce a ConfigBindingPolicy enum and require all newly added configs to
explicitly declare a binding policy. This forces developers to think about how
their config interacts with views, UDFs, and procedures at definition time.
The enum has three values:
- SESSION: The config value propagates from the active session to
views/UDFs/procedures. This is the most common policy. Use it for feature flags
or bugfix kill-switches where uniform behavior across the entire query is
desired. Think about it this way: if you make a behavior change, roll it out
enabled by default, then discover a bug and need to revert it, existing views
will still have the old behavior baked in unless the policy is SESSION.
Examples: plan change logging (spark.sql.planChangeLog.level), bugfixes
(spark.sql.analyzer.preferColumnOverLcaInArrayIndex).
- PERSISTED: The config uses the value saved at view/UDF/procedure creation
time, or the Spark default if none was saved. Use for configs that carry view
semantic meaning that should be consistent regardless of session changes. A
good example is ANSI mode -- views created with ANSI off should always have
ANSI off, regardless of the session value.
- NOT_APPLICABLE: The config does not interact with view/UDF/procedure
resolution at all. Only choose this if you are confident the config doesn't
interact with view/UDF/procedure analysis. If accessed at runtime, it behaves
the same as SESSION. Examples: UI confs, server confs.
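Resolution under the three policies can be sketched as follows. This is a
Python model for illustration only: the real implementation is Scala inside the
Analyzer, and the function and table names here (resolve, POLICIES, DEFAULTS)
are hypothetical.

```python
from enum import Enum, auto

class ConfigBindingPolicy(Enum):
    SESSION = auto()         # propagate from the active session
    PERSISTED = auto()       # use the value captured at creation, else default
    NOT_APPLICABLE = auto()  # no interaction with view/UDF/procedure analysis

# Stand-in policy declarations for three real conf names.
POLICIES = {
    "spark.sql.planChangeLog.level": ConfigBindingPolicy.SESSION,
    "spark.sql.ansi.enabled": ConfigBindingPolicy.PERSISTED,
    "spark.ui.port": ConfigBindingPolicy.NOT_APPLICABLE,
}
DEFAULTS = {
    "spark.sql.planChangeLog.level": "trace",
    "spark.sql.ansi.enabled": False,
    "spark.ui.port": 4040,
}

def resolve(conf, session, captured_at_creation):
    """Effective value of a conf when resolving a view/UDF/procedure."""
    policy = POLICIES[conf]
    if policy is ConfigBindingPolicy.PERSISTED:
        return captured_at_creation.get(conf, DEFAULTS[conf])
    # SESSION and NOT_APPLICABLE behave the same when accessed at runtime.
    return session.get(conf, DEFAULTS[conf])

session = {"spark.sql.ansi.enabled": True}
captured = {"spark.sql.ansi.enabled": False}  # view created with ANSI off
print(resolve("spark.sql.ansi.enabled", session, captured))  # False
```

Note how the ANSI example from the PERSISTED bullet falls out directly: the
creation-time value wins over the session value.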
The hardcoded RETAINED_ANALYSIS_FLAGS list is replaced with a dynamic lookup
that retains all configs with SESSION or NOT_APPLICABLE binding policy when
resolving views and SQL UDFs. Configs that were previously in the hardcoded
list are annotated with withBindingPolicy(SESSION).
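The dynamic replacement for the hardcoded list can be pictured as a lookup over
declared policies. Again an illustrative Python sketch with hypothetical names,
not Spark's code:

```python
from enum import Enum, auto

class ConfigBindingPolicy(Enum):
    SESSION = auto()
    PERSISTED = auto()
    NOT_APPLICABLE = auto()

# Stand-in policy declarations; in Spark these would come from each
# config's withBindingPolicy(...) annotation.
POLICIES = {
    "spark.sql.planChangeLog.level": ConfigBindingPolicy.SESSION,
    "spark.sql.ansi.enabled": ConfigBindingPolicy.PERSISTED,
    "spark.ui.port": ConfigBindingPolicy.NOT_APPLICABLE,
}

def retained_analysis_flags():
    """Replaces the hardcoded list: retained confs are derived, not enumerated."""
    return {name for name, policy in POLICIES.items()
            if policy in (ConfigBindingPolicy.SESSION,
                          ConfigBindingPolicy.NOT_APPLICABLE)}

print(sorted(retained_analysis_flags()))
# ['spark.sql.planChangeLog.level', 'spark.ui.port']
```

Adding a new SESSION config automatically makes it propagate; there is no
second list to remember to update.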
A new enforcement test fails if any newly added config does not declare a
bindingPolicy. Existing configs without a binding policy have been
grandfathered into an exceptions allowlist. The long-term goal is to have all
configs declare a binding policy and remove the exceptions allowlist entirely.
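The enforcement test described above can be sketched as a simple check over all
registered configs. The structure below is hypothetical (including the
GRANDFATHERED entries and function name); it only shows the shape of the rule.

```python
# Sketch of the enforcement check: every config outside the grandfathered
# allowlist must declare a binding policy, or the test fails.

GRANDFATHERED = {"spark.sql.legacy.someOldConf"}  # hypothetical entries

def check_binding_policies(all_configs):
    """Return configs violating the rule: no policy and not grandfathered."""
    return sorted(name for name, policy in all_configs.items()
                  if policy is None and name not in GRANDFATHERED)

configs = {
    "spark.sql.legacy.someOldConf": None,   # grandfathered: allowed
    "spark.sql.newFeature.enabled": None,   # new conf without a policy: caught
    "spark.sql.ansi.enabled": "PERSISTED",  # declared: fine
}
violations = check_binding_policies(configs)
assert violations == ["spark.sql.newFeature.enabled"], violations
print("violations:", violations)
```

Shrinking GRANDFATHERED over time, until it is empty, is the stated long-term
goal.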
Why are all confs affected by the linter? Even within analysis, Spark can
recursively trigger a Spark job that could reference any conf (for example,
schema inference requires this). The linter is therefore active for all newly
added confs, regardless of whether they directly interact with view analysis.
Why not fix all existing confs? Currently there are over a thousand distinct
configs in Spark. Fixing every single conf would introduce behavior changes.
The linter only enforces the policy on new additions.