dtenedor opened a new pull request, #52987:
URL: https://github.com/apache/spark/pull/52987
### What changes were proposed in this pull request?
This PR adds a new SQL configuration
`spark.sql.pipeOperator.allowAggregateInSelect` (default: `true`) that allows
aggregate functions to be used in pipe operator clauses such as `|> SELECT` and
`|> EXTEND` without requiring the explicit `|> AGGREGATE` keyword.
**Key changes:**
1. **New Configuration** (`SQLConf.scala`):
   - Added the `PIPE_OPERATOR_ALLOW_AGGREGATE_IN_SELECT` configuration (default: `true`)
   - When enabled, aggregate functions can be used in pipe operator clauses such as `|> SELECT` and `|> EXTEND`
   - When disabled, aggregate functions must use the `|> AGGREGATE` clause exclusively
2. **Updated Validation Logic** (`pipeOperators.scala`):
   - Converted `ValidateAndStripPipeExpressions` from an object to a case class that accepts the configuration value
   - Modified validation to conditionally check for aggregates based on the configuration value
3. **Analyzer Integration** (`Analyzer.scala`):
   - Updated to pass `conf.pipeOperatorAllowAggregateInSelect` to the validation rule
4. **Comprehensive Test Coverage** (`pipe-operators.sql`):
   - Added tests for aggregates in `|> SELECT` and `|> EXTEND`
   - Added tests for chaining, GROUP BY, and configuration toggling
   - Verified that `|> AGGREGATE` continues to work
   - Confirmed invalid queries (e.g., aggregates in WHERE) still fail appropriately
**Example queries now supported:**
-- Aggregate in SELECT
table employees |> select sum(salary) as total_salary;
-- Aggregate in EXTEND
table sales |> extend avg(amount) as avg_amount;
-- Aggregate with GROUP BY
table orders |> select customer_id, count(*) as order_count group by customer_id;
-- Chained operations
table data |> where status = 'active' |> select sum(value) as total;
### Why are the changes needed?
The previous restriction requiring the `|> AGGREGATE` keyword for all
aggregation operations was unnecessarily strict and inconsistent with standard
SQL syntax. This limitation:
1. **Reduced usability**: Users had to learn a Spark-specific syntax
restriction
2. **Lacked flexibility**: Simple aggregations required verbose `|>
AGGREGATE` syntax
3. **Created confusion**: The restriction didn't align with SQL semantics
where aggregates work naturally in SELECT clauses
By lifting this restriction (with an opt-out mechanism), we make the SQL
pipe operator syntax more intuitive and consistent with standard SQL while
maintaining backwards compatibility.
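To make the verbosity point concrete, the sketch below places the two spellings side by side, reusing the `employees` table and `salary` column from the example queries above (both are illustrative names, not tables shipped with Spark). That the two statements return the same single-row result is an assumption based on the description in this PR, not something verified here.

```sql
-- Spelling required before this change: aggregation goes through
-- the dedicated |> AGGREGATE operator.
table employees |> aggregate sum(salary) as total_salary;

-- Spelling accepted when spark.sql.pipeOperator.allowAggregateInSelect
-- is true (the default proposed in this PR).
table employees |> select sum(salary) as total_salary;
```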
### Does this PR introduce _any_ user-facing change?
**Yes**, but it is **backwards compatible**:
- **Previously failing queries now succeed**: Queries using aggregate
functions in `|> SELECT`, `|> EXTEND`, etc. will now work instead of throwing
`PIPE_OPERATOR_CONTAINS_AGGREGATE_FUNCTION` errors
- **All previously succeeding queries continue to work**: No regression;
queries using `|> AGGREGATE` or non-aggregate pipe operators are unaffected
- **Opt-out available**: Users can restore the previous strict behavior by
setting `spark.sql.pipeOperator.allowAggregateInSelect=false`
**Backwards Compatibility Guarantee:**
- ✅ No queries that worked before will break
- ✅ Only queries that previously failed will now succeed
- ✅ Configuration available to disable new behavior if needed
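A minimal sketch of the opt-out for a single session, assuming the configuration can be changed at runtime with `SET` (the configuration-toggling tests described in this PR suggest so). The `employees` table is the illustrative one from the examples above, and the behavior noted in the comments follows the description in this PR rather than captured output.

```sql
-- Restore the previous strict behavior for the current session.
SET spark.sql.pipeOperator.allowAggregateInSelect=false;

-- With the flag disabled, this is expected to be rejected again with
-- PIPE_OPERATOR_CONTAINS_AGGREGATE_FUNCTION.
table employees |> select sum(salary) as total_salary;

-- The explicit operator is unaffected by the flag.
table employees |> aggregate sum(salary) as total_salary;

-- Return to the default proposed in this PR.
SET spark.sql.pipeOperator.allowAggregateInSelect=true;
```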
### How was this patch tested?
1. **Unit Tests**: Added comprehensive test coverage in `pipe-operators.sql`:
   - Positive tests: aggregates in SELECT, EXTEND, with WHERE, with chaining, with GROUP BY
   - Negative tests: aggregates in WHERE (still fail as expected)
   - Configuration tests: toggling between enabled/disabled states
   - Regression tests: verified `|> AGGREGATE` still works correctly
2. **Golden Files**: Regenerated and verified `pipe-operators.sql.out` and analyzer results
3. **Test Execution**: All tests pass successfully.
### Was this patch authored or co-authored using generative AI tooling?
Yes, `claude-4.5-sonnet` with manual editing and approval.