[PR] commit [spark]

via GitHub Mon, 10 Nov 2025 16:40:16 -0800


dtenedor opened a new pull request, #52987:
URL: https://github.com/apache/spark/pull/52987


   ### What changes were proposed in this pull request?
   
   This PR adds a new SQL configuration 
`spark.sql.pipeOperator.allowAggregateInSelect` (default: `true`) that allows 
aggregate functions to be used in pipe operator clauses such as `|> SELECT` and 
`|> EXTEND` without requiring the explicit `|> AGGREGATE` keyword.
   
   **Key changes:**
   
   1. **New Configuration** (`SQLConf.scala`):
      - Added `PIPE_OPERATOR_ALLOW_AGGREGATE_IN_SELECT` configuration (default: 
`true`)
      - When enabled, aggregate functions can be used in pipe operator clauses 
such as `|> SELECT` and `|> EXTEND`, not only in `|> AGGREGATE`
      - When disabled, aggregate functions are allowed only in the 
`|> AGGREGATE` clause
   
   2. **Updated Validation Logic** (`pipeOperators.scala`):
      - Converted `ValidateAndStripPipeExpressions` from an object to a case 
class accepting the configuration
      - Modified validation to conditionally check for aggregates based on the 
configuration value
   
   3. **Analyzer Integration** (`Analyzer.scala`):
      - Updated to pass `conf.pipeOperatorAllowAggregateInSelect` to the 
validation rule
   
   4. **Comprehensive Test Coverage** (`pipe-operators.sql`):
      - Added tests for aggregates in `|> SELECT` and `|> EXTEND`
      - Tests for chaining, GROUP BY, and configuration toggling
      - Verified that `|> AGGREGATE` continues to work
      - Confirmed invalid queries (e.g., aggregates in WHERE) still fail 
appropriately
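   
   As a hedged illustration of the negative case mentioned in the last bullet, 
the sketch below shows an aggregate placed directly in `|> WHERE`, which the 
description says is still rejected, next to one way to filter on an aggregate 
that uses only `|> AGGREGATE`. The `employees` view and its sample rows are 
hypothetical and are not taken from the PR's test file.
   
   ```sql
   -- Hypothetical sample data, used only for illustration.
   create or replace temporary view employees as
     select * from values ('alice', 100), ('bob', 200) as t(name, salary);
   
   -- Per the description, this is still expected to fail analysis, whatever
   -- the value of spark.sql.pipeOperator.allowAggregateInSelect.
   table employees |> where sum(salary) > 100;
   
   -- One alternative: compute the aggregate first, then filter its result.
   table employees
   |> aggregate sum(salary) as total_salary
   |> where total_salary > 100;
   ```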
   
   **Example queries now supported:**
   
   -- Aggregate in SELECT
   table employees |> select sum(salary) as total_salary;
   
   -- Aggregate in EXTEND
   table sales |> extend avg(amount) as avg_amount;
   
   -- Aggregate with GROUP BY
   table orders |> select customer_id, count(*) as order_count group by 
customer_id;
   
   -- Chained operations
   table data |> where status = 'active' |> select sum(value) as total;
   
   ### Why are the changes needed?
   
   The previous restriction requiring the `|> AGGREGATE` keyword for all 
aggregation operations was unnecessarily strict and inconsistent with standard 
SQL syntax. This limitation:
   
   1. **Reduced usability**: Users had to learn a Spark-specific syntax 
restriction
   2. **Lacked flexibility**: Simple aggregations required verbose `|> 
AGGREGATE` syntax
   3. **Created confusion**: The restriction didn't align with SQL semantics 
where aggregates work naturally in SELECT clauses
   
   By lifting this restriction (with an opt-out mechanism), we make the SQL 
pipe operator syntax more intuitive and consistent with standard SQL while 
maintaining backwards compatibility.
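   
   To make the comparison concrete, here is a minimal sketch that puts the 
`|> AGGREGATE` form next to the `|> SELECT` form this PR proposes to accept. 
The `employees` and `orders` views and their rows are hypothetical. The 
description does not spell out the result of each form, so treating the two as 
equivalent is an assumption rather than documented behavior.
   
   ```sql
   -- Hypothetical sample data, used only for illustration.
   create or replace temporary view employees as
     select * from values ('alice', 100), ('bob', 200) as t(name, salary);
   create or replace temporary view orders as
     select * from values (1, 10), (1, 20), (2, 5) as t(customer_id, amount);
   
   -- Form required before this PR: aggregation goes through |> AGGREGATE.
   table employees |> aggregate sum(salary) as total_salary;
   table orders |> aggregate count(*) as order_count group by customer_id;
   
   -- Form proposed by this PR (with the new configuration left at `true`):
   -- the aggregate appears directly in |> SELECT.
   table employees |> select sum(salary) as total_salary;
   ```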
   
   ### Does this PR introduce _any_ user-facing change?
   
   **Yes**, but it is **backwards compatible**:
   
   - **Previously failing queries now succeed**: Queries using aggregate 
functions in `|> SELECT`, `|> EXTEND`, etc. will now work instead of throwing 
`PIPE_OPERATOR_CONTAINS_AGGREGATE_FUNCTION` errors
   - **All previously succeeding queries continue to work**: No regression; 
queries using `|> AGGREGATE` or non-aggregate pipe operators are unaffected
   - **Opt-out available**: Users can restore the previous strict behavior by 
setting `spark.sql.pipeOperator.allowAggregateInSelect=false`
   
   **Backwards Compatibility Guarantee:**
   - ✅ No queries that worked before will break
   - ✅ Only queries that previously failed will now succeed
   - ✅ Configuration available to disable new behavior if needed
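   
   The opt-out described above might be exercised as in the following sketch. 
It assumes the configuration is available under the name given in this 
description and can be changed per session with `SET`. The `employees` view 
is hypothetical.
   
   ```sql
   -- Hypothetical sample data, used only for illustration.
   create or replace temporary view employees as
     select * from values ('alice', 100), ('bob', 200) as t(name, salary);
   
   -- Default proposed by this PR: aggregates are accepted in |> SELECT.
   set spark.sql.pipeOperator.allowAggregateInSelect=true;
   table employees |> select sum(salary) as total_salary;
   
   -- Opt out to restore the previous strict behavior.
   set spark.sql.pipeOperator.allowAggregateInSelect=false;
   
   -- Per the description, this is now expected to fail with
   -- PIPE_OPERATOR_CONTAINS_AGGREGATE_FUNCTION.
   table employees |> select sum(salary) as total_salary;
   
   -- |> AGGREGATE is unaffected by the setting.
   table employees |> aggregate sum(salary) as total_salary;
   ```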
   
   ### How was this patch tested?
   
   1. **Unit Tests**: Added comprehensive test coverage in `pipe-operators.sql`:
      - Positive tests: aggregates in SELECT, EXTEND, with WHERE, with 
chaining, with GROUP BY
      - Negative tests: aggregates in WHERE (still fails as expected)
      - Configuration tests: toggling between enabled/disabled states
      - Regression tests: verified `|> AGGREGATE` still works correctly
   
   2. **Golden Files**: Regenerated and verified `pipe-operators.sql.out` and 
analyzer results
   
   3. **Test Execution**: All tests pass successfully.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Yes, `claude-4.5-sonnet` with manual editing and approval.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

