dtenedor opened a new pull request, #52983:
URL: https://github.com/apache/spark/pull/52983

   ### What changes were proposed in this pull request?
   
   This PR adds support for `|` as an alternative to `|>` for the SQL pipe 
operator token, while maintaining full backwards compatibility with existing 
bitwise OR operations.
   
   For example, this is now supported:
   
   ```sql
   table t
   | select x, y
   | where x < 2;
   ```
   
   as an alternative to:
   
   ```sql
   table t
   |> select x, y
   |> where x < 2;
   ```
   
   The implementation uses a semantic predicate with 2-token lookahead 
(`isOperatorPipeStart()`) to disambiguate between:
   - **Pipe operators**: When `|` is followed by keywords like `SELECT`, 
`WHERE`, `EXTEND`, `JOIN`, etc.
   - **Bitwise OR**: When `|` is part of an expression (e.g., `col1 | col2`)
   
   This approach ensures that existing SQL queries using `|` for bitwise OR 
operations continue to work without any changes, including edge cases where 
column names match pipe operator keywords (e.g., `col1 | select` where `select` 
is a column name).
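
A minimal Python sketch of this kind of lookahead check follows. It is a toy model, not the ANTLR predicate from this PR: the tokenizer, the abbreviated `PIPE_KEYWORDS` set (join-type prefixes such as `left` or `inner` are omitted), and the `EXPRESSION_ENDERS` heuristic are all assumptions made for illustration. The function name `is_operator_pipe_start` only mirrors `isOperatorPipeStart()` and is hypothetical.

```python
import re

# Assumption: an abbreviated keyword set drawn from the operators listed in
# this PR description; the real grammar may accept more or different tokens.
PIPE_KEYWORDS = {
    "select", "extend", "set", "drop", "as", "where", "pivot", "unpivot",
    "tablesample", "join", "union", "except", "intersect", "minus", "order",
    "limit", "offset", "distribute", "cluster", "aggregate", "window",
}

# Assumption: if one of these tokens follows the keyword, the keyword is
# treated as a column name, so the preceding `|` is a bitwise OR.
EXPRESSION_ENDERS = {")", ",", ";", "|", "|>", "+", "<", ">", "=", "from"}


def tokenize(sql):
    """Split SQL text into `|>`, identifiers, integers, and single symbols."""
    return re.findall(r"\|>|[A-Za-z_][A-Za-z_0-9]*|\d+|[^\sA-Za-z_0-9]", sql)


def is_operator_pipe_start(tokens, i):
    """Decide whether the `|` at index i starts a pipe operator.

    Looks at the two tokens after `|`: the first must be a pipe keyword,
    and the second must not look like the end of an expression.
    """
    first = tokens[i + 1].lower() if i + 1 < len(tokens) else None
    if first not in PIPE_KEYWORDS:
        return False
    second = tokens[i + 2].lower() if i + 2 < len(tokens) else None
    return second is not None and second not in EXPRESSION_ENDERS


def classify(sql):
    """Label every bare `|` in the query as 'pipe' or 'bitwise_or'."""
    tokens = tokenize(sql)
    return [
        "pipe" if is_operator_pipe_start(tokens, i) else "bitwise_or"
        for i, tok in enumerate(tokens)
        if tok == "|"
    ]
```

Under these assumptions, `classify("table t | select x, y | where x < 2")` labels both tokens as pipes, while `classify("select col1 | select from t")` labels the single `|` as a bitwise OR because `from` follows the keyword-named column.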
   
   ### Why are the changes needed?
   
   This provides syntax compatibility with other languages that use `|` for 
pipe operations, such as:
   - Splunk SPL
   - Kusto (KQL)
   - Unix shell pipes
   
   **Background:**
   
   We previously attempted this in https://github.com/apache/spark/pull/50284 
but abandoned that approach because it inadvertently broke bitwise OR 
expression usage. After further investigation, we've developed a solution using 
ANTLR semantic predicates that properly disambiguates the two contexts.
   
   As discussed in that PR:
   - Jeff Shute (author of the SQL pipe syntax paper from Google) confirmed 
that Google uses an LALR parser which makes it impossible for them to support 
`|` due to ambiguity with bitwise operations
   - There is growing industry consensus that `|>` should be the 
primary/universal token, but engines may optionally support additional tokens
   - This approach aligns with how other databases have addressed the same 
ambiguity (see https://superdb.org/docs/language/pipe-ambiguity/)
   
   Spark's use of ANTLR (not LALR) enables us to support both tokens through 
lookahead-based disambiguation.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, users can now use `|` as a more concise alternative to `|>` for pipe 
operators.
   
   **This change is fully backwards-compatible:**
   - All existing queries using `|>` continue to work
   - All existing queries using `|` for bitwise OR continue to work
   - Users can even mix `|` and `|>` in the same query
   
   ### How was this patch tested?
   
   This PR includes comprehensive test coverage in `pipe-operators.sql`:
   
   **Positive tests:** All pipe operator types using `|` syntax:
   - SELECT, EXTEND, SET, DROP, AS, WHERE
   - PIVOT, UNPIVOT, TABLESAMPLE
   - JOIN (all types: inner, cross, left, right, full, semi, anti, lateral, 
natural)
   - Set operations (UNION, EXCEPT, INTERSECT, MINUS)
   - ORDER BY, LIMIT, OFFSET, DISTRIBUTE BY, CLUSTER BY
   - AGGREGATE, WINDOW
   - Chained operations with multiple pipe operators
   - Mixed `|` and `|>` syntax in the same query
   
   **Negative tests (backwards compatibility):** Bitwise OR operations with 
keyword column names:
   - Tested bitwise OR with columns named: `select`, `extend`, `set`, `drop`, 
`as`, `where`, `order`, `limit`, `aggregate`, `window`, `pivot`, `unpivot`, 
`join`, `union`, `intersect`, `except`
   - Complex bitwise OR expressions: `(col1 | select) + (where | order)`
   - Multiple bitwise OR chains: `col1 | select | where | order`
   - Bitwise OR in WHERE clauses: `where (col1 | select) > 2`
   
   All tests pass successfully.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Yes, `claude-4.5-sonnet` with manual review and editing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

