iffyio commented on code in PR #1435: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1435#discussion_r1846920232
########## README.md: ########## @@ -100,6 +100,23 @@ similar semantics are represented with the same AST. We welcome PRs to fix such issues and distinguish different syntaxes in the AST. +## WIP: Extracting source locations from AST nodes + +This crate allows recovering source locations from AST nodes via the [Spanned](https://docs.rs/sqlparser/latest/sqlparser/ast/trait.Spanned.html) trait, which can be used for advanced diagnostics tooling. Node that this feature is a work in progress and many nodes report missing or inaccurate spans. Please see [this document](./docs/source_spans.md#source-span-contributing-guidelines) for information on how to contribute missing improvements. Review Comment: ```suggestion This crate allows recovering source locations from AST nodes via the [Spanned](https://docs.rs/sqlparser/latest/sqlparser/ast/trait.Spanned.html) trait, which can be used for advanced diagnostics tooling. Note that this feature is a work in progress and many nodes report missing or inaccurate spans. Please see [this document](./docs/source_spans.md#source-span-contributing-guidelines) for information on how to contribute missing improvements. ``` ########## docs/source_spans.md: ########## @@ -0,0 +1,52 @@ + +## Breaking Changes + +These are the current breaking changes introduced by the source spans feature: + +#### Added fields for spans (must be added to any existing pattern matches) +- `Ident` now stores a `Span` +- `Select`, `With`, `Cte`, `WildcardAdditionalOptions` now store a `TokenWithLocation` + +#### Misc. +- `TokenWithLocation` stores a full `Span`, rathern than just a source location. Users relying on `token.location` should use `token.location.start` instead. +## Source Span Contributing Guidelines + +For contributing source spans improvement in addition to the general [contribution guidelines](../README.md#contributing), please make sure to pay attention to the following: + + +### Source Span Design Considerations + +- `Ident` always have correct source spans +- Downstream breaking change impact is to be as minimal as possible +- To this end, use recursive merging of spans in favor of storing spans on all nodes +- Any metadata added to compute spans must not change semantics (Eq, Ord, Hash, etc.) + +The primary reason for missing and inaccurate source spans at this time is missing spans of keyword tokens and values in many structures, either due to lack of time or because adding them would break downstream significantly. + +When considering adding support for source spans on a type, consider the impact to consumers of that type and whether your change would require a consumer to do non-trivial changes to their code. + +f.e. of a trivial change +```rust +match node { + ast::Query { + field1, + field2, + location: _, // add a new line to ignored location +} +``` + +If adding source spans to a type would require a significant change like wrapping that type or similar, please open an issue to discuss. + +### AST Node Equality and Hashes + +When adding tokens to AST nodes, make sure to wrap them in the [IgnoreField](https://docs.rs/sqlparser/latest/sqlparser/ast/helpers/struct.IgnoreField.html) helper to ensure that semantically equivalent AST nodes always compare as equal and hash to the same value. F.e. `select 5` and `SELECT 5` would compare as different `Select` nodes, if the select token was stored directly. f.e. Review Comment: We can update this part to the new struct? ########## tests/sqlparser_postgres.rs: ########## @@ -2507,6 +2520,7 @@ fn parse_array_subquery_expr() { op: SetOperator::Union, set_quantifier: SetQuantifier::None, left: Box::new(SetExpr::Select(Box::new(Select { + select_token: TokenWithLocation::wrap(Token::make_keyword("SELECT")).into(), Review Comment: To make places like tests that don't care about the token leaner, could we add a helper function like `AttachedToken::empty()` that returns a dummy token? ########## src/parser/mod.rs: ########## @@ -9178,8 +9204,9 @@ impl<'a> Parser<'a> { } /// Parse a restricted `SELECT` statement (no CTEs / `UNION` / `ORDER BY`), - /// assuming the initial `SELECT` was already consumed + /// assuming the initial `SELECT` was not yet consumed Review Comment: ```suggestion ``` Figured, we could remove it entirely since that's the default/expected behavior ########## src/tokenizer.rs: ########## @@ -431,36 +433,143 @@ pub struct Location { } impl fmt::Display for Location { - fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { + fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result { if self.line == 0 { return Ok(()); } - write!( - f, - // TODO: use standard compiler location syntax (<path>:<line>:<col>) - " at Line: {}, Column: {}", - self.line, self.column, - ) + write!(f, " at Line: {}, Column: {}", self.line, self.column) + } +} + +impl fmt::Debug for Location { + fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result { + write!(f, "Location({},{})", self.line, self.column) + } +} + +impl Location { + pub fn of(line: u64, column: u64) -> Self { + Self { line, column } + } + + pub fn span_to(self, end: Self) -> Span { + Span { start: self, end } + } +} + +impl From<(u64, u64)> for Location { + fn from((line, column): (u64, u64)) -> Self { + Self { line, column } + } +} + +/// A span of source code locations (start, end) +#[derive(Eq, PartialEq, Hash, Clone, PartialOrd, Ord, Copy)] +#[cfg_attr(feature = "serde", derive(Serialize, Deserialize))] +#[cfg_attr(feature = "visitor", derive(Visit, VisitMut))] +pub struct Span { + pub start: Location, + pub end: Location, +} + +impl fmt::Debug for Span { + fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result { + write!(f, "Span({:?}..{:?})", self.start, self.end) + } +} + +impl Span { + // An empty span (0, 0) -> (0, 0) + // We need a const instance for pattern matching + const EMPTY: Span = Self::empty(); + + pub fn new(start: Location, end: Location) -> Span { + Span { start, end } + } + + /// Returns an empty span (0, 0) -> (0, 0) + /// Empty spans represent no knowledge of source location + pub const fn empty() -> Span { + Span { + start: Location { line: 0, column: 0 }, + end: Location { line: 0, column: 0 }, + } + } + + /// Returns the smallest Span that contains both `self` and `other` + /// If either span is [Span::empty], the other span is returned + pub fn union(&self, other: &Span) -> Span { + // If either span is empty, return the other + // this prevents propagating (0, 0) through the tree + match (self, other) { + (&Span::EMPTY, _) => *other, + (_, &Span::EMPTY) => *self, + _ => Span { + start: cmp::min(self.start, other.start), + end: cmp::max(self.end, other.end), + }, + } + } + + /// Same as [Span::union] for `Option<Span>` + /// If `other` is `None`, `self` is returned + pub fn union_opt(&self, other: &Option<Span>) -> Span { + match other { + Some(other) => self.union(other), + None => *self, + } } } /// A [Token] with [Location] attached to it -#[derive(Debug, Eq, PartialEq, Clone)] +#[derive(Debug, Eq, Clone)] +#[cfg_attr(feature = "serde", derive(Serialize, Deserialize))] +#[cfg_attr(feature = "visitor", derive(Visit, VisitMut))] pub struct TokenWithLocation { pub token: Token, - pub location: Location, + pub span: Span, } impl TokenWithLocation { - pub fn new(token: Token, line: u64, column: u64) -> TokenWithLocation { - TokenWithLocation { - token, - location: Location { line, column }, - } + pub fn new(token: Token, span: Span) -> TokenWithLocation { + TokenWithLocation { token, span } } pub fn wrap(token: Token) -> TokenWithLocation { - TokenWithLocation::new(token, 0, 0) + TokenWithLocation::new(token, Span::empty()) + } + + pub fn at(token: Token, start: Location, end: Location) -> TokenWithLocation { + TokenWithLocation::new(token, Span::new(start, end)) + } +} + +impl core::hash::Hash for TokenWithLocation { + fn hash<H: hash::Hasher>(&self, state: &mut H) { + let TokenWithLocation { token, span: _ } = self; + + token.hash(state); + } +} + +impl PartialEq<TokenWithLocation> for TokenWithLocation { + fn eq(&self, other: &TokenWithLocation) -> bool { + let TokenWithLocation { token, span: _ } = self; + + token == &other.token + } +} + +impl PartialOrd<TokenWithLocation> for TokenWithLocation { + fn partial_cmp(&self, other: &TokenWithLocation) -> Option<cmp::Ordering> { + Some(self.cmp(other)) + } +} + +impl Ord for TokenWithLocation { + fn cmp(&self, other: &Self) -> cmp::Ordering { + let TokenWithLocation { token, span: _ } = self; + token.cmp(&other.token) Review Comment: Ah do we still need these custom impls now that we have the wrapper type or we can go back to the derives for TokenWithLocation now? ########## src/parser/mod.rs: ########## @@ -3243,23 +3250,30 @@ impl<'a> Parser<'a> { pub fn expected<T>(&self, expected: &str, found: TokenWithLocation) -> Result<T, ParserError> { parser_err!( format!("Expected: {expected}, found: {found}"), - found.location + found.span.start ) } /// If the current token is the `expected` keyword, consume it and returns /// true. Otherwise, no tokens are consumed and returns false. #[must_use] pub fn parse_keyword(&mut self, expected: Keyword) -> bool { + self.parse_keyword_token(expected).is_some() + } + + #[must_use] + pub fn parse_keyword_token(&mut self, expected: Keyword) -> Option<TokenWithLocation> { match self.peek_token().token { - Token::Word(w) if expected == w.keyword => { - self.next_token(); - true - } - _ => false, + Token::Word(w) if expected == w.keyword => Some(self.next_token()), + _ => None, } } + #[must_use] + pub fn peek_keyword(&mut self, expected: Keyword) -> bool { + matches!(self.peek_token().token, Token::Word(w) if expected == w.keyword) + } Review Comment: Ah heads up that the `peek_keyword` helper also landed recently on main, so that we should be able to skip this after merging in main to the branch ########## docs/source_spans.md: ########## @@ -0,0 +1,52 @@ + +## Breaking Changes + +These are the current breaking changes introduced by the source spans feature: + +#### Added fields for spans (must be added to any existing pattern matches) +- `Ident` now stores a `Span` +- `Select`, `With`, `Cte`, `WildcardAdditionalOptions` now store a `TokenWithLocation` + +#### Misc. +- `TokenWithLocation` stores a full `Span`, rathern than just a source location. Users relying on `token.location` should use `token.location.start` instead. Review Comment: ```suggestion - `TokenWithLocation` stores a full `Span`, rather than just a source location. Users relying on `token.location` should use `token.location.start` instead. ``` ########## src/ast/mod.rs: ########## @@ -902,10 +967,10 @@ pub enum Expr { /// `<search modifier>` opt_search_modifier: Option<SearchModifier>, }, - Wildcard, + Wildcard(TokenWithLocation), /// Qualified wildcard, e.g. `alias.*` or `schema.table.*`. /// (Same caveats apply to `QualifiedWildcard` as to `Wildcard`.) - QualifiedWildcard(ObjectName), + QualifiedWildcard(ObjectName, TokenWithLocation), Review Comment: ```suggestion QualifiedWildcard(ObjectName, AttachedToken), ``` ########## docs/source_spans.md: ########## @@ -0,0 +1,52 @@ + +## Breaking Changes + +These are the current breaking changes introduced by the source spans feature: + +#### Added fields for spans (must be added to any existing pattern matches) +- `Ident` now stores a `Span` +- `Select`, `With`, `Cte`, `WildcardAdditionalOptions` now store a `TokenWithLocation` + +#### Misc. +- `TokenWithLocation` stores a full `Span`, rathern than just a source location. Users relying on `token.location` should use `token.location.start` instead. +## Source Span Contributing Guidelines + +For contributing source spans improvement in addition to the general [contribution guidelines](../README.md#contributing), please make sure to pay attention to the following: + + +### Source Span Design Considerations + +- `Ident` always have correct source spans +- Downstream breaking change impact is to be as minimal as possible +- To this end, use recursive merging of spans in favor of storing spans on all nodes +- Any metadata added to compute spans must not change semantics (Eq, Ord, Hash, etc.) + +The primary reason for missing and inaccurate source spans at this time is missing spans of keyword tokens and values in many structures, either due to lack of time or because adding them would break downstream significantly. + +When considering adding support for source spans on a type, consider the impact to consumers of that type and whether your change would require a consumer to do non-trivial changes to their code. + +f.e. of a trivial change Review Comment: ```suggestion Example of a trivial change: ``` ########## src/ast/helpers/attached_token.rs: ########## @@ -0,0 +1,68 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use core::cmp::{Eq, Ord, Ordering, PartialEq, PartialOrd}; +use core::fmt::{self, Debug, Formatter}; +use core::hash::{Hash, Hasher}; + +use crate::tokenizer::TokenWithLocation; + +/// A wrapper type for attaching tokens to AST nodes that should be ignored in comparisons and hashing. +/// This should be used when a token is not relevant for semantics, but is still needed for +/// accurate source location tracking. +#[derive(Clone)] Review Comment: Ah I think we might be missing the serde and visit derive on this struct given its attached to structs in the ast? ########## src/ast/mod.rs: ########## @@ -902,10 +967,10 @@ pub enum Expr { /// `<search modifier>` opt_search_modifier: Option<SearchModifier>, }, - Wildcard, + Wildcard(TokenWithLocation), Review Comment: ```suggestion Wildcard(AttachedToken), ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org