You shouldn't need to solve this without dealing with SqlScopedShuttle. If the validator does its job (and yes, it uses SqlScopedShuttle to do its job) then all identifiers will be resolved to the objects they reference.
If I were you I would start by running SqlValidatorTest.testFieldOrigin in a debugger. You will see that after validation has succeeded it calls validator.getFieldOrigins(). Once you have that working, you can consider variations on a theme (e.g. using Babel parser rather than the Core parser, or getting different kinds of lineage information). Maybe you'll run into bugs, and if so, you can add those as test cases in SqlValidatorTest or Babel test, and then maybe someone (maybe you) will end up fixing SqlScopedShuttle, but there are several maybes before we get there. Julian On Thu, Nov 11, 2021 at 10:42 AM Mark Grey <mgthesec...@spotify.com.invalid> wrote: > > In my efforts to continue learning about the internals of Calcite, I've > been experimenting with applying Calcite's Babel parser to the task of > extracting fully qualified field references from a query. The use case > here would be to enumerate all the fields of a schema that are "touched" by > a given query, for the purpose of informing table owners about the usage by > consumers of particular fields. > > It seems there has been some prior interest in this use case, but the few > of the stackoverflow posts I could find on the topic are likely outdated. > One post that has proved helpful was this one from Julian > <https://stackoverflow.com/a/37554628/215608>, which explains the roles of > namespaces and scopes in the validation process. > > The comments in that thread mention SqlScopedShuttle only maintaining a > stack of SqlNode, but it seems it has changed to keep a stack of > SqlValidatorScopes? For the most part, I seem to be able to use that Deque > to match identifiers to their tables, following on an example drawn from > the getFieldOrigins method of SqlValidatorImpl > <https://calcite.apache.org/javadocAggregate/org/apache/calcite/sql/validate/SqlValidator.html#getFieldOrigins(org.apache.calcite.sql.SqlNode)>. > My shuttle just uses the visit method for SqlIdentifier to try to qualify > the identifier against the namespace associated with the current scope, > discarding all that fail to be found. > > Here's my really naaive approach with some test cases expressed as yaml > <https://gist.github.com/DeaconDesperado/416f11bf91d0ca60b3c30b22626f4178>. > Obviously there's some cases that fail badly with this approach, one of > which I've included wherein an UNNEST is used on an array field containing > subrecords. > > A few questions I have: > > - Does this approach make sense for this use case, or would it make more > sense to use an implementation of SqlValidator, which seems to maintain > more state as to the scopes + namespaces? > - Would it be easier to approach this particular use case from > navigating the generated relational algebra rather than the SQL AST? > - Does the static factory method for the shuttle make sense > idiomatically with the intended usage for SqlScopedShuttle? How typically > should one obtain the intialScope? I've tried to enumerate both the top > level SqlSelect as well as the potential for a list of CTEs using WITH > > Many thanks!