Hey Nathaniel,
To parse DDL statements, you need to use a different parser (see
SqlParser.Config.withParserFactory
<https://calcite.apache.org/javadocAggregate/org/apache/calcite/sql/parser/SqlParser.Config.html#withParserFactory(org.apache.calcite.sql.parser.SqlParserImplFactory)>)
because core Calcite parser
<https://github.com/apache/calcite/blob/main/core/src/main/codegen/templates/Parser.jj>
does
not support DDL statements by design. You have two options:
1) Use SqlDdlParserImpl.FACTORY (from calcite-server package, as you've
said)
This is an extended parser that can also parse DDL queries.

2)  Use SqlBabelParserImpl.FACTORY (from calcite-babel package)
This is also an extended parser that can not only parse DDL queries, but
also many extra things not present in ISO SQL standard, such as Postgre's
infix CAST operator *::*
Since you're doing lineage generation I highly recommend using babel
package for maximum compatibility with different DBs' queries.

*Mini-explanation on how parsing works in Calcite*
Try Go To'ing to SqlParserImpl code in your IDE. You're going to see a huge
file, full of almost nonsensical if's.
The reason is simple: Parser code is build-time generated Java code based
on rules defined in core Calcite parser
<https://github.com/apache/calcite/blob/main/core/src/main/codegen/templates/Parser.jj>.
Parser.jj is a JavaCC file; JavaCC <https://javacc.github.io/javacc/> is a
parser generator that, based on rules you define, generates a pure Java
code which can parse LL(k) grammars and generate the parse tree (SqlNode).

Parser.jj, as seen on the repo, is not a pure JavaCC file however, but an
Apache FreeMarker template. There are strings in the Parser.jj file that
start with *${ .*
Those are placeholders used by Apache FreeMarker, a templating engine.
Extended parsers (babel, ddlparser) use those placeholders to insert their
custom parsing rules without directly affecting the core parser file (it
all still happens build-time, however).
How Babel parser does it, for example:
https://github.com/apache/calcite/blob/main/babel/src/main/codegen

*An advice for lineage generation*
If you're going to create a lineage generator, I highly recommend using a
relational tree (RelNode tree) instead of parse tree (SqlNode) if you have
access to DB tables.
After all, a lineage shows the relation between tables/columns, and so does
a relational tree. There is even a built-in method for lineage:
getExpressionLineage
<https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/metadata/RelMetadataQuery.html#getExpressionLineage(org.apache.calcite.rel.RelNode,org.apache.calcite.rex.RexNode)>

Another advice, if you're planning to support different kinds of DBs, you
will eventually run into something unparseable, and that will require the
extension of the core Parser/babel Parser. Since it's build time, I suggest
you submitting a PR to extend the parsers.

Good luck,
Askar



On 23 Jun 2023 Fri at 07:37 Nathaniel Vala <[email protected]>
wrote:

> Hi All,
>
> I have been trying to build a java tool that would let people map lineage
> by reading sql scripts (i.e. views or insert into etc.) Im having a little
> trouble with a couple of things and was hoping for some pointers.
>
> Firstly, I cant seem to parse any DDL statements (so trouble with `CREATE
> VIEW AS [SQL QUERY]`).I understand this is meant to be in the
> calcite-server module but cant really find anything).
> I decided to ignore the CREATE statements for the moment and just process
> the query to get the sources in it which was working on simple scripts but
> fail real quick when looking at things I've seen at enterprises.
>
> I have a GitHub repo here<https://github.com/Spydernaz/sqlLineage>, some
> pointers would be awesome.
>
> Kind Regards,
> Nathaniel Vala
>

Reply via email to