Hi, We want add support for non-primitive types (ROW, ARRAY, MAP) to Apache Beam SQL DDL (based on Calcite DDL extensions). What would be the best way to approach this?
*Our Use Case:* We want to be able to use DDL to define data sources and sinks for Beam pipelines, so that users don't have to wrap SQL into custom code which configures sources/sinks. *What we have already:* We have a customized CREATE TABLE statement which allows users to specify the type of the data source, its schema, and data location. The implmentation is based on Calcite DDL extensions. *What we're missing:* We need to be able to define schemas with non-primitive types, e.g. arrays or rows, so that we can correctly describe data sources and sinks which supports such types. For example if we want to manipulate data in a stream of JSON objects, we want to be able to describe the JSON contents somehow, including arrays or nested objects. Or we would need similar types to interact with BigQuery which supports arrays and nested struct types. *Problem:* I tried to check if it is possible to extend the parser using the config.fmpp approach, so that we can hook into the Parser.TypeName() <https://github.com/apache/calcite/blob/a5d520df76602d25ed66627f08f5e0db4d048a77/core/src/main/codegen/templates/Parser.jj#L4439> method and parse the complex types ourselves. But Parser.DataType() <https://github.com/apache/calcite/blob/a5d520df76602d25ed66627f08f5e0db4d048a77/core/src/main/codegen/templates/Parser.jj#L4377> creates SqlDataTypeSpec only in two specific ways, without ability to extend it, so even if we parse the typename ourselves, we would not be able to construct the SqlDataTypeSpec in a way that supports arrays/rows. But even if we could, looking at SqlDataTypeSpec <https://github.com/apache/calcite/blob/09be7e74a6a4d1b1c4f640c8e69b5ebdd467d811/core/src/main/java/org/apache/calcite/sql/SqlDataTypeSpec.java#L327> it seems that it does not support creating arrays or rows as well: it calls typeFactory.createSqlType(typename) <https://github.com/apache/calcite/blob/09be7e74a6a4d1b1c4f640c8e69b5ebdd467d811/core/src/main/java/org/apache/calcite/sql/SqlDataTypeSpec.java#L350> which only <https://github.com/apache/calcite/blob/f47465236b7650f2280092b708fa39062fe79ffd/core/src/main/java/org/apache/calcite/sql/type/SqlTypeFactoryImpl.java#L49> creates basic types in this call. *Path forward:* It the above is correct, then it appears that we would need to patch Calcite in couple of places to support arrays, rows, and maps in DDL: - update Parser.jj to support parsing the type definitions for the required types and constructing SqlDataTypeSpec correctly for those cases; - update SqlDataTypeSpec.java to handle complex types and invoke correct typeFactory interfaces; *Questions:* - does the above sound sane/correct? - is there a similar work already tracked in Calcite somewhere? I saw something mentioned in CALCITE-2045 <https://issues.apache.org/jira/browse/CALCITE-2045?focusedCommentId=16351203&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16351203>, but didn't see any tracking Jiras specifically for this work yet; - is there a known/recommended/working syntax for such DDL? If there is none, then would it make sense to adopt something similar to BigQuery STRUCT/ARRAY definition <https://cloud.google.com/bigquery/docs/data-definition-language> ? Thank you, Anton
