[GitHub] [druid] paul-rogers commented on pull request #12851: fix nested column sql operator return type inference

GitBox Tue, 02 Aug 2022 17:24:54 -0700


paul-rogers commented on PR #12851:
URL: https://github.com/apache/druid/pull/12851#issuecomment-1203345602


   @clintropolis, thanks for this fix, you've done some great foundation work 
that the catalog project will build upon.
   
   The catalog is presently focused on aggregates, but there seems to be some 
interesting synergy with the JSON columns.
   
   Let's think a bit about the actual type names. There are three types to 
consider:
   
   * The input type (what can go into a complex column such as JSON or an 
aggregate)
   * The storage type (in segments)
   * The output type of queries (after "finalizing" the aggregates).
   
   The current (very preliminary) thinking is for the catalog to use an 
"aggregate type" that describes all of the above (and the "seed", "combiner" 
and "reducer" functions.) The storage type for an aggregate is, I gather, 
either a simple type (`long` for `SUM(<long value>)` or a complex type (for 
things like histograms.) We want the SQL type to describe the aggregate itself, 
not just the storage format. So, we've been playing with type names such as 
`SUM(BIGINT)` or `HISTOGRAM(VARCHAR)`. 
   
   The idea is that the type name acts as a key into a table that gives is the 
storage type, the "seed" function (to convert a single row to an aggregate), 
the "combiner" function (used to combine aggregates), and a "reducer" (used to 
convert the aggregate to a simple value returned by queries.)
   
   The idea is that the user provides the type name (in whatever syntax we 
decide on), and the system then works out all the internal bits for both 
`INSERT` and `SELECT` statements. The existing `COMPLEX<foo>` pattern doesn't 
quite capture the info we need, hence the idea of `<agg>(<type>)`, including 
multi-argument forms.
   
   Returning to JSON, I wonder if the same ideas apply? Looking at how other 
products handle JSON, 
[SqlServer](https://docs.microsoft.com/en-us/sql/relational-databases/json/json-data-sql-server?view=sql-server-ver16)
 uses type `VARCHAR`. 
[MySQL](https://dev.mysql.com/doc/refman/8.0/en/json.html) uses type `JSON`:
   
   ```sql
   CREATE TABLE t1 (jdoc JSON)
   ```
   
   Druid has no `CREATE TABLE` statement, but the catalog will bring us a step 
closer: we need a type in the catalog that says that column `jdoc` holds JSON 
data. The suggestion here is to maybe use `COMPLEX<json>`. The hint from MySQL 
is to just use `JSON`.
   
   There is mention in the description of the row signature of a `SELECT`. This 
is also an interesting case. For that, we might want to learn from SqlServer. 
If what we return is a string, then the query output type should be a simple 
`VARCHAR` since the return value looks like a `VARCHAR`, quacks like a 
`VARCHAR` and actually *is* a `VARCHAR` to JDBC. That is, the "reduced" form of 
JSON is just a `VARCHAR`: it looses all its special "JSON-ness" when returned 
to the client.
   
   Hence, the planner has to know that, when selecting a column of type `JSON` 
(or `COMPLEX<json>`), the output type is `VARCHAR`.
   
   All of that is probably a bit outside the scope of this particular PR, but 
it seems like we'll soon need to resolve these questions for both the catalog 
and JSON projects, and that solution will leverage the work you've done here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] paul-rogers commented on pull request #12851: fix nested column sql operator return type inference

Reply via email to