paul-rogers commented on issue #12546:
URL: https://github.com/apache/druid/issues/12546#issuecomment-1182489512

   ## Extended Table Functions
   
   The present version of Druid uses a Calcite feature to specify an ingest 
input table:
   
   ```sql
   INSERT INTO dst SELECT *
   FROM TABLE(extern(
     '{
       "type": "inline",
       "data": "a,b,1\nc,d,2\n"
      }',
     '{
       "type": "csv",
       "columns": ["x","y","z"],
       "listDelimiter": null,
       "findColumnsFromHeader": false,
       "skipHeaderRows": 0
     }',
     '[
       {"name": "x", "type": "STRING"},
       {"name": "y", "type": "STRING"},
       {"name": "z", "type": "LONG"}
     ]'
   ))
   PARTITIONED BY ALL TIME
   ```
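   To make the "JSON encoded twice" point concrete, here is a minimal Python sketch of what submitting such a statement over REST looks like. The endpoint path and payload field name are assumptions for illustration, not a spec of Druid's API:

   ```python
   import json

   # The SQL statement already embeds JSON strings (input source, input
   # format, signature). Note the doubled backslash: the JSON "data" field
   # itself contains the two-character escape \n.
   sql = """INSERT INTO dst SELECT *
   FROM TABLE(extern(
     '{"type": "inline", "data": "a,b,1\\nc,d,2\\n"}',
     '{"type": "csv", "columns": ["x","y","z"]}',
     '[{"name": "x", "type": "STRING"}]'
   ))
   PARTITIONED BY ALL TIME"""

   # Submitting over REST (e.g. a POST to something like /druid/v2/sql/task)
   # wraps the whole statement in JSON again, so every quote and escape in
   # the embedded JSON is escaped a second time.
   payload = json.dumps({"query": sql})
   ```

   After this second encoding, `payload` contains sequences like `\"` and `\\\\n` for what started as plain quotes and newlines, which is the awkwardness discussed below.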
   
   As it turns out, SQL (and Calcite) allow the use of named parameters. We can 
rewrite the above as follows. Notice the `name => value` syntax:
   
   ```sql
   INSERT INTO dst SELECT *
   FROM TABLE(extern(
     inputSource => '{
       "type": "inline",
       "data": "a,b,1\nc,d,2\n"
     }',
     inputFormat => '{
       "type": "csv",
       "columns": ["x","y","z"],
       "listDelimiter": null,
       "findColumnsFromHeader": false,
       "skipHeaderRows": 0
     }',
     signature => '[
       {"name": "x", "type": "STRING"},
       {"name": "y", "type": "STRING"},
       {"name": "z", "type": "LONG"}
     ]'
   ))
   PARTITIONED BY ALL TIME
   ```
   
   The above is great, but can be a bit awkward: we have to encode JSON in SQL 
(which, when we send it via the REST API, we encode again in JSON). Let's see 
how we can use SQL named parameters to streamline the syntax (and set ourselves 
up for the catalog). SQL requires that parameter names be "simple identifiers": 
that is, no dots. So, we can't just say:
   
   ```text
     "inputSource.type" => "inline"
   ```
   
   Instead, we have to "flatten" the names. That is, define SQL names that, 
internally, we map to the JSON names. The mapping is just code, so we omit the 
details here. Suppose we do the mapping. We now have a different set of 
arguments, so we need a different function. For now, let's call it `staged`.
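   As a rough illustration of that mapping (the flat names and JSON paths below are hypothetical, not Druid's actual code):

   ```python
   # Hypothetical mapping from flat SQL parameter names to (object, key)
   # paths in the nested JSON spec that extern expects.
   PARAM_PATHS = {
       "source": ("inputSource", "type"),
       "data":   ("inputSource", "data"),
       "format": ("inputFormat", "type"),
   }

   def flatten_to_json(args):
       """Rebuild the nested JSON objects from flat name => value arguments."""
       result = {}
       for name, value in args.items():
           obj, key = PARAM_PATHS[name]
           result.setdefault(obj, {})[key] = value
       return result

   spec = flatten_to_json({"source": "inline",
                           "data": "a,b,1\nc,d,2\n",
                           "format": "csv"})
   # spec == {"inputSource": {"type": "inline", "data": "a,b,1\nc,d,2\n"},
   #          "inputFormat": {"type": "csv"}}
   ```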
   
   We also need a way to specify the input table schema. Here we borrow another 
bit of Calcite functionality, the `EXTEND` clause, which was added for Apache 
Phoenix. We modify the syntax a bit to fit our needs. The result:
   
   ```sql
   INSERT INTO dst SELECT *
   FROM TABLE(staged(
     source => 'inline',
     data => 'a,b,1
   c,d,2
   ',
     format => 'csv'
     ))
     (x VARCHAR NOT NULL, y VARCHAR NOT NULL, z BIGINT NOT NULL)
   PARTITIONED BY ALL TIME
   ```
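   The `EXTEND` column list carries the same information as the `signature` JSON in the earlier `extern` form. A sketch of the translation (the SQL-to-Druid type table below is illustrative; the real mapping covers more types):

   ```python
   # Illustrative mapping from SQL types in the EXTEND clause to Druid
   # column types; Druid's actual mapping may differ in coverage.
   SQL_TO_DRUID = {"VARCHAR": "STRING", "BIGINT": "LONG", "DOUBLE": "DOUBLE"}

   def extend_to_signature(columns):
       """Convert (name, sql_type) pairs into the signature JSON structure."""
       return [{"name": name, "type": SQL_TO_DRUID[sql_type]}
               for name, sql_type in columns]

   sig = extend_to_signature([("x", "VARCHAR"), ("y", "VARCHAR"), ("z", "BIGINT")])
   # sig == [{"name": "x", "type": "STRING"},
   #         {"name": "y", "type": "STRING"},
   #         {"name": "z", "type": "LONG"}]
   ```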
   
   Notice how the keywords in the `staged` function arguments match the 
`properties` defined in the REST call in the prior section. That is not an 
accident. That sets us up to merge the two ideas in the next update.

