paul-rogers opened a new pull request, #13360:
URL: https://github.com/apache/druid/pull/13360

   Refined external table functions
   
   This PR introduces the first step toward catalog integration with MSQ 
queries: a set of refined, easier-to-use table functions.
   
   ## Simplified Table Functions
   
   Recall that MSQ introduced the `extern` function:
   
   ```sql
   SELECT
    <column>
   FROM TABLE(
     EXTERN(
       '<Druid input source>',
       '<Druid input format>',
       '<row signature>'
     )
   )
   ```
   
   As explained in the [Druid Catalog 
Proposal](https://github.com/apache/druid/issues/12546), we wish to make the 
user experience simpler by avoiding the use of JSON. We wish to use SQL syntax 
instead. To do that, we must provide a solution for the input source, input 
format and row signature.
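For concreteness, a JSON-based `extern` call looks roughly like the following (property values are illustrative):

```sql
SELECT x, y, z
FROM TABLE(
  EXTERN(
    '{"type": "http", "uris": ["http://example.com/data.csv"]}',
    '{"type": "csv", "findColumnsFromHeader": false}',
    '[{"name": "x", "type": "string"}, {"name": "y", "type": "string"}, {"name": "z", "type": "long"}]'
  )
)
```

The three JSON strings must be hand-written and kept in sync with one another, which is exactly the friction the refined functions remove.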
   
   The `extern` function handles all input sources. The first simplification is 
to create functions for each supported input source. This PR introduces three:
   
   * `inline` for the `InlineInputSource` (type `"inline"`).
   * `http` for the `HttpInputSource` (type `"http"`).
   * `localFiles` for the `LocalInputSource` (type `"local"`). (Note that the 
name is `localFiles` because `local` is a SQL reserved word.)
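For example, a call to the `inline` function might look like the following (the argument names here are illustrative; consult the documentation included in this PR for the exact parameter list):

```sql
FROM TABLE(
  inline(
    data => ARRAY['a,b,1', 'c,d,2'],
    format => 'csv'
  )
)
```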
   
   Second, we need a way to provide the various parameters for each input 
source. We do that with SQL named arguments. Example:
   
   ```sql
   FROM TABLE(
     http(
       userName => 'bob',
       password => 'secret',
    uris => 'http://foo.com/bar.csv',
       format => 'csv'
       )
     )
   ```
   
   The properties of the various input formats are defined as optional arguments; the `format` argument selects the input format.
   
   Finally, we need a way to specify the schema. As it turns out, Calcite introduced the `EXTEND` clause for Apache Phoenix (though it is not clear whether Phoenix ever actually used it). We repurpose that clause for our needs:
   
   ```sql
   FROM TABLE(
     http(
       userName => 'bob',
       password => 'secret',
    uris => 'http://foo.com/bar.csv',
       format => 'csv'
       )
     ) EXTEND (x VARCHAR, y VARCHAR, z BIGINT)
   ```
   
   The `EXTEND` list is simply a list of column definitions in standard SQL 
form.
   
   ### Implementation Notes
   
   The set of arguments available for each function is defined by the catalog 
"meta-metadata" introduced in the recent [Catalog 
Basics](https://github.com/apache/druid/pull/13165) PR. A set of "operator 
conversion" classes create SQL parameter definitions based on the property 
definitions of each external table type defined by the catalog. Thinking ahead, 
this same metadata will allow us to parameterize catalog tables in the next PR.
   
   The `EXTEND` notation required some amount of Calcite trickery. The Calcite 
code expects syntax of the form:
   
   ```text
    <table-name> EXTEND <extend-list>
   ```
   
   However, we want something of the form:
   
   ```text
    TABLE ( <name> ( <args> ) ) EXTEND <extend-list>
   ```
   
   We must change the parser to achieve the above. This is done via a Python 
script, `edit-parser.py`, which modifies the parser during the code generation 
phase. The parser-editing approach was taken because it is less intrusive than 
an approach based on copying and replacing the relevant bits of the Calcite 
grammar.
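The editing approach can be illustrated with a small, hypothetical sketch. The production names and patterns below are made up; the real `edit-parser.py` operates on the generated Calcite grammar files:

```python
# Hypothetical sketch of a parser-editing step in the spirit of edit-parser.py.
# The production names and patterns are invented for illustration; the real
# script edits the generated Calcite grammar during the code-generation phase.
import re

def edit_parser(grammar_text: str) -> str:
    """Splice an optional EXTEND clause after a table-function call in a
    (simplified, invented) grammar production."""
    return re.sub(
        r"TableFunctionCall\(\)",
        "TableFunctionCall() [ ExtendClause() ]",
        grammar_text,
    )

before = "e = TableFunctionCall() { return e; }"
print(edit_parser(before))
```

The same idea scales to the real grammar: locate a known production by pattern, splice in the extra clause, and write the file back before the parser generator runs.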
   
   A set of custom extensions to Calcite classes effectively "hides" our usage 
from Calcite by passing the schema behind Calcite's back. Specifically, we 
rewrite the `EXTEND` node to copy the schema into a temporary operator node, 
then discard the `EXTEND` node so that it does not confuse Calcite.
   
   The plumbing required for the above also made it possible to cache the 
external table produced by the table function, avoiding the normal Calcite 
behavior of recreating the external table multiple times during planning.
   
   ## Calcite Test Revisions
   
   This PR continues the effort to enhance the Calcite test framework.
   
   * `SqlTestFramework` moves another step closer to being driven by Guice. A 
new method, `configureGuice()`, allows using Druid modules to bring in 
dependencies. This method replaces a large amount of ad-hoc code that 
previously existed in Calcite tests.
   * `SqlTestFramework` also adds an `applyProperties(.)` method to let tests 
set up property values to be used in tests. The method is put to work in 
several aggregate tests to select the aggregator to use for `COUNT DISTINCT`, 
replacing additional ad-hoc code.
   * Several previous `SqlTestFramework` methods are removed in favor of the 
above.
   * A new `PlannerCaptureHook`, patterned after Calcite's `QueryHook` class, 
captures information from the Druid planner for use in test verification.
   * The `QueryTestBuilder` and runner provide a way to compare the Calcite 
logical plan against an expected value. The plan is captured using the hook 
above.
   
   ## Release Notes
   
   Release notes should mention the availability of the new functions. This 
is not the entire set of input sources: the most important one (S3) is not yet 
converted. Consult the documentation included in this PR for the details.
   
   It turns out that serialization of the `LocalInputSource` is broken: no use 
of this class has likely worked in practice. See [Issue 
#13359](https://github.com/apache/druid/issues/13359) for details.
   
   ## Other Changes
   
   The location of the generated parser in the SQL module changed from 
`target/generated-sources/annotations` to `target/generated-sources`. The Maven 
phase in which we generate the parser changed from `initialize` to 
`generate-sources`.
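The binding takes roughly this shape in the build (an illustrative fragment, not the exact `pom.xml` from this PR):

```xml
<execution>
  <id>generate-parser</id>
  <!-- Bound to generate-sources rather than initialize -->
  <phase>generate-sources</phase>
  <goals>
    <goal>javacc</goal>
  </goals>
</execution>
```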
   
   <hr>
   
   This PR has:
   
   - [X] been self-reviewed.
   - [X] added documentation for new or modified features or behaviors.
   - [X] a release note entry in the PR description.
   - [X] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [X] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [X] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   

