paul-rogers opened a new pull request, #13360:
URL: https://github.com/apache/druid/pull/13360
Refined external table functions
This PR introduces the first step toward catalog integration with MSQ
queries: a set of refined, easier-to-use table functions.
## Simplified Table Functions
Recall that MSQ introduced the `extern` function:
```sql
SELECT
<column>
FROM TABLE(
EXTERN(
'<Druid input source>',
'<Druid input format>',
'<row signature>'
)
)
```
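For concreteness, here is an illustrative invocation (the URI and columns are made up; the three JSON strings are the standard serialized forms of a Druid input source, input format, and row signature):
```sql
SELECT *
FROM TABLE(
  EXTERN(
    '{"type": "http", "uris": ["http://example.com/bar.csv"]}',
    '{"type": "csv", "findColumnsFromHeader": false, "columns": ["x", "y", "z"]}',
    '[{"name": "x", "type": "string"}, {"name": "y", "type": "string"}, {"name": "z", "type": "long"}]'
  )
)
```
Writing (and escaping) these JSON strings by hand is exactly the friction this PR aims to remove.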
As explained in the [Druid Catalog
Proposal](https://github.com/apache/druid/issues/12546), we wish to simplify
the user experience by replacing JSON with SQL syntax. To do that, we must
provide SQL-native ways to specify the input source, the input format, and the
row signature.
The `extern` function handles all input sources. The first simplification is
to create a separate function for each supported input source. This PR
introduces three:
* `inline` for the `InlineInputSource` (type `"inline"`).
* `http` for the `HttpInputSource` (type `"http"`).
* `localFiles` for the `LocalInputSource` (type `"local"`). (The name is
`localFiles` because `local` is a SQL reserved word.)
Second, we need a way to provide the various parameters for each input
source. We do that with SQL named arguments. Example:
```sql
FROM TABLE(
http(
userName => 'bob',
password => 'secret',
    uris => 'http://foo.com/bar.csv',
format => 'csv'
)
)
```
Arguments for the various input formats appear as optional function
arguments; the `format` argument selects which format to use.
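For example, a format-specific option can ride along with the `format` argument. (A sketch: the exact optional argument names come from the catalog's property definitions for each format, so treat `delimiter` here as illustrative.)
```sql
FROM TABLE(
  http(
    uris => 'http://foo.com/bar.tsv',
    format => 'tsv',
    -- Illustrative format-specific optional argument; the real name is
    -- defined by the catalog metadata for the format.
    delimiter => '|'
  )
)
```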
Finally, we need a way to specify the schema. As it turns out, Calcite
introduced the `EXTEND` clause for Apache Phoenix (though it is not clear
whether Phoenix ever actually used it). We repurpose that clause for our needs:
```sql
FROM TABLE(
http(
userName => 'bob',
password => 'secret',
    uris => 'http://foo.com/bar.csv',
format => 'csv'
)
) EXTEND (x VARCHAR, y VARCHAR, z BIGINT)
```
The `EXTEND` list is simply a list of column definitions in standard SQL
form.
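Putting the pieces together, here is a sketch using the `inline` function with `EXTEND`. (Assumption: `inline` exposes a `data` argument mirroring the `InlineInputSource` JSON property, shown here as an array of CSV lines.)
```sql
SELECT x, y, z
FROM TABLE(
  inline(
    -- Assumed argument name, mirroring InlineInputSource's "data" property.
    data => ARRAY['a,b,1', 'c,d,2'],
    format => 'csv'
  )
) EXTEND (x VARCHAR, y VARCHAR, z BIGINT)
```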
### Implementation Notes
The set of arguments available for each function is defined by the catalog
"meta-metadata" introduced in the recent [Catalog
Basics](https://github.com/apache/druid/pull/13165) PR. A set of "operator
conversion" classes create SQL parameter definitions based on the property
definitions of each external table type defined by the catalog. Thinking ahead,
this same metadata will allow us to parameterize catalog tables in the next PR.
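As a purely hypothetical sketch of where this is headed (nothing in this PR implements it), a catalog-defined external table could itself act as a table function, with the query supplying only the properties the catalog entry leaves unset:
```sql
-- Hypothetical: "myInput" is a partial external table defined in the catalog;
-- the query fills in the remaining "uris" property at query time.
SELECT x, y
FROM TABLE(myInput(uris => 'http://foo.com/march.csv'))
```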
The `EXTEND` notation required some amount of Calcite trickery. The Calcite
code expects syntax of the form:
```text
<TABLENAME> EXTEND <extend-list>
```
However, we want something of the form:
```text
TABLE ( <name> ( <args> ) ) EXTEND <extend-list>
```
We must change the parser to achieve the above. This is done via a Python
script, `edit-parser.py`, which modifies the parser during the code generation
phase. The parser-editing approach was taken because it is less intrusive than
an approach based on copying and replacing the relevant bits of the Calcite
grammar.
A set of custom extensions to Calcite classes effectively "hides" our usage
from Calcite by passing the schema behind Calcite's back. Specifically, we
rewrite the `EXTEND` node to copy the schema into a temporary operator node,
then discard the `EXTEND` node so that it does not confuse Calcite.
The plumbing required for the above also made it possible to cache the
external table produced by the table function, avoiding Calcite's normal
behavior of recreating the external table multiple times during planning.
## Calcite Test Revisions
This PR continues the effort to enhance the Calcite test framework.
* `SqlTestFramework` moves another step closer to being driven by Guice. A
new method, `configureGuice()`, lets tests use Druid modules to bring in
dependencies. This method replaces a large amount of ad-hoc code that
previously existed in Calcite tests.
* `SqlTestFramework` also adds an `applyProperties()` method to let tests set
property values to be used during planning. The method is put to work in
several aggregate tests to select the aggregator to use for `COUNT DISTINCT`,
again replacing ad-hoc code.
* Several previous `SqlTestFramework` methods are removed in favor of the
above.
* A new `PlannerCaptureHook`, patterned after Calcite's `QueryHook` class,
captures information from the Druid planner for use in test verification.
* The `QueryTestBuilder` and runner provide a way to compare the Calcite
logical plan against an expected value. The plan is captured using the hook
above.
## Release Notes
Release notes should mention the availability of the new functions. This is
not the entire set of input sources: the most important one (S3) is not yet
converted. Consult the documentation included in this PR for the details.
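For context in the release notes, an end-to-end MSQ ingest using the new syntax might look like the following sketch (table and column names are illustrative):
```sql
-- Illustrative MSQ ingest: read CSV over HTTP, declare the input schema
-- with EXTEND, and write to a datasource partitioned by day.
INSERT INTO myTable
SELECT
  TIME_PARSE(ts) AS __time,
  x,
  y
FROM TABLE(
  http(
    uris => 'http://foo.com/bar.csv',
    format => 'csv'
  )
) EXTEND (ts VARCHAR, x VARCHAR, y BIGINT)
PARTITIONED BY DAY
```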
It turns out that serialization of the `LocalInputSource` is broken: it is
unlikely that any use of this class has ever worked in practice. See [Issue
#13359](https://github.com/apache/druid/issues/13359) for details.
## Other Changes
The location of the generated SQL parser changed from
`target/generated-sources/annotations` to `target/generated-sources`. The
Maven phase in which we generate the parser changed from `initialize` to
`generate-sources`.
<hr>
This PR has:
- [X] been self-reviewed.
- [X] added documentation for new or modified features or behaviors.
- [X] a release note entry in the PR description.
- [X] added Javadocs for most classes and all non-trivial methods. Linked
related entities via Javadoc links.
- [X] added comments explaining the "why" and the intent of the code
wherever would not be obvious for an unfamiliar reader.
- [X] added unit tests or modified existing tests to cover new code paths,
ensuring the threshold for [code
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
is met.