paul-rogers opened a new pull request, #12835:
URL: https://github.com/apache/druid/pull/12835

   Refactors the `DruidSchema` and `DruidTable` abstractions to prepare for the 
Druid Catalog.
   
   As we add the catalog, we’ll want to combine physical segment metadata 
information with “hints” provided by the catalog. This is best done if we tidy 
up the existing code to more clearly separate responsibilities.
   
   This PR is purely a refactoring: it changes no user-visible functionality or 
external APIs. Functional changes will come later as we add the catalog itself.
   
   ### `DruidSchema`
   
   In the present code, `DruidSchema` does three tasks:
   
   * Holds the segment metadata cache
   * Interfaces with an external schema manager
   * Acts as a schema to Calcite
   
   This PR splits those responsibilities.
   
   * `DruidSchema` holds the Calcite schema for the `druid` namespace, 
combining information from the segment metadata cache, from the external schema 
manager, and (later) from the catalog.
   * `SegmentMetadataCache` holds the segment metadata cache formerly in 
`DruidSchema`.
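
   The split can be pictured with a minimal, hypothetical sketch. The real 
classes are far richer; the `PhysicalTable` record, the `resolve` method, and 
the constructor shapes here are invented purely for illustration:

   ```java
   import java.util.Map;
   import java.util.Optional;
   import java.util.concurrent.ConcurrentHashMap;

   // Illustrative stand-in for the physical table info the cache holds.
   record PhysicalTable(String datasource) {}

   // Owns the segment metadata formerly embedded in DruidSchema.
   class SegmentMetadataCache {
     private final Map<String, PhysicalTable> tables = new ConcurrentHashMap<>();

     void put(PhysicalTable table) { tables.put(table.datasource(), table); }

     Optional<PhysicalTable> resolve(String name) {
       return Optional.ofNullable(tables.get(name));
     }
   }

   // DruidSchema now composes the cache rather than implementing it, leaving
   // room to also consult the external schema manager and, later, the catalog.
   class DruidSchema {
     private final SegmentMetadataCache cache;

     DruidSchema(SegmentMetadataCache cache) { this.cache = cache; }

     PhysicalTable getTable(String name) { return cache.resolve(name).orElse(null); }
   }
   ```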
   
   ### `DruidTable`
   
   The present `DruidTable` class is a bit of a kitchen sink: it represents all 
the various kinds of tables which Druid supports, and uses if-statements to 
handle behavior that differs between types. Yet any given `DruidTable` instance 
handles only one such table type. To model the actual table types more clearly, 
we split `DruidTable` into several classes:
   
   * `DruidTable` becomes an abstract base class to hold Druid-specific methods.
   * `DatasourceTable` represents a datasource.
   * `ExternalTable` represents an external table, such as from `EXTERN` or 
(later) from the catalog.
   * `InlineTable` represents the internal case in which we attach data 
directly to a table.
   * `LookupTable` represents Druid’s lookup table mechanism.
   
   The new subclasses are more focused: because each represents just one table 
type, each can be selective about the data it holds and the predicates it 
implements. This will be important because the catalog information differs by 
table type, and the new structure makes adding that logic cleaner.
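
   A minimal sketch of the idea, with a hypothetical predicate chosen only to 
show how per-type behavior replaces if-statements (the real classes and methods 
differ):

   ```java
   // Each subclass answers type-specific questions itself, instead of a
   // single class branching on a type flag.
   abstract class DruidTable {
     abstract boolean isJoinable();   // illustrative predicate, not the real API
   }

   class DatasourceTable extends DruidTable {
     @Override boolean isJoinable() { return false; }
   }

   class LookupTable extends DruidTable {
     // Lookup tables participate in joins, so this subclass answers differently.
     @Override boolean isJoinable() { return true; }
   }
   ```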
   
   ### `DatasourceMetadata`
   
   Previously, the `DruidSchema` segment cache would work with `DruidTable` 
objects. With the catalog, we need a layer between the segment metadata and the 
table as presented to Calcite. To fix this, the new `SegmentMetadataCache` 
class uses a new `DatasourceMetadata` class as its cache entry to hold only the 
“physical” segment metadata information: it is up to the `DruidTable` to 
combine this with the catalog information in a later PR.
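
   The layering can be sketched roughly as follows. The field names are 
invented for illustration; the real `DatasourceMetadata` carries much richer 
segment-level detail:

   ```java
   import java.util.List;

   // Cache entry: physical segment metadata only, with no catalog knowledge.
   record DatasourceMetadata(String datasource, List<String> physicalColumns) {}

   // The table layer will merge physical columns with catalog hints (later PR).
   class DatasourceTable {
     private final DatasourceMetadata physical;

     DatasourceTable(DatasourceMetadata physical) { this.physical = physical; }

     List<String> columnNames() {
       // For now, purely physical; catalog-declared columns merge in here later.
       return physical.physicalColumns();
     }
   }
   ```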
   
   ### More Efficient Table Resolution
   
   Calcite provides a convenient base class for schema objects: 
`AbstractSchema`. However, this class is a bit *too* convenient: all we have to 
do is provide a map of tables and Calcite does the rest. This means that, to 
resolve any single datasource, say, `foo`, we need to cache segment metadata, 
external schema information, and catalog information for *all* tables. Just so 
Calcite can do a map lookup.
   
   There is nothing special about `AbstractSchema`. We can handle table lookups 
ourselves; the new `AbstractTableSchema` does just that. In fact, all the rest 
of Calcite needs is to resolve individual tables by name and, for commands we 
don’t use, to provide a list of table names.
   
   `DruidSchema` now extends `AbstractTableSchema`. `SegmentMetadataCache` 
resolves individual tables (and provides table names).
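
   The contract can be sketched as below. This is simplified: the real Calcite 
`Schema` interface has many more methods, and `Object` here stands in for 
Calcite's `Table`:

   ```java
   import java.util.Map;
   import java.util.Set;

   // Only the two operations the planner paths actually need.
   abstract class AbstractTableSchema {
     abstract Object getTable(String name);   // resolve one table, on demand
     abstract Set<String> getTableNames();    // for commands that list tables
   }

   // A toy schema that resolves tables one at a time instead of
   // materializing metadata for every table up front.
   class DemoSchema extends AbstractTableSchema {
     private final Map<String, Object> source;

     DemoSchema(Map<String, Object> source) { this.source = source; }

     @Override Object getTable(String name) { return source.get(name); }

     @Override Set<String> getTableNames() { return source.keySet(); }
   }
   ```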
   
   ### `DruidSchemaManager` 
   
   `DruidSchemaManager` provides a way to specify table schemas externally. In 
this sense, it is similar to the catalog, but only for datasources. It 
originally followed the `AbstractSchema` pattern: it provides a map of tables. 
This PR adds new optional methods for the table lookup and table
names operations. The default implementations work the same way that 
`AbstractSchema` works: we get the entire map and pick out the information we 
need. Extensions that use this API should be revised to support the individual 
operations instead. Druid code no longer calls the original `getTables()` 
method.
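
   The default-method pattern described above looks roughly like this sketch 
(the real interface's signatures differ; names here are illustrative):

   ```java
   import java.util.Map;
   import java.util.Set;

   interface SchemaManager {
     // Legacy bulk method; Druid itself no longer calls it.
     Map<String, Object> getTables();

     // New optional per-table operations. The defaults preserve the old
     // behavior by fetching the whole map and picking out what is needed;
     // extensions should override them with direct lookups instead.
     default Object getTable(String name) { return getTables().get(name); }

     default Set<String> getTableNames() { return getTables().keySet(); }
   }
   ```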
   
   The PR has one breaking change: since the `DruidSchemaManager` map is 
read-only to the rest of Druid, it now returns a `Map`, not a `ConcurrentMap`.
   
   <hr>
   
   This PR has:
   - [X] been self-reviewed.
   - [X] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [X] added comments explaining the "why" and the intent of the code 
wherever it would not be obvious to an unfamiliar reader.
   - [X] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [ ] been tested in a test Druid cluster.
   

