[
https://issues.apache.org/jira/browse/CALCITE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15977238#comment-15977238
]
Maryann Xue commented on CALCITE-1748:
--------------------------------------
bq. Calcite has the assumption that a full schema tree is always available.
I don't think this is still true with {{SimpleCalciteSchema}}, which is
designed to load sub-schemas, tables, and functions on the fly. But there are some
other issues {{SimpleCalciteSchema}} has not addressed yet.
I think Phoenix has the same requirement as Drill in terms of schema volatility
and wants to achieve the goal of CALCITE-1748 as well. What Phoenix does right
now is use {{SimpleCalciteSchema}} (since we don't have a pre-loaded schema
tree either) and maintain a read-consistent view within Phoenix's own
{{Schema}} implementation using a sub-schema map and a table map. Now the problem
is when and how to update the maps if a DDL statement has changed the schema
objects, e.g., DROP a table, DROP a sub-schema, ALTER a table, etc. As a
workaround, Phoenix uses the HOOK to clear the maps at the beginning of a
new statement, which 1) is tricky and 2) as Julian pointed out, could be
faulty because multiple statements can be alive at the same time.
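To make the second concern concrete, here is a minimal sketch of the hazard. It is not Phoenix's actual code: {{SharedCachingSchema}}, {{onStatementStart}}, and the use of plain strings in place of table objects are all hypothetical, chosen only to show what happens when a map shared by the whole connection is cleared while an earlier statement is still alive.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for a Schema implementation that caches tables in a
// map shared by every statement on the connection.
class SharedCachingSchema {
  private final Map<String, Object> tableMap = new HashMap<>();
  private final Function<String, Object> loader;

  SharedCachingSchema(Function<String, Object> loader) {
    this.loader = loader;
  }

  // Assumed to be called from a hook at the beginning of each new statement.
  void onStatementStart() {
    tableMap.clear();
  }

  // Loads the table on first use and returns the cached object afterwards.
  Object getTable(String name) {
    return tableMap.computeIfAbsent(name, loader);
  }
}

class SharedCacheHazardDemo {
  public static void main(String[] args) {
    int[] loads = {0};
    SharedCachingSchema schema =
        new SharedCachingSchema(name -> name + "@load" + (++loads[0]));
    schema.onStatementStart();                // statement 1 begins
    Object first = schema.getTable("A");      // statement 1 resolves "A"
    schema.onStatementStart();                // statement 2 begins and clears the map
    Object second = schema.getTable("A");     // statement 1 resolves "A" again
    System.out.println(first.equals(second)); // prints false
  }
}
```

Statement 1 ends up with two different objects for the same name, so its view is no longer read-consistent once a second statement starts.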
I think there are several things here:
1) Explicit schema objects vs. implicit schema objects: Explicit schema objects
are initialized when the Connection is created and should probably have the same
life cycle as the Connection. Explicit schema objects are usually added
explicitly through "addXXX" calls or with MODEL. What we focus on right now is
the implicit schema objects which we can choose to load dynamically and are
obtained by "getXXX" methods. Calcite should not rely on methods like
"getSubSchemaMap()" or "getTableMap()" when trying to validate a sub-schema or
a table, which I think is already good with {{SimpleCalciteSchema}}.
2) Read-consistent view within a Statement: Although we can choose to load a
schema object dynamically, we should always assume that the schema tree of each
Statement is a "snapshot" of a certain instant in time. For example, when
querying a table named "A", we should always get the same Table object (or
null if "A" does not exist). The same applies to sub-schemas.
3) Schema updates visible to a new Statement: Any change made to the schema
should be reflected in the schema tree represented in the Statement that is
created after that change happens. Failing to do so (like in CALCITE-1742)
would make things look like the objects were being cached.
I'd like to propose a solution here based on the discussion Julian and I had
last week:
1) One root schema per Connection, holding the explicit schema objects.
2) A new root schema (distinct from the Connection's root schema) per
Statement, with a snapshot copy of the explicit objects from the Connection's
root schema.
3) Implement read-consistency management in {{CalciteSchema}} using maps for
each type of implicit schema objects. Since we'll now have one root schema per
Statement, we don't have to worry about "update" or "delete" of these maps. We
only need to add to the maps every time the underlying {{Schema}} implementor
returns a new object or null, to make sure that we can get the exact same
answer next time this object name is queried.
4) Add an optional "timestamp" parameter in the signature of
{{SchemaFactory.create}}, indicating what time the schema snapshot should
represent. Note that even without this parameter, read consistency is
guaranteed by 3) already.
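Points 2) and 3) could look roughly like the sketch below. This is only an illustration under stated assumptions: {{StatementSchemaView}} is a hypothetical class, not {{CalciteSchema}}; tables are modelled as plain objects; and the underlying {{Schema}} implementation is reduced to a name-to-table lookup function. The key detail is that a "not found" answer is remembered too, so a table created mid-statement does not suddenly appear.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical per-statement view over an underlying Schema implementation.
// Names and types are illustrative; this is not Calcite's CalciteSchema.
class StatementSchemaView {
  // Snapshot copy of the connection's explicit tables, taken at creation.
  private final Map<String, Object> explicitTables;
  // First answer seen for each implicit table name. Optional.empty() records
  // "does not exist", so a null answer is remembered as well.
  private final Map<String, Optional<Object>> implicitTables = new HashMap<>();
  // Stand-in for the dynamic lookup done by the Schema implementation.
  private final Function<String, Object> underlying;

  StatementSchemaView(Map<String, Object> connectionExplicitTables,
      Function<String, Object> underlying) {
    this.explicitTables = new HashMap<>(connectionExplicitTables);
    this.underlying = underlying;
  }

  Object getTable(String name) {
    Object explicit = explicitTables.get(name);
    if (explicit != null) {
      return explicit;
    }
    // The map only ever grows: there is no update or delete path, because
    // the view is discarded together with its statement.
    return implicitTables
        .computeIfAbsent(name, n -> Optional.ofNullable(underlying.apply(n)))
        .orElse(null);
  }
}
```

A statement would create one such view when it starts and drop it when it finishes. A later statement builds a fresh view and therefore sees any DDL changes made in between.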
This solution would introduce only one change into the Schema SPI, which is the
optional "timestamp" parameter in {{SchemaFactory.create}}. Most of the other
changes will go into {{CalciteSchema}}. I believe that if I have captured the
requirements of both Drill and Phoenix correctly, we will be able to do
everything with the Schema SPI only, without having to override
{{CalciteCatalogReader}}. Any thoughts? Let me know if I have missed something.
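For discussion, one possible shape of the optional "timestamp" parameter is sketched below. Everything here is an assumption: {{SketchSchema}} and {{SketchSchemaFactory}} are stand-ins so that the example compiles without Calcite on the classpath, the three-argument method only mirrors the shape of the existing {{SchemaFactory.create(parentSchema, name, operand)}}, and the choice of a {{long}} for the timestamp is a guess. Making the new overload a default method is one way to keep the parameter optional for existing implementations.

```java
import java.util.Map;

// Stand-in for the Schema SPI type; not the real org.apache.calcite class.
interface SketchSchema {}

// Hypothetical factory mirroring the shape of SchemaFactory.create.
interface SketchSchemaFactory {
  // Same shape as the existing three-argument create method.
  SketchSchema create(SketchSchema parentSchema, String name,
      Map<String, Object> operand);

  // Proposed variant (assumed signature). Implementations that do not care
  // about the snapshot instant inherit this default and ignore the timestamp.
  default SketchSchema create(SketchSchema parentSchema, String name,
      Map<String, Object> operand, long timestamp) {
    return create(parentSchema, name, operand);
  }
}

// Example schema that remembers which instant it is meant to represent.
class AsOfSchema implements SketchSchema {
  final long asOf;

  AsOfSchema(long asOf) {
    this.asOf = asOf;
  }
}

// Example factory that honours the timestamp when one is supplied.
class AsOfSchemaFactory implements SketchSchemaFactory {
  @Override public SketchSchema create(SketchSchema parentSchema, String name,
      Map<String, Object> operand) {
    // No timestamp given: fall back to "now".
    return create(parentSchema, name, operand, System.currentTimeMillis());
  }

  @Override public SketchSchema create(SketchSchema parentSchema, String name,
      Map<String, Object> operand, long timestamp) {
    return new AsOfSchema(timestamp);
  }
}
```

With this shape, a factory written against the three-argument method keeps working unchanged, while a store that supports point-in-time reads can override the four-argument method.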
> Make CalciteCatalogReader.getSchema extendable to support dynamically load
> schema tree - getSchema need to be set to protected to allow overriding
> --------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: CALCITE-1748
> URL: https://issues.apache.org/jira/browse/CALCITE-1748
> Project: Calcite
> Issue Type: Bug
> Reporter: Chunhui Shi
> Assignee: Julian Hyde
>
> In a system like Drill, there is a need to load a partial schema (e.g. for only
> one storage plugin) only when needed. Drill has no way to get a full
> available schema tree beforehand, nor can it cache the available schema for
> a storage plugin (e.g. Hive, MongoDB), since the storage plugin may not have a
> notification mechanism to update the schema tree in a timely manner.
>
> The proposed fix is to load schema dynamically as shown in
> https://issues.apache.org/jira/browse/DRILL-5089
> To achieve this, we need to make CalciteCatalogReader.getSchema
> protected so that it can be overridden by a derived class, while the derived
> class can reuse the other functionality in CalciteCatalogReader:
> private CalciteSchema getSchema(Iterable<String> schemaNames,
> SqlNameMatcher nameMatcher)