Paul Rogers created DRILL-7553:
----------------------------------
Summary: Modernize type management
Key: DRILL-7553
URL: https://issues.apache.org/jira/browse/DRILL-7553
Project: Apache Drill
Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
This is a roll-up issue for our ongoing discussion around improving and
modernizing Drill's runtime type system. At present, Drill approaches types
very differently from most other databases and query tools:
* Drill does little (or no) plan-time type checking and propagation. Instead,
all type management is done at execution time, in each reader, in each
operator, and ultimately in the client.
* Drill allows structured types (Map, Dict, Arrays), but does not have the
extended SQL statements to fully utilize these types.
* Drill supports varying types: two readers can both read column {{c}}, but can
do so with different types. We've always hoped to discover some way to
reconcile the types. But, at present, the functionality is buggy and
incomplete. It is not clear that a viable solution exists. Drill also provides
"formal" varying types: Union and List. These types are also not fully
supported.
These three topics are closely related. "Schema-free" means we must infer types
at read time and so Drill cannot do plan-time type analysis of the kind done in
other engines. Because of schema-on-read (which is what "schema-free" really
means), two readers can read different types for the same fields, and so we end
up with varying or inconsistent types, and are forced to figure out some way to
manage the conflicts.
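
To make the conflict concrete, here is a minimal Java sketch of how two readers that infer types independently can disagree on column {{c}}. The class, enum and method names are hypothetical and are not Drill APIs. The inference rule (commit to the type of the first value seen) is an assumption chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;

class InferenceSketch {
    // Hypothetical type names, loosely modeled on SQL-style types.
    enum ColType { BIGINT, FLOAT8, VARCHAR }

    // Infer a type for one text value, as a schema-on-read reader might.
    static ColType inferValue(String v) {
        try {
            Long.parseLong(v);
            return ColType.BIGINT;
        } catch (NumberFormatException e) {
            // Not an integer; fall through and try floating point.
        }
        try {
            Double.parseDouble(v);
            return ColType.FLOAT8;
        } catch (NumberFormatException e) {
            return ColType.VARCHAR;
        }
    }

    // Assumption: a reader commits to the type of the first value it sees.
    // The list must be non-empty.
    static ColType inferColumn(List<String> values) {
        return inferValue(values.get(0));
    }

    public static void main(String[] args) {
        // Two files hold the "same" column c, but start with different values.
        ColType fromFileA = inferColumn(Arrays.asList("10", "20"));
        ColType fromFileB = inferColumn(Arrays.asList("10.5", "20"));
        System.out.println("reader A: c is " + fromFileA);
        System.out.println("reader B: c is " + fromFileB);
        System.out.println("conflict: " + (fromFileA != fromFileB));
    }
}
```

Neither reader is wrong in isolation. The conflict only appears downstream, when an operator receives batches from both.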
The gist of the proposal explored in this ticket is to exploit the learning
from other engines: to embrace types when available, and to impose tractable
rules when types are discovered at run time.
h4. Proposal Summary
This is very much a discussion draft. Here are some suggestions to get started.
# Set as our goal to manage types at plan time. Runtime type discovery becomes
a (limited) special case.
# Pull type resolution, propagation and checking into the planner where it can
be done once per query. Move it out of execution where it must be done multiple
times: once per operator per minor fragment. Implement the standard DB type
checking and propagation rules. (These rules are currently implicitly
implemented deep in the code gen code.)
# Generate operator code in the planner; send it to workers as part of the
physical plan (to avoid the need to generate the code on each worker.)
# Provide schema-aware extensions for storage and format plugins so that they
can advertise a schema when known. (Examples: Hive sources get schemas from
HMS, JDBC sources get schema from the underlying database, Avro, Parquet and
others obtain schema from the target files, etc.) This mechanism works with,
but is in addition to, the Drill metastore.
# Separate the concepts of "schema-free" (no plan-time schema) from
"schema-on-read" (schema is known in the planner, and data is read into that
schema by readers; e.g. the Hive model.) Drill remains schema-on-read (for
sources that need it), but does not attempt the impossible with schema-free
(that is, we no longer read inconsistent data into a relational model and hope
we can make it work.)
# For convenience, allow "schema-free" (no plan-time schema). The restriction
is that all readers *must* produce the same schema. It is a fatal (to the query)
error for an operator to receive batches with different schemas. (The reasons
can be discussed separately.)
# Preserve the Map, Dict and Array types, but with tighter semantics: all
elements must be of the same type.
# Replace the Union and List types with a new type: Java objects. Java objects
can be anything and can vary from row to row. Java objects are processed using
UDFs (or Drill functions.)
# All "extended" types (complex: Map, Dict and Array, or Java objects) must be
reduced to primitive types in a top-level tuple if the client is ODBC (which
cannot handle non-relational types.) The same is true if the destination is a
simple sink such as CSV or JDBC.
# Provide a light-weight way to resolve schema ambiguities that are identified
by the new, stricter type rules. The light-weight solution is either a file or
some kind of simple Drill-managed registry akin to the plugin registry. Users
can run a query, see if there are conflicting types, and, if so, add a
resolution rule to the registry. The user then reruns the query with a clean
result.
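
As a rough illustration of how the strict same-schema rule and the light-weight resolution registry could interact, here is a minimal Java sketch. Everything in it is hypothetical: the class name, the use of plain strings for type names, and the registry as an in-memory map are assumptions for discussion, not an existing Drill API or a proposed design.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class StrictSchemaSketch {
    // Hypothetical registry: column name -> type to use when readers disagree.
    private final Map<String, String> resolutionRules = new LinkedHashMap<>();

    void addRule(String column, String type) {
        resolutionRules.put(column, type);
    }

    // Apply any registered rule to the columns present in a reader schema.
    private Map<String, String> applyRules(Map<String, String> schema) {
        Map<String, String> out = new LinkedHashMap<>(schema);
        for (Map.Entry<String, String> rule : resolutionRules.entrySet()) {
            if (out.containsKey(rule.getKey())) {
                out.put(rule.getKey(), rule.getValue());
            }
        }
        return out;
    }

    // Strict rule: after resolution, both schemas must be identical.
    // A mismatch is fatal to the query, modeled here as an exception.
    Map<String, String> requireSame(Map<String, String> a, Map<String, String> b) {
        Map<String, String> ra = applyRules(a);
        Map<String, String> rb = applyRules(b);
        if (!ra.equals(rb)) {
            throw new IllegalStateException("Schema conflict: " + ra + " vs " + rb);
        }
        return ra;
    }

    public static void main(String[] args) {
        Map<String, String> reader1 = new LinkedHashMap<>();
        reader1.put("c", "BIGINT");
        Map<String, String> reader2 = new LinkedHashMap<>();
        reader2.put("c", "VARCHAR");

        StrictSchemaSketch checker = new StrictSchemaSketch();
        try {
            checker.requireSame(reader1, reader2);
        } catch (IllegalStateException e) {
            System.out.println("first run fails: " + e.getMessage());
        }

        // The user adds a resolution rule and reruns the query.
        checker.addRule("c", "VARCHAR");
        System.out.println("second run: " + checker.requireSame(reader1, reader2));
    }
}
```

The sketch mirrors the workflow described above: the first run fails on the conflicting types for {{c}}, the user records a rule, and the rerun succeeds with a single agreed schema.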
In the past couple of years we have made progress in some of these areas. This
ticket suggests we bring those threads together in a coherent strategy.
h4. Arrow/Java/Fixed Block/Something Else Storage
The ideas here are independent of choices we might make for our internal data
representation format. The above design works equally well with either Drill or
Arrow vectors, or with something else entirely.