Paul Rogers created DRILL-7553:
----------------------------------

             Summary: Modernize type management
                 Key: DRILL-7553
                 URL: https://issues.apache.org/jira/browse/DRILL-7553
             Project: Apache Drill
          Issue Type: Improvement
    Affects Versions: 1.17.0
            Reporter: Paul Rogers


This is a roll-up issue for our ongoing discussion around improving and 
modernizing Drill's runtime type system. At present, Drill approaches types 
very differently from most other database and query tools:

 * Drill does little (or no) plan-time type checking and propagation. Instead, 
all type management is done at execution time, in each reader, in each 
operator, and ultimately in the client.
 * Drill allows structured types (Map, Dict, Arrays), but does not have the 
extended SQL statements to fully utilize these types.
 * Drill supports varying types: two readers can both read column {{c}}, but can 
do so with different types. We've always hoped to discover some way to 
reconcile the types. But, at present, the functionality is buggy and 
incomplete. It is not clear that a viable solution exists. Drill also provides 
"formal" varying types: Union and List. These types are also not fully 
supported.

These three topics are closely related. "Schema-free" means we must infer types 
at read time, so Drill cannot do the plan-time type analysis done in other 
engines. Because of schema-on-read (which is what "schema-free" really means), 
two readers can read different types for the same field, and so we end up with 
varying or inconsistent types and are forced to find some way to manage the 
conflicts.
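To make the conflict concrete, here is a minimal sketch of the stricter rule proposed below: a type change for the same column across batches is detected and treated as fatal. All names here ({{MinorType}}, {{ColumnSchema}}, {{SchemaChecker}}) are hypothetical illustrations, not Drill's actual APIs.

```java
// Illustration only: hypothetical types, not Drill's actual vector/schema classes.
import java.util.List;
import java.util.Map;

enum MinorType { INT, BIGINT, VARCHAR }

record ColumnSchema(String name, MinorType type) {}

final class SchemaChecker {
    // Under the proposed stricter rule, two readers (or two batches) that
    // disagree on a column's type are a fatal, query-level error rather
    // than something each downstream operator must try to reconcile.
    static void verifySameSchema(Map<String, MinorType> seen, List<ColumnSchema> batch) {
        for (ColumnSchema col : batch) {
            MinorType prior = seen.putIfAbsent(col.name(), col.type());
            if (prior != null && prior != col.type()) {
                throw new IllegalStateException(String.format(
                    "Schema conflict on column `%s`: %s vs %s",
                    col.name(), prior, col.type()));
            }
        }
    }
}
```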

The gist of the proposal explored in this ticket is to exploit the learning 
from other engines: to embrace types when available, and to impose tractable 
rules when types are discovered at run time.

h4. Proposal Summary

This is very much a discussion draft. Here are some suggestions to get started.

# Make plan-time type management our goal. Runtime type discovery becomes a 
(limited) special case.
# Pull type resolution, propagation and checking into the planner where it can 
be done once per query. Move it out of execution where it must be done multiple 
times: once per operator per minor fragment. Implement the standard DB type 
checking and propagation rules. (These rules are currently implicitly 
implemented deep in the code gen code.)
# Generate operator code in the planner; send it to workers as part of the 
physical plan (to avoid the need to generate the code on each worker.)
# Provide schema-aware extensions for storage and format plugins so that they 
can advertise a schema when known. (Examples: Hive sources get schemas from 
HMS, JDBC sources get schema from the underlying database, Avro, Parquet and 
others obtain schema from the target files, etc.) This mechanism works with, 
but is in addition to, the Drill metastore. 
# Separate the concepts of "schema-free" (no plan-time schema) from 
"schema-on-read" (schema is known in the planner, and data is read into that 
schema by readers; e.g. the Hive model.) Drill remains schema-on-read (for 
sources that need it), but does not attempt the impossible with schema-free 
(that is, we no longer read inconsistent data into a relational model and hope 
we can make it work.)
# For convenience, allow "schema-free" (no plan-time schema). The restriction 
is that all readers *must* produce the same schema. It is a fatal (to the query) 
error for an operator to receive batches with different schemas. (The reasons 
can be discussed separately.)
# Preserve the Map, Dict and Array types, but with tighter semantics: all 
elements must be of the same type.
# Replace the Union and List types with a new type: Java objects. Java objects 
can be anything and can vary from row-to-row. Java types are processed using 
UDFs (or Drill functions.)
# All "extended" types (complex: Map, Dict and Array, or Java objects) must be 
reduced to primitive types in a top-level tuple if the client is ODBC (which 
cannot handle non-relational types.) The same is true if the destination is a 
simple sink such as CSV or JDBC.
# Provide a light-weight way to resolve schema ambiguities that are identified 
by the new, stricter type rules. The light-weight solution is either a file or 
some kind of simple Drill-managed registry akin to the plugin registry. Users 
can run a query, see if there are conflicting types, and, if so, add a 
resolution rule to the registry. The user then reruns the query with a clean 
result.
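As a sketch of point 2, the "standard DB type checking and propagation rules" amount to resolving every expression's output type once, in the planner, using a widening lattice for implicit casts. The names below ({{SqlType}}, {{TypeRules}}) and the three-type lattice are hypothetical simplifications for illustration; Drill's real rules are currently buried in the code-gen layer.

```java
// Sketch of plan-time type propagation with a toy numeric widening lattice.
// Hypothetical names; not Drill's actual planner code.
import java.util.List;

enum SqlType { INT, BIGINT, DOUBLE, VARCHAR }

final class TypeRules {
    // Simple widening order: INT < BIGINT < DOUBLE.
    private static final List<SqlType> NUMERIC_ORDER =
        List.of(SqlType.INT, SqlType.BIGINT, SqlType.DOUBLE);

    // Resolve the output type of a binary arithmetic expression once, at
    // plan time, so every operator receives a fully typed expression tree
    // instead of re-deriving types per operator per minor fragment.
    static SqlType resolveArithmetic(SqlType left, SqlType right) {
        int l = NUMERIC_ORDER.indexOf(left);
        int r = NUMERIC_ORDER.indexOf(right);
        if (l < 0 || r < 0) {
            throw new IllegalArgumentException(
                "Plan-time type error: cannot apply arithmetic to "
                + left + " and " + right);
        }
        // The wider operand wins; the planner inserts an implicit cast
        // on the narrower side.
        return NUMERIC_ORDER.get(Math.max(l, r));
    }
}
```

With rules like this in the planner, a type mismatch surfaces as a single plan-time error rather than as a runtime failure in whichever operator happened to hit it first.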

In the past couple of years we have made progress in some of these areas. This 
ticket suggests we bring those threads together in a coherent strategy.

h4. Arrow/Java/Fixed Block/Something Else Storage

The ideas here are independent of choices we might make for our internal data 
representation format. The above design works equally well with either Drill or 
Arrow vectors, or with something else entirely.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
