Paul Rogers created DRILL-5958:
----------------------------------
Summary: Revisit the List and RepeatedList vectors
Key: DRILL-5958
URL: https://issues.apache.org/jira/browse/DRILL-5958
Project: Apache Drill
Issue Type: Improvement
Affects Versions: 1.11.0
Reporter: Paul Rogers
Drill provides a List vector used when reading JSON data. The semantics of this
vector are somewhat obscure and overly complex. This ticket asks to clean up
the design and implementation of this vector.
h4. Current Behavior
Drill contains two kinds of repeated types:
* Repeated vectors, which exist for all Drill types.
* List vectors, which exist outside the repeated vector system.
Lists are rather hard to explain.
Drill has 38 types. Each type comes in three cardinalities: Required (0),
Optional (0, 1) or Repeated (0..n). Thus, there is an {{IntVector}}, a
{{NullableIntVector}} and a {{RepeatedIntVector}}.
Lists are an an odd duck and exist outside of this system. A list is not simply
another level of repetition (a {{RepeatedRepeatedIntVector}}. Rather, a list is
heterogeneous: it is just a list of something.
For this reason, the List type is closely associated with the Union type: a
list is, essentially, a "repeated Union", though it is not implemented that way.
Strangely, Drill does have a {{RepeatedListVector}}, which introduces all
manner of ambiguity. Combining these, the cardinality hierarchy for unions is:
* {{UnionVector}} (like an optional union type)
* {{ListVector}} (repeated union)
* {{RepeatedListVector}} (a 2D union array)
* {{RepeatedListVector}} which contains a {{ListVector}} (a 3D union grid. Note
that this could also be implemented as a {{ListVector}} that contains a
{{RepeatedListVector}}.)
* {{RepeatedListVector}} which contains a {{RepeatedListVector}} (a 4D hyper
grid.)
* And so on.
For a primitive type, such as Int, we have:
* {{IntVector}} or {{NullableIntVector}} (cardinality of 1 or (0,1))
* {{RepeatedIntVector}} (a 1D list of Int)
* {{ListVector}} which contains a {{RepeatedIntVector}} (a 2D array of ints.
Not that this could have been a {{RepeatedListVector}} that stores only ints.)
* {{RepeatedListVector}} which contains a {{RepeatedIntVector}} (a 3D cube of
ints. This could also be formed by a {{ListVector}} that contains a
{{ListVector}} that contains a {{RepeatedIntVector}} along with several other
combinations.)
h4. Examples of Current Behavior
Lists and repeated types appeared to evolve to support JSON-like structures.
For example:
{code}
{a: 10} {a: null}
{code}
Here, `a` is a nullable scalar and is represented as a {{NullableIntVector}}.
{code}
{a: [10, 20]}
{code}
Here, `a` is a list of Int and is represented as a {{RepeatedIntVector}}.
Drill does not allow nulls in such vectors, so we cannot represent:
{code}
{a: [10, null, 20]}
{code}
Once we go beyond 1D, we need lists:
{code}
{a: [[10, 20], [30, 40]]}
{code}
The above requires a {{ListVector}} that contains a {{RepeatedIntVector}}.
{code}
{a: [[[110, 120], [130, 140]], [210, 220], [230, 240]]}
{code}
The above requires a {{RepeatedListVector}} that contains a
{{RepeatedIntVector}}.
Similarly, since lists can hold any type (just like a union), we can have
repeated objects:
{code}
{a: [[{x: 0, y: 0}, {x: 1, y: 0}], [{x: 4, y: 0}, {x: 4, y: 1}]]}
{code}
The above would be represented as a {{ListVector}} that contains a
{{RepeatedMapVector}}. (Or, equivalently, a {{RepeatedListVector}} that
contains a {{MapVector}}.)
Because the List vector is a union type, it can (presumably) also handle
heterogeneous lists (though this needs to be checked to see if the code
actually supports this case):
{code}
{a: [10, "fred", 123.45, null]}
{code}
Since unions support combinations of not just scalars, but also scalars and
complex types, Drill can also support:
{code}
{a: [10, {b: "foo"}, null, [10, "bob"]]}
{code}
h4. Muddy Semantics
The above show a number of problems that make lists (and unions) far more
complex than necessary:
* Ambiguity of when to use a {{ListVector}} of {{FooVector}} vs. a
{{RepeatedFooVector}}.
* Ambiguity of when to use a {{ListVector}} of {{RepeatedFooVector}} vs. a
{{RepeatedListVector}} of {{FooVector}}.
The same solution used to handle extra layers of repetition is used to handle
variant types (DRILL-5955):
* Lists can handle any combination of scalars.
* Lists can handle any structure type (map, repeated map, list, repeated list).
* Lists are thus not typed. They are not a "List of Int", they are just a List.
h4. Mapping to SQL
The above muddy semantics give rise to this question. Drill is a SQL engine,
how do we map the List semantics to a relational schema? If we don't have a
clean answer, then the List type, while clever, does not have a useful purpose
and is instead distracting us from the larger question of how we map JSON-like
structures to a relational schema.
This is not a new problem; it has been with the community at least since the
days of XML. There simply XML (and JSON) structures that do not map to a
relational structure, though all relational structures can be mapped to JSON.
Drill's job, then, is to encode any JSON structure which has a relational
interpretation. IMHO, it is not Drill's job to encode non-relational
structures. (In the same way, in years past, it was not the job of BI tools to
encode an XML-encoded HTML document to produce charts or reports.)
h4. Allowable JSON Structure
Let us take a stab at the allowable JSON structures:
* Objects (at any level) which map to either the top-level row, or nested
tuples (called "maps" in Drill.)
* Arrays of scalars which map to scalar arrays in Drill (which must be
flattened into the global name space, using dotted names, to map to a JDBC or
ODBC client.)
* Arrays of objects, which map to tuple arrays in Drill (which must be
flattened or explicitly joined to map to a JDBC or ODBC client.)
* Fields with heterogeneous types, which map to union/variant types in Drill
(and which must be reconciled to present as SQL to an ODBC/JDBD client.)
* Array with heterogeneous types, which is one convention for efficiently
storing tuples. These map to an array of union/variant types, and must be
mapped in Drill to a true tuple, then flattened, to present to a JDBC or ODBC
client.
This definition starts with the relational model, then determines the many ways
the relational model can be encoded in JSON. This analysis suggests that we
need not handle extreme cases:
* Multi-dimensional arrays (which can't be mapped to the relational model
easily, though they could be mapped to an MDX-style multi-dimensional model.)
* Heterogeneous collections of simple and structured types except in the
tuple-as-array case.
For the heterogeneous cases, Drill would need some up-front information to know
that, say, an array represents a tuple and to expect the heterogeneous
encoding. In this case, each array slot would have a single type; it would not
make sense to have a tuple-as-array in which each slot is a variant. (There is
no sane way to map such a structure back to a fixed-type relational model.)
h4. Proposal: Generic Array Vector
Given the above, we can see that the current List and Union types are far too
general for use in a relational system. We can suggest simplifications.
* The union/variant type handles mismatched data types. (But, as explained in
DRILL-5955, a better solution is to dispense with the union/vector type and
instead allow the user to specify a type to which the input values are mapped
at read time.)
* The map and repeated map vectors already handle nested tuples and nested
tables (arrays of tuples).
Given this, there is really no need for the union or list types. But, suppose
we wanted to keep them anyway (because, say, we are not convinced by the above
arguments, Drill already has them, and the community does not want to take the
time to consider whether they should be removed.)
One solution is to rationalize the type system:
* Define a base (required) vector for each primitive type and for the map
(tuple) type.
* Define a single nullable vector type which holds one of the base vectors.
* Define a repeated vector type that holds one of the base vectors.
* If necessary, define a variant which can hold any primitive value and itself
is considered a nullable vector.
In this model, there is exactly one mapping from any array structure to a
vector structure.
|| JSON || Vector Structure ||
| {a: 10} | IntVector |
| {a: 10}, {a: null} | NullableVector(IntVector) |
| {a: 10}, {a: "foo"} | VariantVector |
| {a: \[10, 20]} | ArrayVector(IntVector) |
| {a: \[\[10, 20], \[30, 40]]} | ArrayVector(ArrayVector(IntVector)) |
| {a: {b: 10}} | TupleVector |
| {a: \[{b: 10}, {b: 20}]} |
ArrayVector(TupleVector) |
| {a: \[10, "foo"]} | ArrayVector(VariantVector) |
| {a: \[10, "foo"]} | TupleVector (if hints provided) |
h4. Arrow
[Arrow|https://arrow.apache.org/docs/memory_layout.html] has adopted a solution
similar to the proposed {{ArrayVector}}. In Arrow, a {{List}} vector is a list
of some type: {{List<Int>}}, say.
h4. Backward Compatibility
Because Drill exposes internal vector representation through the network APIs,
the above is a breaking change for the network API. A solution such as
DRILL-5947 is required before we can implement a solution such as the above.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)