Paul Rogers created DRILL-5958:
----------------------------------

             Summary: Revisit the List and RepeatedList vectors
                 Key: DRILL-5958
                 URL: https://issues.apache.org/jira/browse/DRILL-5958
             Project: Apache Drill
          Issue Type: Improvement
    Affects Versions: 1.11.0
            Reporter: Paul Rogers


Drill provides a List vector used when reading JSON data. The semantics of this 
vector are somewhat obscure and overly complex. This ticket asks to clean up 
the design and implementation of this vector.

h4. Current Behavior

Drill contains two kinds of repeated types:

* Repeated vectors, which exist for all Drill types.
* List vectors, which exist outside the repeated vector system.

Lists are rather hard to explain.

Drill has 38 types. Each type comes in three cardinalities: Required (0), 
Optional (0, 1) or Repeated (0..n). Thus, there is an {{IntVector}}, a 
{{NullableIntVector}} and a {{RepeatedIntVector}}.

Lists are an an odd duck and exist outside of this system. A list is not simply 
another level of repetition (a {{RepeatedRepeatedIntVector}}. Rather, a list is 
heterogeneous: it is just a list of something.

For this reason, the List type is closely associated with the Union type: a 
list is, essentially, a "repeated Union", though it is not implemented that way.

Strangely, Drill does have a {{RepeatedListVector}}, which introduces all 
manner of ambiguity. Combining these, the cardinality hierarchy for unions is:

* {{UnionVector}} (like an optional union type)
* {{ListVector}} (repeated union)
* {{RepeatedListVector}} (a 2D union array)
* {{RepeatedListVector}} which contains a {{ListVector}} (a 3D union grid. Note 
that this could also be implemented as a {{ListVector}} that contains a 
{{RepeatedListVector}}.)
* {{RepeatedListVector}} which contains a {{RepeatedListVector}} (a 4D hyper 
grid.)
* And so on.

For a primitive type, such as Int, we have:

* {{IntVector}} or {{NullableIntVector}} (cardinality of 1 or (0,1))
* {{RepeatedIntVector}} (a 1D list of Int)
* {{ListVector}} which contains a {{RepeatedIntVector}} (a 2D array of ints. 
Not that this could have been a {{RepeatedListVector}} that stores only ints.)
* {{RepeatedListVector}} which contains a {{RepeatedIntVector}} (a 3D cube of 
ints. This could also be formed by a {{ListVector}} that contains a 
{{ListVector}} that contains a {{RepeatedIntVector}} along with several other 
combinations.)

h4. Examples of Current Behavior

Lists and repeated types appeared to evolve to support JSON-like structures. 
For example:

{code}
{a: 10} {a: null}
{code}

Here, `a` is a nullable scalar and is represented as a {{NullableIntVector}}.

{code}
{a: [10, 20]}
{code}

Here, `a` is a list of Int and is represented as a {{RepeatedIntVector}}. 

Drill does not allow nulls in such vectors, so we cannot represent:

{code}
{a: [10, null, 20]}
{code}

Once we go beyond 1D, we need lists:

{code}
{a: [[10, 20], [30, 40]]}
{code}

The above requires a {{ListVector}} that contains a {{RepeatedIntVector}}.

{code}
{a: [[[110, 120], [130, 140]], [210, 220], [230, 240]]}
{code}

The above requires a {{RepeatedListVector}} that contains a 
{{RepeatedIntVector}}.

Similarly, since lists can hold any type (just like a union), we can have 
repeated objects:

{code}
{a: [[{x: 0, y: 0}, {x: 1, y: 0}], [{x: 4, y: 0}, {x: 4, y: 1}]]}
{code}

The above would be represented as a {{ListVector}} that contains a 
{{RepeatedMapVector}}. (Or, equivalently, a {{RepeatedListVector}} that 
contains a {{MapVector}}.)

Because the List vector is a union type, it can (presumably) also handle 
heterogeneous lists (though this needs to be checked to see if the code 
actually supports this case):

{code}
{a: [10, "fred", 123.45, null]}
{code}

Since unions support combinations of not just scalars, but also scalars and 
complex types, Drill can also support:

{code}
{a: [10, {b: "foo"}, null, [10, "bob"]]}
{code}

h4. Muddy Semantics

The above show a number of problems that make lists (and unions) far more 
complex than necessary:

* Ambiguity of when to use a {{ListVector}} of {{FooVector}} vs. a 
{{RepeatedFooVector}}.
* Ambiguity of when to use a {{ListVector}} of {{RepeatedFooVector}} vs. a 
{{RepeatedListVector}} of {{FooVector}}.

The same solution used to handle extra layers of repetition is used to handle 
variant types (DRILL-5955):

* Lists can handle any combination of scalars.
* Lists can handle any structure type (map, repeated map, list, repeated list).
* Lists are thus not typed. They are not a "List of Int", they are just a List.

h4. Mapping to SQL

The above muddy semantics give rise to this question. Drill is a SQL engine, 
how do we map the List semantics to a relational schema? If we don't have a 
clean answer, then the List type, while clever, does not have a useful purpose 
and is instead distracting us from the larger question of how we map JSON-like 
structures to a relational schema.

This is not a new problem; it has been with the community at least since the 
days of XML. There simply XML (and JSON) structures that do not map to a 
relational structure, though all relational structures can be mapped to JSON.

Drill's job, then, is to encode any JSON structure which has a relational 
interpretation. IMHO, it is not Drill's job to encode non-relational 
structures. (In the same way, in years past, it was not the job of BI tools to 
encode an XML-encoded HTML document to produce charts or reports.)

h4. Allowable JSON Structure

Let us take a stab at the allowable JSON structures:

* Objects (at any level) which map to either the top-level row, or nested 
tuples (called "maps" in Drill.)
* Arrays of scalars which map to scalar arrays in Drill (which must be 
flattened into the global name space, using dotted names, to map to a JDBC or 
ODBC client.)
* Arrays of objects, which map to tuple arrays in Drill (which must be 
flattened or explicitly joined to map to a JDBC or ODBC client.)
* Fields with heterogeneous types, which map to union/variant types in Drill 
(and which must be reconciled to present as SQL to an ODBC/JDBD client.)
* Array with heterogeneous types, which is one convention for efficiently 
storing tuples. These map to an array of union/variant types, and must be 
mapped in Drill to a true tuple, then flattened, to present to a JDBC or ODBC 
client.

This definition starts with the relational model, then determines the many ways 
the relational model can be encoded in JSON. This analysis suggests that we 
need not handle extreme cases:

* Multi-dimensional arrays (which can't be mapped to the relational model 
easily, though they could be mapped to an MDX-style multi-dimensional model.)
* Heterogeneous collections of simple and structured types except in the 
tuple-as-array case.

For the heterogeneous cases, Drill would need some up-front information to know 
that, say, an array represents a tuple and to expect the heterogeneous 
encoding. In this case, each array slot would have a single type; it would not 
make sense to have a tuple-as-array in which each slot is a variant. (There is 
no sane way to map such a structure back to a fixed-type relational model.)

h4. Proposal: Generic Array Vector

Given the above, we can see that the current List and Union types are far too 
general for use in a relational system. We can suggest simplifications.

* The union/variant type handles mismatched data types. (But, as explained in 
DRILL-5955, a better solution is to dispense with the union/vector type and 
instead allow the user to specify a type to which the input values are mapped 
at read time.)
* The map and repeated map vectors already handle nested tuples and nested 
tables (arrays of tuples).

Given this, there is really no need for the union or list types. But, suppose 
we wanted to keep them anyway (because, say, we are not convinced by the above 
arguments, Drill already has them, and the community does not want to take the 
time to consider whether they should be removed.)

One solution is to rationalize the type system:

* Define a base (required) vector for each primitive type and for the map 
(tuple) type.
* Define a single nullable vector type which holds one of the base vectors.
* Define a repeated vector type that holds one of the base vectors.
* If necessary, define a variant which can hold any primitive value and itself 
is considered a nullable vector.

In this model, there is exactly one mapping from any array structure to a 
vector structure.

|| JSON || Vector Structure ||
| {a: 10} | IntVector |
| {a: 10}, {a: null} | NullableVector(IntVector) |
| {a: 10}, {a: "foo"} | VariantVector |
| {a: \[10, 20]} | ArrayVector(IntVector) |
| {a: \[\[10, 20], \[30, 40]]} | ArrayVector(ArrayVector(IntVector)) |
| {a: {b: 10}} | TupleVector |
| {a: \[{b: 10}, {b: 20}]} | 
ArrayVector(TupleVector) |
| {a: \[10, "foo"]} | ArrayVector(VariantVector) |
| {a: \[10, "foo"]} | TupleVector (if hints provided) |

h4. Arrow

[Arrow|https://arrow.apache.org/docs/memory_layout.html] has adopted a solution 
similar to the proposed {{ArrayVector}}. In Arrow, a {{List}} vector is a list 
of some type: {{List<Int>}}, say.

h4. Backward Compatibility

Because Drill exposes internal vector representation through the network APIs, 
the above is a breaking change for the network API. A solution such as 
DRILL-5947 is required before we can implement a solution such as the above.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to