Paul Rogers created DRILL-5376:
----------------------------------

             Summary: Rationalize Drill's row structure for simpler code, 
better performance
                 Key: DRILL-5376
                 URL: https://issues.apache.org/jira/browse/DRILL-5376
             Project: Apache Drill
          Issue Type: Improvement
    Affects Versions: 1.10.0
            Reporter: Paul Rogers


Drill is a columnar system, but data is ultimately represented as rows (AKA 
records or tuples). The way that Drill represents rows leads to excessive code 
complexity and runtime cost.

Data in Drill is stored in vectors: one (or more) per column. Vectors do not 
stand alone, however; they are "bundled" into various forms of grouping: the 
{{VectorContainer}}, {{RecordBatch}}, {{VectorAccessible}}, 
{{VectorAccessibleSerializable}}, and more. Each has slightly different 
semantics, requiring large amounts of code to bridge between the 
representations.

Consider only a simple row: one with only scalar columns. In classic relational 
theory, such a row is a tuple:

{code}
R = (a, b, c, d, ...)
{code}

A tuple is defined as an ordered list of column values. Unlike a list or array, 
the column values also have names and may have varying data types.

In SQL, columns are referenced by either position or name. In most execution 
engines, columns are referenced by position (since positions, in most systems, 
cannot change). A 1:1 mapping is provided between names and positions. (See the 
JDBC {{ResultSet}} interface.)

This allows code to be very fast: code references columns by index, not by 
name, avoiding name lookups for each column reference.
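
As an illustration of the point above, here is a minimal sketch in plain Java. It is not Drill or JDBC code: the class {{PositionalAccessSketch}} and its methods are hypothetical, and rows are modeled as simple {{long}} arrays. The name is resolved to a position once, and the per-row loop then uses only the index.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not a Drill or JDBC class.
class PositionalAccessSketch {

  /** Builds the 1:1 name-to-position mapping for an ordered column list. */
  static Map<String, Integer> indexByName(List<String> columnNames) {
    Map<String, Integer> positions = new HashMap<>();
    for (int i = 0; i < columnNames.size(); i++) {
      positions.put(columnNames.get(i), i);
    }
    return positions;
  }

  /** Sums one column across all rows, resolving the name only once. */
  static long sumColumn(List<String> columnNames, long[][] rows, String name) {
    Integer position = indexByName(columnNames).get(name);
    if (position == null) {
      throw new IllegalArgumentException("No such column: " + name);
    }
    int col = position;
    long total = 0;
    for (long[] row : rows) {
      total += row[col]; // positional access: no name lookup per row
    }
    return total;
  }

  public static void main(String[] args) {
    List<String> names = Arrays.asList("a", "b", "c");
    long[][] rows = { {1, 10, 100}, {2, 20, 200} };
    System.out.println(sumColumn(names, rows, "b")); // prints 30
  }
}
```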

Drill provides a murky, hybrid approach. Some structures ({{BatchSchema}}, for 
example) appear to provide a fixed column ordering, allowing indexed column 
access. But other abstractions provide only an iterator. Others (such as 
{{VectorContainer}}) provide name-based access or, by clever programming, 
indexed access.

As a result, it is never clear exactly how to quickly access a column: by name, 
by index, or by name to multi-part index to vector?

Of course, Drill also supports maps, which add to the complexity. First, we 
must understand that a "map" in Drill is not a "map" in the classic sense. It 
is not a collection of (name, value) pairs in the JSON sense, in which each 
instance may have a different set of pairs.

Instead, in Drill, a "map" is really a nested tuple: a map has the same 
structure as a Drill record: a collection of names and values in which all rows 
have the same structure. (This is so because maps are really a collection of 
value vectors, and the vectors cut across all rows.)

Drill, however, does not reflect this symmetry: that a row and a map are both 
tuples. There are no common abstractions for the two. Instead, maps are 
represented as a {{MapVector}} that contains a (name, vector) map for its 
children.

Because of this name-based mapping, high-speed indexed access to vectors is not 
provided "out of the box." Certainly each consumer of a map can build its own 
indexing mechanism, but this leads to code complexity and redundancy.

This ticket asks to rationalize Drill's row, map and schema abstractions around 
the tuple concept. A schema is a description of a tuple and should (as in JDBC) 
provide both name and index based access. That is, provide methods of the form:

{code}
MaterializedField getField(int index);
MaterializedField getField(String name);
...
ValueVector getVector(int index);
ValueVector getVector(String name);
{code}

Provide a common abstraction for rows and maps, recognizing their structural 
similarity.
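
One possible shape for such a common abstraction is sketched below. This is only an illustration under stated assumptions: {{TupleSketch}} and its nested {{Tuple}} class are hypothetical names, not existing Drill classes, and plain strings stand in for vectors. The same class serves as both the row and the map, and offers both name and index access.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names exist in Drill.
class TupleSketch {

  /** An ordered set of named members, reachable by index or by name. */
  static class Tuple<T> {
    private final List<String> names = new ArrayList<>();
    private final List<T> members = new ArrayList<>();
    private final Map<String, Integer> indexes = new HashMap<>();

    /** Appends a member and returns its fixed position. */
    int add(String name, T member) {
      if (indexes.containsKey(name)) {
        throw new IllegalArgumentException("Duplicate column: " + name);
      }
      int index = members.size();
      names.add(name);
      members.add(member);
      indexes.put(name, index);
      return index;
    }

    int size() { return members.size(); }

    /** Returns the position for a name, or -1 if the name is unknown. */
    int indexOf(String name) {
      Integer index = indexes.get(name);
      return index == null ? -1 : index;
    }

    String name(int index) { return names.get(index); }

    T get(int index) { return members.get(index); }

    T get(String name) {
      int index = indexOf(name);
      if (index < 0) {
        throw new IllegalArgumentException("No such column: " + name);
      }
      return members.get(index);
    }
  }

  @SuppressWarnings("unchecked")
  public static void main(String[] args) {
    // The map b(x, y) is itself a tuple...
    Tuple<Object> map = new Tuple<>();
    map.add("x", "vector for b.x");
    map.add("y", "vector for b.y");

    // ...and so is the row (a, b, c) that contains it.
    Tuple<Object> row = new Tuple<>();
    row.add("a", "vector for a");
    int bIndex = row.add("b", map);
    row.add("c", "vector for c");

    Tuple<Object> b = (Tuple<Object>) row.get(bIndex);
    System.out.println(row.indexOf("c") + " " + b.name(1)); // prints "2 y"
  }
}
```

A consumer would resolve names once with {{indexOf()}} and then use only {{get(int)}} in its inner loops.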

There is an obvious issue with indexing columns in a row when the row contains 
maps. Should indexing be multi-part (index into row, then into map) as today? A 
better alternative is to provide a flattened interface:

{code}
0: a, 1: b.x, 2: b.y, 3: c, ...
{code}
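
A flattened numbering like the one above could be derived by a depth-first walk of the nested schema. The sketch below is illustrative only: {{FlattenSketch}} is a hypothetical class, and the schema is modeled as nested {{LinkedHashMap}} instances (a nested map for a Drill map, a type-name string for a scalar) rather than Drill's own schema classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the schema model here is an assumption, not Drill's.
class FlattenSketch {

  /** Returns dotted column paths; a path's list position is its flat index. */
  static List<String> flatten(Map<String, Object> tuple) {
    List<String> out = new ArrayList<>();
    flatten("", tuple, out);
    return out;
  }

  @SuppressWarnings("unchecked")
  private static void flatten(String prefix, Map<String, Object> tuple,
                              List<String> out) {
    for (Map.Entry<String, Object> entry : tuple.entrySet()) {
      String path = prefix + entry.getKey();
      if (entry.getValue() instanceof Map) {
        // A map is a nested tuple: descend instead of taking an index.
        flatten(path + ".", (Map<String, Object>) entry.getValue(), out);
      } else {
        out.add(path);
      }
    }
  }

  /** Builds the example row (a, b(x, y), c) in declaration order. */
  static Map<String, Object> example() {
    Map<String, Object> b = new LinkedHashMap<>();
    b.put("x", "INT");
    b.put("y", "INT");
    Map<String, Object> row = new LinkedHashMap<>();
    row.put("a", "INT");
    row.put("b", b);
    row.put("c", "VARCHAR");
    return row;
  }

  public static void main(String[] args) {
    List<String> flat = flatten(example());
    StringBuilder line = new StringBuilder();
    for (int i = 0; i < flat.size(); i++) {
      if (i > 0) {
        line.append(", ");
      }
      line.append(i).append(": ").append(flat.get(i));
    }
    System.out.println(line); // prints "0: a, 1: b.x, 2: b.y, 3: c"
  }
}
```

In this sketch the map column itself receives no index; only leaf columns do. Whether the map should also be addressable is a design choice the sketch does not settle.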

Use this change to simplify client code, over time, to use simple 
index-based column access.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
