[jira] [Commented] (DRILL-5376) Rationalize Drill's row metadata for simpler code, better performance

Paul Rogers (JIRA) Fri, 24 Mar 2017 11:43:21 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-5376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15940909#comment-15940909
 ]


Paul Rogers commented on DRILL-5376:
------------------------------------

To follow up on the above comment, it has been mentioned here several times 
that Drill maps are not truly maps; they are nested tuples. It is work clearly 
defining what this means.

A map is an object that contains name/value pairs. Each map instance is 
distinct from all others. That is, given two maps m1 and m2, we can have:

{code}
m1 = { a: 10, b: "fred" }
m2 = { custNo: "1234-XYZ", name: "wilma" }
{code}

The above definition of map is how maps work in Python, JSON, JavaScript, Java 
and so on. These are very handy structures indeed.

However, that is *not* how Drill uses the term "map." The term "map" in Drill 
refers to a nested tuple. So, let's define a tuple. Like a map, a tuple is a 
collection of named items. But, unlike a map, a tuple has a schema defined 
separately from each instance. So, for a classic DB table, we might have a 
schema {{(custID : INT, name : VARCHAR)}}. We can then have multiple tuple 
instances (often called rows or records): {{(100, "Bob"), (200, "Fred")}}, etc.

Tuples are handy for DB processing: the engine knows the schema and can store, 
process and display all tuples using a uniform schema.

A Drill "map" is really a tuple. All Drill map instances use the same schema in 
the form of a shared set of member vectors. Thus, in the {{(custID, name)}} 
example above, all map instances in a record batch share a set of scalar 
vectors: one for custID, another for name. (Impala, for example, has distinct 
types for a tuple (a "structure") and a true map (a "map".)

Since Drill maps are, in fact, tuples, this provides Drill with the ability to 
greatly simplify what would otherwise be complex code. (Of course, it also 
means that Drill cannot represent JSON documents that use maps as true maps, 
not as nested structures.)

Since maps are nested tuples, Drill can/should provide a simplified naming. 
Take the schema {{(a : INT, b : map(c : VARCHAR, d: INT))}}. Once we realize 
that maps are nested tuples, we can define a simplified, but equivalent, 
schema: {{a : INT, b.c : VARCHAR, b.d: INT}}. This means that access to a 
top-level scalar (a) should be identical to a nested scalar (b.c or b.d). And, 
all columns can allow indexed access: (0 = a, 1 = b.c, 2 = b.d).

Many of the comments here reflect that Drill has not fully come to terms with 
its own map implementation: some of the code tries to treat Drill maps as true 
maps, others treat it as what it is: a nested tuple. Moving to a full nested 
tuple view means:

* Map members are just scalars; they can be used in any expression such as 
sort, filter, hash keys, aggregation keys, and so on.
* Member access can be made uniform. Maps are a schema-time concept, not a data 
representation concept. All data is simply a bundle of vectors; the names of 
the vectors reflect the nested tuple structure.


> Rationalize Drill's row metadata for simpler code, better performance
> ---------------------------------------------------------------------
>
>                 Key: DRILL-5376
>                 URL: https://issues.apache.org/jira/browse/DRILL-5376
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>
> Drill is a columnar system, but data is ultimately represented as rows (AKA 
> records or tuples.) The way that Drill represents rows leads to excessive 
> code complexity and runtime cost.
> Data in Drill is stored in vectors: one (or more) per column. Vectors do not 
> stand alone, however, they are "bundled" into various forms of grouping: the 
> {{VectorContainer}}, {{RecordBatch}}, {{VectorAccessible}}, 
> {{VectorAccessibleSerializable}}, and more. Each has slightly different 
> semantics, requiring large amounts of code to bridge between the 
> representations.
> Consider only a simple row: one with only scalar columns. In classic 
> relational theory, such a row is a tuple:
> {code}
> R = (a, b, c, d, ...)
> {code}
> A tuple is defined as an ordered list of column values. Unlike a list or 
> array, the column values also have names and may have varying data types.
> In SQL, columns are referenced by either position or name. In most execution 
> engines, columns are referenced by position (since positions, in most 
> systems, cannot change.) A 1:1 mapping is provided between names and 
> positions. (See the JDBC {{RecordSet}} interface.)
> This allows code to be very fast: code references columns by index, not by 
> name, avoiding name lookups for each column reference.
> Drill provides a murky, hybrid approach. Some structures ({{BatchSchema}}, 
> for example) appear to provide a fixed column ordering, allowing indexed 
> column access. But, other abstractions provide only an iterator. Others (such 
> as {{VectorContainer}}) provides name-based access or, by clever programming, 
> indexed access.
> As a result, it is never clear exactly how to quickly access a column: by 
> name, by name to multi-part index to vector?
> Of course, Drill also supports maps, which add to the complexity. First, we 
> must understand that a "map" in Drill is not a "map" in the classic sense: it 
> is not a collection of (name, value) pairs in the JSON sense: a collection in 
> which each instance may have a different set of pairs.
> Instead, in Drill, a "map" is really a nested tuple: a map has the same 
> structure as a Drill record: a collection of names and values in which all 
> rows have the same structure. (This is so because maps are really a 
> collection of value vectors, and the vectors cut across all rows.)
> Drill, however, does not reflect this symmetry: that a row and a map are both 
> tuples. There are no common abstractions for the two. Instead, maps are 
> represented as a {{MapVector}} that contains a (name, vector) map for its 
> children.
> Because of this name-based mapping, high-speed indexed access to vectors is 
> not provided "out of the box." Certainly each consumer of a map can build its 
> own indexing mechanism. But, this leads to code complexity and redundancy.
> This ticket asks to rationalize Drill's row, map and schema abstractions 
> around the tuple concept. A schema is a description of a tuple and should (as 
> in JDBC) provide both name and index based access. That is, provide methods 
> of the form:
> {code}
> MaterializedField getField(int index);
> MaterializedField getField(String name);
> ...
> ValueVector getVector(int index);
> ValueVector getVector(String name);
> {code}
> Provide a common abstraction for rows and maps, recognizing their structural 
> similarity.
> There is an obvious issue with indexing columns in a row when the row 
> contains maps. Should indexing be multi-part (index into row, then into map) 
> as today? A better alternative is to provide a flattened interface:
> {code}
> 0: a, 1: b.x, 2: b.y, 3: c, ...
> {code}
> Use this change to simplify client code, over time, to use a simple 
> indexed-based column access.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (DRILL-5376) Rationalize Drill's row metadata for simpler code, better performance

Reply via email to