[jira] [Commented] (DRILL-5376) Rationalize Drill's row metadata for simpler code, better performance

Paul Rogers (JIRA) Fri, 24 Mar 2017 10:33:02 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-5376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15940786#comment-15940786
 ]


Paul Rogers commented on DRILL-5376:
------------------------------------

Consider the schemas of rows (batches) and maps. A batch (row) schema is 
represented by {{BatchSchema}}. A row is just a tuple. The internal 
representation is an ordered list of fields (i.e. metadata description of a 
column):

{code}
  private final List<MaterializedField> fields;
{code}

A map, in Drill, is just a nested tuple: all maps within a record batch must 
have the same set of members. A single column vector represents the union of 
all instances of a given column ("a", say) across all the maps. So, the schema 
is also a tuple, but the schema is represented in {{MaterializedField}} as an 
ordered set:

{code}
  private final LinkedHashSet<MaterializedField> children;
{code}

This asymmetry means that code that looks up a field in a row differs from that 
which looks up a member of a map. Both are tuples, but the representations 
differ.

Further, the map schema, but not row schema, allows rapid name-based access. 
Name-based access at the row level requires a linear search.

Better would be to have a single {{TupleSchema}} that holds column definitions 
as both an ordered list (to allow indexed access) and a map (to allow 
name-based access). Use the same representation for both types of tuple. A 
column (field) simply references a {{TupleSchema}} if it is a map.
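
As a rough illustration of the idea, here is a minimal sketch of such a schema. The names {{TupleSchemaSketch}} and {{ColumnSketch}} are hypothetical stand-ins, not existing Drill classes, and name matching is exact (case-sensitive) purely to keep the example short:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a column definition; not Drill's MaterializedField. */
final class ColumnSketch {
    final String name;
    final TupleSchemaSketch mapSchema; // non-null only when this column is a map

    ColumnSketch(String name) {
        this(name, null);
    }

    ColumnSketch(String name, TupleSchemaSketch mapSchema) {
        this.name = name;
        this.mapSchema = mapSchema;
    }

    boolean isMap() {
        return mapSchema != null;
    }
}

/** Hypothetical tuple schema used for both rows and maps. */
final class TupleSchemaSketch {
    // Ordered list gives constant-time access by position.
    private final List<ColumnSketch> byIndex = new ArrayList<>();
    // Name-to-position map gives constant-time access by name.
    private final Map<String, Integer> indexByName = new HashMap<>();

    /** Appends a column and returns its position; duplicate names are rejected. */
    int add(ColumnSketch col) {
        if (indexByName.containsKey(col.name)) {
            throw new IllegalArgumentException("Duplicate column: " + col.name);
        }
        int index = byIndex.size();
        byIndex.add(col);
        indexByName.put(col.name, index);
        return index;
    }

    int size() {
        return byIndex.size();
    }

    ColumnSketch column(int index) {
        return byIndex.get(index);
    }

    /** Returns the named column, or null if there is no such column. */
    ColumnSketch column(String name) {
        Integer index = indexByName.get(name);
        return index == null ? null : byIndex.get(index);
    }

    /** Returns the position of the named column, or -1 if absent. */
    int indexOf(String name) {
        Integer index = indexByName.get(name);
        return index == null ? -1 : index;
    }
}
```

With this shape, code that walks a row and code that walks a map use the same calls: resolve a name to an index once, then use the index for each subsequent access.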



> Rationalize Drill's row metadata for simpler code, better performance
> ---------------------------------------------------------------------
>
>                 Key: DRILL-5376
>                 URL: https://issues.apache.org/jira/browse/DRILL-5376
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>
> Drill is a columnar system, but data is ultimately represented as rows (AKA 
> records or tuples.) The way that Drill represents rows leads to excessive 
> code complexity and runtime cost.
> Data in Drill is stored in vectors: one (or more) per column. Vectors do not 
> stand alone, however; they are "bundled" into various forms of grouping: the 
> {{VectorContainer}}, {{RecordBatch}}, {{VectorAccessible}}, 
> {{VectorAccessibleSerializable}}, and more. Each has slightly different 
> semantics, requiring large amounts of code to bridge between the 
> representations.
> Consider only a simple row: one with only scalar columns. In classic 
> relational theory, such a row is a tuple:
> {code}
> R = (a, b, c, d, ...)
> {code}
> A tuple is defined as an ordered list of column values. Unlike a list or 
> array, the column values also have names and may have varying data types.
> In SQL, columns are referenced by either position or name. In most execution 
> engines, columns are referenced by position (since positions, in most 
> systems, cannot change.) A 1:1 mapping is provided between names and 
> positions. (See the JDBC {{ResultSet}} interface.)
> This allows code to be very fast: code references columns by index, not by 
> name, avoiding name lookups for each column reference.
> Drill provides a murky, hybrid approach. Some structures ({{BatchSchema}}, 
> for example) appear to provide a fixed column ordering, allowing indexed 
> column access. But other abstractions provide only an iterator. Others (such 
> as {{VectorContainer}}) provide name-based access or, by clever programming, 
> indexed access.
> As a result, it is never clear exactly how to quickly access a column: by 
> name, by name to multi-part index to vector?
> Of course, Drill also supports maps, which add to the complexity. First, we 
> must understand that a "map" in Drill is not a "map" in the classic sense: it 
> is not a collection of (name, value) pairs in the JSON sense: a collection in 
> which each instance may have a different set of pairs.
> Instead, in Drill, a "map" is really a nested tuple: a map has the same 
> structure as a Drill record: a collection of names and values in which all 
> rows have the same structure. (This is so because maps are really a 
> collection of value vectors, and the vectors cut across all rows.)
> Drill, however, does not reflect this symmetry: that a row and a map are both 
> tuples. There are no common abstractions for the two. Instead, maps are 
> represented as a {{MapVector}} that contains a (name, vector) map for its 
> children.
> Because of this name-based mapping, high-speed indexed access to vectors is 
> not provided "out of the box." Certainly each consumer of a map can build its 
> own indexing mechanism. But, this leads to code complexity and redundancy.
> This ticket asks to rationalize Drill's row, map and schema abstractions 
> around the tuple concept. A schema is a description of a tuple and should (as 
> in JDBC) provide both name and index based access. That is, provide methods 
> of the form:
> {code}
> MaterializedField getField(int index);
> MaterializedField getField(String name);
> ...
> ValueVector getVector(int index);
> ValueVector getVector(String name);
> {code}
> Provide a common abstraction for rows and maps, recognizing their structural 
> similarity.
> There is an obvious issue with indexing columns in a row when the row 
> contains maps. Should indexing be multi-part (index into row, then into map) 
> as today? A better alternative is to provide a flattened interface:
> {code}
> 0: a, 1: b.x, 2: b.y, 3: c, ...
> {code}
> Use this change to simplify client code, over time, to use simple 
> index-based column access.
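
The flattened numbering proposed at the end of the issue description could be computed with a depth-first walk over the nested schema. The sketch below is only an illustration under assumptions: {{FieldNode}} and {{FlattenSketch}} are hypothetical names, a field with no children is treated as a scalar (so an empty map is not modeled), and nested names are joined with a dot:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical schema node: a name plus zero or more child fields. */
final class FieldNode {
    final String name;
    final List<FieldNode> children; // empty for scalar columns

    FieldNode(String name, FieldNode... children) {
        this.name = name;
        this.children = Arrays.asList(children);
    }
}

final class FlattenSketch {
    /**
     * Returns the dotted paths of all scalar columns in depth-first order.
     * The position of a path in the returned list is its flattened index.
     */
    static List<String> flatten(List<FieldNode> tuple) {
        List<String> out = new ArrayList<>();
        collect("", tuple, out);
        return out;
    }

    private static void collect(String prefix, List<FieldNode> tuple, List<String> out) {
        for (FieldNode field : tuple) {
            String path = prefix.isEmpty() ? field.name : prefix + "." + field.name;
            if (field.children.isEmpty()) {
                out.add(path);                      // scalar: takes the next flat index
            } else {
                collect(path, field.children, out); // map: recurse into its members
            }
        }
    }
}
```

For a row with columns {{a}}, {{b}} (a map with members {{x}} and {{y}}) and {{c}}, the walk yields the paths {{a}}, {{b.x}}, {{b.y}}, {{c}} at indexes 0 through 3, matching the example in the description.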



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
