[jira] [Comment Edited] (DRILL-5376) Rationalize Drill's row structure for simpler code, better performance

Jinfeng Ni (JIRA) Thu, 23 Mar 2017 09:45:07 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-5376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15938739#comment-15938739
 ]


Jinfeng Ni edited comment on DRILL-5376 at 3/23/17 4:44 PM:
------------------------------------------------------------

I'm not fully convinced this is the right idea until I see a prototype 
showing the advantage of a row-based structure over a column-based structure. 
Regarding name-based vs. index-based access: it's true that Drill execution 
uses a name-based approach instead of the index- or position-based approach 
commonly used in traditional RDBMSs. That's because the schema can differ 
between inputs, in column order or in additional columns, and the name-based 
approach is designed to handle that. For instance, suppose I have two JSON 
files with the records shown below. The query 
"select A,B from dfs.`/path/to/jsonfiles`" will work using the name-based 
approach. I'm not clear how it would work for position-based execution in 
your row-based structure. 

{code}
{"A" : "foo1",
 "B" : "foo2"
}
{"B" : "foo3",
 "A" : "foo4"
}
{code}
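
To make the concern concrete, here is a minimal sketch in plain Java. It is not Drill code: the class {{NameVsPositionDemo}} and its methods are hypothetical names invented for this illustration, and each record is modeled as an insertion-ordered map. It only shows why a reader that trusts field position picks up the wrong value when the second file lists B before A.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names are Drill classes.
final class NameVsPositionDemo {

    // Builds a two-field record that keeps its fields in arrival order,
    // the way a JSON reader would encounter them.
    static Map<String, String> record(String k1, String v1, String k2, String v2) {
        Map<String, String> rec = new LinkedHashMap<>();
        rec.put(k1, v1);
        rec.put(k2, v2);
        return rec;
    }

    // Position-based access: take the i-th field, whatever it is called.
    static String byPosition(Map<String, String> rec, int index) {
        return new ArrayList<>(rec.values()).get(index);
    }

    // Name-based access: look the column up by name, independent of order.
    static String byName(Map<String, String> rec, String name) {
        return rec.get(name);
    }

    public static void main(String[] args) {
        Map<String, String> first = record("A", "foo1", "B", "foo2");
        Map<String, String> second = record("B", "foo3", "A", "foo4");

        // Column A resolved by name in both records.
        System.out.println(byName(first, "A") + ", " + byName(second, "A"));
        // Position 0 in both records; in the second record this is column B.
        System.out.println(byPosition(first, 0) + ", " + byPosition(second, 0));
    }
}
```

A position-based design would need some per-file remapping step to get the same result as the name lookup here.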

One point regarding the efficiency of the name-based approach: name-based 
resolution happens only at the batch level, not at the row level, and only 
when there is a new schema. If the schema remains the same for upcoming 
batches, name-based resolution does not have to happen again. 
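
A rough sketch of that idea, again in plain Java and not taken from Drill's code base: {{BatchResolver}}, its fields, and the {{resolutions}} counter are all hypothetical names. The resolver maps the projected column names to positions once per new schema and reuses the result for every later batch with the same schema, so row-level processing can use plain integer positions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; these are not Drill's actual classes.
final class BatchResolver {
    private final List<String> projected;  // columns the operator wants, e.g. [A, B]
    private List<String> lastSchema;       // column names of the last batch seen
    private int[] positions;               // resolved position of each projected column
    int resolutions = 0;                   // how many times name lookup actually ran

    BatchResolver(List<String> projected) {
        this.projected = new ArrayList<>(projected);
    }

    // Called once per batch, never per row.
    int[] resolve(List<String> batchSchema) {
        // Name lookup runs only when the incoming schema differs from the last one.
        if (!batchSchema.equals(lastSchema)) {
            positions = new int[projected.size()];
            for (int i = 0; i < positions.length; i++) {
                // indexOf returns -1 when the batch lacks the column.
                positions[i] = batchSchema.indexOf(projected.get(i));
            }
            lastSchema = new ArrayList<>(batchSchema);
            resolutions++;
        }
        return positions;
    }
}
```

With projected columns [A, B], two consecutive batches with schema (A, B) trigger one resolution; a following batch with schema (B, A) triggers a second one and yields positions [1, 0].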






was (Author: jni):
I'm not fully convinced this is the right idea, until I see some prototype 
showing the advantage of row based structure over column based structure. For 
one thing regarding name based vs index based, it's true that Drill execution 
used name based approach, in stead of index or position based approach which is 
commonly used in traditional RDBMS. That's because schema could be different, 
in the sense of column order, additional columns, and the name based approach 
is designed to handle that. For instance, if I have two json files.   The query 
"select A,B from dfs.`/path/to/jsonfiles` will work using named based approach. 
I'm not clear how it would work for position-based execution in your row-based 
structure. 

{code}
{"A" : "foo1",
 "B" " "foo2"
}
{"B" : "foo3",
 "A" : "foo4"
}
{code}

One point regarding the efficiency of name based approach: the name-based 
resolution only happens at batch level, not at row level, and the name-based 
resolution only happens when there is a new schema. If the schema remains same 
for upcoming batches, name-based resolution does not have to happen. 





> Rationalize Drill's row structure for simpler code, better performance
> ----------------------------------------------------------------------
>
>                 Key: DRILL-5376
>                 URL: https://issues.apache.org/jira/browse/DRILL-5376
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>
> Drill is a columnar system, but data is ultimately represented as rows (AKA 
> records or tuples.) The way that Drill represents rows leads to excessive 
> code complexity and runtime cost.
> Data in Drill is stored in vectors: one (or more) per column. Vectors do not 
> stand alone, however; they are "bundled" into various forms of grouping: the 
> {{VectorContainer}}, {{RecordBatch}}, {{VectorAccessible}}, 
> {{VectorAccessibleSerializable}}, and more. Each has slightly different 
> semantics, requiring large amounts of code to bridge between the 
> representations.
> Consider only a simple row: one with only scalar columns. In classic 
> relational theory, such a row is a tuple:
> {code}
> R = (a, b, c, d, ...)
> {code}
> A tuple is defined as an ordered list of column values. Unlike a list or 
> array, the column values also have names and may have varying data types.
> In SQL, columns are referenced by either position or name. In most execution 
> engines, columns are referenced by position (since positions, in most 
> systems, cannot change). A 1:1 mapping is provided between names and 
> positions. (See the JDBC {{ResultSet}} interface.)
> This allows code to be very fast: code references columns by index, not by 
> name, avoiding name lookups for each column reference.
> Drill provides a murky, hybrid approach. Some structures ({{BatchSchema}}, 
> for example) appear to provide a fixed column ordering, allowing indexed 
> column access. But other abstractions provide only an iterator. Others (such 
> as {{VectorContainer}}) provide name-based access or, by clever programming, 
> indexed access.
> As a result, it is never clear exactly how to quickly access a column: by 
> name, or by name to multi-part index to vector?
> Of course, Drill also supports maps, which add to the complexity. First, we 
> must understand that a "map" in Drill is not a "map" in the classic sense. It 
> is not a collection of (name, value) pairs in the JSON sense, that is, a 
> collection in which each instance may have a different set of pairs.
> Instead, in Drill, a "map" is really a nested tuple: a map has the same 
> structure as a Drill record: a collection of names and values in which all 
> rows have the same structure. (This is so because maps are really a 
> collection of value vectors, and the vectors cut across all rows.)
> Drill, however, does not reflect this symmetry: that a row and a map are both 
> tuples. There are no common abstractions for the two. Instead, maps are 
> represented as a {{MapVector}} that contains a (name, vector) map for its 
> children.
> Because of this name-based mapping, high-speed indexed access to vectors is 
> not provided "out of the box." Certainly each consumer of a map can build its 
> own indexing mechanism. But, this leads to code complexity and redundancy.
> This ticket asks to rationalize Drill's row, map and schema abstractions 
> around the tuple concept. A schema is a description of a tuple and should (as 
> in JDBC) provide both name and index based access. That is, provide methods 
> of the form:
> {code}
> MaterializedField getField(int index);
> MaterializedField getField(String name);
> ...
> ValueVector getVector(int index);
> ValueVector getVector(String name);
> {code}
> Provide a common abstraction for rows and maps, recognizing their structural 
> similarity.
> There is an obvious issue with indexing columns in a row when the row 
> contains maps. Should indexing be multi-part (index into row, then into map) 
> as today? A better alternative is to provide a flattened interface:
> {code}
> 0: a, 1: b.x, 2: b.y, 3: c, ...
> {code}
> Use this change to simplify client code, over time, to use simple 
> index-based column access.
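
As an illustration of the flattened interface proposed in the description above, the following is a minimal plain-Java sketch. It is not an existing Drill API: {{FlatTupleSchema}} and its methods are hypothetical, and a nested {{java.util.Map}} stands in for a Drill map column. It assigns one flat index per leaf column, in declaration order, and offers lookup both by index and by dotted name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a flattened tuple schema; not an existing Drill API.
final class FlatTupleSchema {
    private final List<String> names = new ArrayList<>();
    private final Map<String, Integer> indexByName = new HashMap<>();

    // A nested Map value stands for a map column (a nested tuple);
    // any other value stands for a scalar column.
    FlatTupleSchema(Map<String, Object> columns) {
        flatten("", columns);
    }

    // Depth-first walk that gives each leaf column the next free index.
    private void flatten(String prefix, Map<String, Object> columns) {
        for (Map.Entry<String, Object> e : columns.entrySet()) {
            String path = prefix + e.getKey();
            if (e.getValue() instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, Object> child = (Map<String, Object>) e.getValue();
                flatten(path + ".", child);
            } else {
                indexByName.put(path, names.size());
                names.add(path);
            }
        }
    }

    // Index-based access: flat position to dotted column name.
    String name(int index) {
        return names.get(index);
    }

    // Name-based access: dotted column name to flat position, or -1 if unknown.
    int index(String name) {
        Integer i = indexByName.get(name);
        return i == null ? -1 : i;
    }

    int size() {
        return names.size();
    }
}
```

Built from an insertion-ordered map holding a, b (itself a map of x and y), and c, this yields the layout 0: a, 1: b.x, 2: b.y, 3: c from the description. How such flat indexes would behave when a batch arrives with a different column order is exactly the open question raised in the comment.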



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
