Paul Rogers created DRILL-7545:
----------------------------------

             Summary: Projection ambiguities in complex types
                 Key: DRILL-7545
                 URL: https://issues.apache.org/jira/browse/DRILL-7545
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.17.0
            Reporter: Paul Rogers


Summarized from an e-mail chain on the dev mailing list:

We recently introduced the DICT type. We also added the EVF framework. We have 
a bit of code which parses the projection list, then checks if a column from a 
reader is consistent with projection. The idea is to ensure that the columns 
produced by a Scan will be valid when a Project later tries to use them with 
the given project list. And, if the Scan says it can support Project-push-down, 
then the Scan is obligated to do the full check.

First we'll explain how I'll solve the projection problem given your 
explanation. Then we'll point out three potential ambiguities. Thanks to Bohdan 
for his explanations.

The problems here are not due to any one person. As explained below, they are 
due to trying to add concepts into SQL which SQL is not well-suited to support.

h4. Projection for DICT Types

Queries go through two major steps: planing and execution. At the planning 
stage we use SQL syntax for the project list. For example:

{code:sql}
explain plan for SELECT a, e.`map`.`member`, `dict`['key'], `array`[10]  FROM 
cp.`employee.json` e
{code}

The planner sends an execution plan to operators. The project list appears in 
JSON. For the above:

{code:json}
   "columns" : [ "`a`", "`map`.`member`", "`dict`.`key`", "`array`[10]" ],
{code}

We see that the JSON works as Bohdan described:

* The SQL map "map.member" syntax is converted to "`map`.`member`" in the JSON 
plan.
* The SQL DICT "`dict`['key']" syntax is converted to a form identical to maps: 
"`dict`.`key`".
* The SQL DICT/array "`array`[10]" syntax is converted to "`array`[10]" in JSON.

That is, on the execution side, we can't tell the difference between a MAP and 
a DICT request. We also can't tell the difference between an Array and DICT 
request. Apparently, because of this, the Schema Path parser does not recognize 
DICT syntax.

Given the way projection works, "a.b" and "a['b']" are identical: either works 
for both a map or a DICT with VARCHAR keys. That is, we just say that map and 
array projection are both compatible with a DICT column?

h4. Projection Checking in Scan

Mentioned above is that a Scan that supports Project-push-down must ensure that 
the output columns match the projection list. Doing that check is quite easy 
when the projection is simple: `a`. The column `a` can match a data column `a` 
of any type.

The task is bit harder when the projection is an array `a[0]`. Since this now 
means either array or DICT with an INT key, this projected column can match:

* Any REPEATED type
* A LIST
* A non-REPEATED DICT with INT, BIGINT, SMALLINT or TINYINT keys (ignoring the 
UINTx types)
* A REPEATED DICT with any type of key
* A UNION (because a union might contain a repeated type)

We can also handle a map projection: `a.b` which matches:

* A (possibly repeated) map
* A (possibly repeated) DICT with VARCHAR keys
* A UNION (because a union might contain a possibly-repeated map)
* A LIST (because the list can contain a union which might contain a 
possibly-repeated map)

Things get very complex indeed when we have multiple qualifiers such as 
`a[0][1].b` which matches:

* A LIST that contains a repeated map
* A REPEATED LIST that contains a (possibly-repeated) map
* A DICT with an INT key that has a value of a repeated map
* A REPEATED DICT that contains an INT key that contains a MAP
* (If we had sufficient metadata) A LIST that contains a REPEATED DICT with a 
VARCHAR key.

h4. DICT Projection Ambiguities

The DICT type introduces an ambiguity. Note above that `a.b` can refer to 
either a REPEATED or non-REPEATED MAP. If non-repeated, `a.b` means to get the 
one value for member `b` of map `a`. But, if the map is REPEATED, this means to 
project an array of `b` values obtained from the array of maps.

For a DICT, there is an ambiguity with `a[0][1]` if the DICT is repeated DICT 
of INT keys and REPEATED BIGINT values: that is ARRAY<DICT<INT, 
ARRAY<BIGINT>>>. Does `a[0][1]` mean to pull out the 0th element of the 
REPEATED DICT, then lookup where the key == 1? Or, does it mean to pull out all 
the DICT array values where the key == 0 and then pull out the 1st value of the 
INT array? That is, because we have an implied (in all members of the array) 
syntax, one can interpret this case as:

{noformat}
repeatedDict[0].valueOf(1) --> ARRAY<BIGINT>
-- All the values in the key=1 array of element 0
{noformat}

or

{noformat}
repeatedDict.valueOf(0)[1] --> ARRAY<BIGINT>
-- All the values in the key=0, element 1 positions across all DICT elements
{noformat}

It would seem to make sense to prefer the first interpretation. Unfortunately, 
MAPs already use the second interpretation.

h4. UNION and LIST Ambiguities

Then there are our loosey-goosey types like UNION and LIST which become 
hyper-difficult: it is impossible to validate a projection against a type that 
contains anything. We just have to wait until execution time to see what 
happens to appear inside the type. These are even highly ambiguous:

Is `a[0].b` in a UNION a reference to:

* The 0th element of a LIST which contains a DICT with VARCHAR keys?
* The 0th element of a LIST which contains a MAP that contains column b?
* The 0th element of a REPEATED MAP that contains column b?
* The 0th element of a REPEATED DICT which contains a UNION which contains a 
MAP?
* The slice though a REPEATED LIST where a REPEATED DICT has a VARCHAR key of 
'b'.
* And so on...

That is, since we have no restrictions on what can go into a UNION or a LIST, 
it is possible for paths to be ambiguous if the UNION contains two subtypes 
that both support the requested path. At present our only recourse is to 
observe that no one is crazy enough to set up this scenario -- in part because 
UNION barely works. But, having a type system which is inherently ambiguous is 
not a Good Thing.

For now, I'm solving the UNION and LIST problem by doing nothing: these types 
don't fully work and no one uses them. So, I'll leave fixing this issue as an 
exercise for some future time.

h4. DICT Join Ambiguities

As I think about how DICT might work, I wonder about another case: where I want 
to do a lateral join of a DICT with its surrounding record. Given that key 
lookup looks the same as member lookup, how we differentiate the following two 
cases:

WHERE dict['key'] = 10 -- Filter

WHERE dict.key = parent.col -- Join between DICT keys and some other column

The answer is: we can't. We'd need different projection syntax to how that in 
the first case, 'key' is a key value, while in the second case, 'key' is the 
name of one of the two DICT fields.

And, if we change the syntax to differentiate map and DICT keys, then, I've got 
to rethink and recode the projection checking logic I just discussed.

h4. Solutions

In other engines, we have a known schema and we can just follow the projection 
path down through the types to see if the path is valid. Since the schema is 
known, the planner can do the projection planning. In Drill, things are 
backward; we generally don't know the shape of data until we've read it; then 
at some point, we have to match the projection against what we've read thus 
far. This leads to all kinds of special cases such as those we've been 
discussing.

Didn't we pick up DICT from the HIVE MAP type? How does HIVE handle this case 
(recognizing that HIVE has an up-front schema.) What path resolutions rules do 
you think we should use in Drill to avoid ambiguities?

This is, by the way, a reason that I think we've gone too far overboard in 
trying to fit complex types into SQL: the cases get ambiguous, SQL is not 
expressive enough to handle the complex references, and as a result our code 
gets exponentially complex and costly to maintain as we try to force-fit into 
SQL concepts that SQL is not designed to handle.

This stuff is *hard*; we're designing a complex data type and language system 
within a context that does not provide the required semantics.  We're leaving 
ambiguous (UNION, LIST) things which need to be defined for SQL to work. This 
is screaming at us: "we're doing it wrong!"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to