You can mistype a column that you are sorting on or joining on and get the same problem. Vicky filed a bug for a subquery that appeared to be returning wrong results: one of the columns she was trying to refer to was not available within that scope, and we happily filled it in with all null values. Unfortunately this is not a trivial problem to solve, but it is also not impossible to provide a better user experience here.
There are a couple of considerations from a dev perspective that I think need to be discussed here. We currently lack a way to record a list of warnings as a query executes that can be shown alongside a result set. Even if we decide that this situation is not best solved with warnings, we will eventually need such a feature if we ever implement something like tunable exception suppression. That feature has been brought up on the list a few times, and would allow a small percentage of malformed data to be ignored. In those cases I would expect Drill to not only return results based on the properly formed data, but also report how much was ignored or discarded.

Once we have a way to report warnings to the user, we also need to define when to issue them. The major justification for the current behavior is that we optimistically assume we may run into the field in a later record, both in formats that can change schema, like JSON, and in static-schema formats like Parquet, where we may be reading files with different schemas. A simple "solution" would be to turn off our behavior of materializing nulls into columns that we have not yet read out of any file. This materialization currently happens at a project or filter, as we materialize expressions to evaluate against the column. Making such a simple change, however, would break the legitimate case where the schema is evolving reasonably, in particular the case of adding a new column. The old files will lack it, and batches missing the column may arrive for expression evaluation first, with batches that have the column arriving later. This case should continue to work. There are other discussions happening right now about how this isn't working in all cases, but those are bugs that should be fixed soon, because the current behavior does not function as we expect it to today.
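Before deciding exactly when to warn, the plumbing for collecting warnings could be fairly small. Here is a minimal, hypothetical sketch (QueryWarnings is an invented name, not an existing Drill class): a per-query list that operators append to while the query runs, which the client could then render alongside the result set.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: a per-query warning list that operators can append to
// during execution, to be surfaced alongside the result set. Not Drill's
// actual API.
public class QueryWarnings {
    private final List<String> warnings = new ArrayList<>();

    // Operators may run on multiple threads, so appends are synchronized.
    public synchronized void add(String warning) {
        warnings.add(warning);
    }

    // Returns an immutable copy safe to hand to the client alongside results.
    public synchronized List<String> snapshot() {
        return Collections.unmodifiableList(new ArrayList<>(warnings));
    }

    public static void main(String[] args) {
        QueryWarnings w = new QueryWarnings();
        w.add("Column 'ss_stoe_sk' was not found in any input file; nulls were substituted.");
        for (String s : w.snapshot()) {
            System.out.println(s);
        }
    }
}
```

For JDBC clients, something like this could eventually map onto the standard `java.sql.SQLWarning` chain rather than a custom channel.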
The more correct solution would be to hold state for each column we materialize these phantom nulls into, keeping track of whether we ever see data coming back from the scan in those columns. For those that never see real data, we could give a warning message saying that none of the input files had any data in columns x, y and z.

This gets a little more complicated with other operators. For example, if we had a union all over scans of two separate directories, what would be the expected behavior if one of the streams always lacked a column, but the other one did have it? Should we issue a warning in this case? There is also the consideration of how this interacts with the behavior of the file readers. In JSON, we consider a non-existent field to be the same as a field with an explicit null value. If a user were to read a JSON file that contained only nulls in a few of the columns, they would receive a warning that a particular column was not found in the file. This could also be confusing, but I think it is a much smaller problem than the cases where we appear to return correct results when the user made a simple mistake or misunderstood how Drill works, and we didn't try to make them aware of it.

On Tue, Jun 23, 2015 at 12:57 PM, Abhishek Girish <[email protected]> wrote:

> Hey all,
>
> I observed an issue while working with Window Functions. I observed a case
> where wrong results are returned from Drill.
>
> In-case of weak schema such as parquet, Drill does not validate column
> names. It is understandable when only part of the projection list in a
> query. But when part of a Window Function, the results displayed are wrong,
> and at times hard to identify the cause.
>
> Two examples below:
>
> SELECT PERCENT_RANK() OVER (PARTITION BY s.store_sk, s.ss_customer_sk
> ORDER BY s.store_sk, s.ss_customer_sk) FROM store_sales s LIMIT 2;
> +---------+
> | EXPR$0 |
> +---------+
> | 0.0 |
> | 0.0 |
> +---------+
> 2 rows selected (7.116 seconds)
>
> SELECT CUME_DIST() OVER (PARTITION BY s.ss_store_sk ORDER BY s.ss_stoe_sk,
> s.s_customr_sk) FROM store_sales s LIMIT 2;
> +---------+
> | EXPR$0 |
> +---------+
> | 1.0 |
> | 1.0 |
> +---------+
> 2 rows selected (8.361 seconds)
>
> In both cases above, some columns do not exist.
>
> With normal aggregate functions, it is similar to having a non-existent
> column in projection list. Drill prints a column of null rows. This could
> still be documented for users to expect "null" columns in results when
> non-existent columns are part of a projection list.
>
> SELECT s.ss_store_sk, avg (ssdfd), ssdfd FROM store_sales s GROUP BY
> s.ss_store_sk, ssdfd LIMIT 2;
> +--------------+---------+--------+
> | ss_store_sk | EXPR$1 | ssdfd |
> +--------------+---------+--------+
> | 10 | null | null |
> | 4 | null | null |
> +--------------+---------+--------+
> 2 rows selected (1.252 seconds)
>
> But in case of window functions (and maybe other functions & expressions),
> the results might look more real and hence difficult to identify that the
> query had typos. Worse, users may trust the data returned from Drill, which
> they shouldn't have.
>
> Postgres:
>
> # SELECT CUME_DIST() OVER (PARTITION BY s.ss_store_sk ORDER BY
> s.ss_store_sk, s.ss_customer_sk) FROM store_sales s LIMIT 2;
> cume_dist
> ----------------------
> 3.06415464350749e-05
> 3.06415464350749e-05
> (2 rows)
>
> # SELECT PERCENT_RANK() OVER (PARTITION BY s.store_sk, s.ss_customer_sk
> ORDER BY s.store_sk, s.ss_customer_sk) FROM store_sales s LIMIT 2;
>
> ERROR: column s.store_sk does not exist
>
> LINE 1: ...ARTITION BY s.store_sk, s.ss_customer_sk ORDER BY s.store_sk...
> ^
>
> I think Drill at minimum should throw a warning message when it encounters
> a non-existent column. And ideally queries must fail when non-existent
> columns are part of any function/expression.
>
> I'll file a JIRA if it is agreed to be an issue.
>
> Regards,
> Abhishek
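To make the per-column tracking idea from my reply above concrete, a rough, hypothetical Java sketch (PhantomColumnTracker and its method names are invented for illustration, not Drill internals): it records whether any scanned batch ever supplied real data for a projected column, so the never-populated ones can be reported in a warning at the end of the query.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Drill's actual API: track which projected columns
// only ever received materialized ("phantom") nulls during a scan.
public class PhantomColumnTracker {
    // Column name -> whether real data was ever seen for it.
    private final Map<String, Boolean> seenRealData = new LinkedHashMap<>();

    // Called when a project/filter materializes nulls for a missing column.
    public void registerPhantom(String column) {
        seenRealData.putIfAbsent(column, false);
    }

    // Called when a later batch from the scan actually contains the column.
    public void markSeen(String column) {
        seenRealData.put(column, true);
    }

    // Columns that never saw real data: candidates for a warning at query end.
    public List<String> neverSeen() {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Boolean> e : seenRealData.entrySet()) {
            if (!e.getValue()) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        PhantomColumnTracker tracker = new PhantomColumnTracker();
        tracker.registerPhantom("ss_store_sk"); // later batches do have it
        tracker.registerPhantom("ss_stoe_sk");  // typo: never present anywhere
        tracker.markSeen("ss_store_sk");
        // Would feed a warning like: none of the input files had data in these columns
        System.out.println("Never saw data: " + tracker.neverSeen());
        // prints: Never saw data: [ss_stoe_sk]
    }
}
```

For the union-all question above, one tracker per input stream would let us choose separately whether a per-stream or a whole-query warning is the right user experience.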
