You can mistype a column that you are sorting on or joining on and get the same problem. Vicky filed a bug for a subquery that appeared to be returning wrong results: one of the columns she was trying to refer to was not available within that scope, and we happily filled it in with all null values. Unfortunately this is not a trivial problem to solve, but it is also not impossible to provide a better user experience here.
There are a couple of considerations from a dev perspective that I think need to be discussed here. We currently lack a way to record a list of warnings as a query executes that can be shown alongside a result set. Even if we decide that this situation is not best solved with warnings, we will eventually need such a feature if we ever implement something like tunable exception suppression. That feature has been brought up on the list a few times, and would allow a small percentage of malformed data to be ignored. In those cases I would expect Drill to not only return results based on the properly formed data, but also report how much was ignored or discarded.

Once we have a way to report warnings to the user, we also need to define when to issue them. The major justification for the current behavior is that we optimistically assume we may run into the field in a later record, both in formats that can change schema, like JSON, and in static-schema formats like Parquet, where we may be reading files with different schemas. A simple "solution" would be to turn off our behavior of materializing nulls into columns that we have not yet read out of any file. This materialization currently happens at a project or filter, as we materialize expressions to evaluate against the column. Making such a simple change, however, would break the legitimate case where the schema is evolving reasonably, in particular the case of adding a new column. The old files will lack it, and batches missing the column may arrive for expression evaluation first, with batches that have the column arriving later. This case should continue to work. There are other discussions happening right now about how this isn't working in all cases, but those are bugs that should be fixed soon, because the current behavior does not function as we expect it to today.
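Before deciding exactly when to warn, the plumbing for collecting warnings could be fairly small. Here is a minimal, hypothetical sketch (QueryWarnings is an invented name, not an existing Drill class): a per-query list that operators append to while the query runs, which the client could then render alongside the result set.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: a per-query warning list that operators can append to
// during execution, to be surfaced alongside the result set. Not Drill's
// actual API.
public class QueryWarnings {
    private final List<String> warnings = new ArrayList<>();

    // Operators may run on multiple threads, so appends are synchronized.
    public synchronized void add(String warning) {
        warnings.add(warning);
    }

    // Returns an immutable copy safe to hand to the client alongside results.
    public synchronized List<String> snapshot() {
        return Collections.unmodifiableList(new ArrayList<>(warnings));
    }

    public static void main(String[] args) {
        QueryWarnings w = new QueryWarnings();
        w.add("Column 'ss_stoe_sk' was not found in any input file; nulls were substituted.");
        for (String s : w.snapshot()) {
            System.out.println(s);
        }
    }
}
```

For JDBC clients, something like this could eventually map onto the standard `java.sql.SQLWarning` chain rather than a custom channel.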
The more correct solution would be to hold state for each column we materialize these phantom nulls into, keeping track of whether we ever see data coming back from the scan in those columns. For those that never see real data, we could give a warning message saying that none of the input files had any data in columns x, y and z.

This gets a little more complicated with other operators. For example, if we had a union all over scans of two separate directories, what would be the expected behavior if one of the streams always lacked a column, but the other one did have it? Should we issue a warning in this case? There is also the consideration of how this interacts with the behavior of the file readers. In JSON, we consider a non-existent field to be the same as a field with an explicit null value. If a user were to read a JSON file that contained only nulls in a few of the columns, they would receive a warning that a particular column was not found in the file. This could also be confusing, but I think it is a much smaller problem than the cases where we appear to return correct results when the user made a simple mistake or misunderstood how Drill works, and we didn't try to make them aware of it.

On Tue, Jun 23, 2015 at 12:57 PM, Abhishek Girish <[email protected]> wrote:

> Hey all,
>
> I observed an issue while working with Window Functions. I observed a case
> where wrong results are returned from Drill.
>
> In-case of weak schema such as parquet, Drill does not validate column
> names. It is understandable when only part of the projection list in a
> query. But when part of a Window Function, the results displayed are wrong,
> and at times hard to identify the cause.
>
> Two examples below:
>
> SELECT PERCENT_RANK() OVER (PARTITION BY s.store_sk, s.ss_customer_sk
> ORDER BY s.store_sk, s.ss_customer_sk) FROM store_sales s LIMIT 2;
> +---------+
> | EXPR$0 |
> +---------+
> | 0.0 |
> | 0.0 |
> +---------+
> 2 rows selected (7.116 seconds)
>
> SELECT CUME_DIST() OVER (PARTITION BY s.ss_store_sk ORDER BY s.ss_stoe_sk,
> s.s_customr_sk) FROM store_sales s LIMIT 2;
> +---------+
> | EXPR$0 |
> +---------+
> | 1.0 |
> | 1.0 |
> +---------+
> 2 rows selected (8.361 seconds)
>
> In both cases above, some columns do not exist.
>
> With normal aggregate functions, it is similar to having a non-existent
> column in projection list. Drill prints a column of null rows. This could
> still be documented for users to expect "null" columns in results when
> non-existent columns are part of a projection list.
>
> SELECT s.ss_store_sk, avg (ssdfd), ssdfd FROM store_sales s GROUP BY
> s.ss_store_sk, ssdfd LIMIT 2;
> +--------------+---------+--------+
> | ss_store_sk | EXPR$1 | ssdfd |
> +--------------+---------+--------+
> | 10 | null | null |
> | 4 | null | null |
> +--------------+---------+--------+
> 2 rows selected (1.252 seconds)
>
> But in case of window functions (and maybe other functions & expressions),
> the results might look more real and hence difficult to identify that the
> query had typos. Worse, users may trust the data returned from Drill, which
> they shouldn't have.
>
> Postgres:
>
> # SELECT CUME_DIST() OVER (PARTITION BY s.ss_store_sk ORDER BY
> s.ss_store_sk, s.ss_customer_sk) FROM store_sales s LIMIT 2;
> cume_dist
> ----------------------
> 3.06415464350749e-05
> 3.06415464350749e-05
> (2 rows)
>
> # SELECT PERCENT_RANK() OVER (PARTITION BY s.store_sk, s.ss_customer_sk
> ORDER BY s.store_sk, s.ss_customer_sk) FROM store_sales s LIMIT 2;
>
> ERROR: column s.store_sk does not exist
>
> LINE 1: ...ARTITION BY s.store_sk, s.ss_customer_sk ORDER BY s.store_sk...
> ^
>
> I think Drill at minimum should throw a warning message when it encounters
> a non-existent column. And ideally queries must fail when non-existent
> columns are part of any function/expression.
>
> I'll file a JIRA if it is agreed to be an issue.
>
> Regards,
> Abhishek
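To make the per-column tracking idea from my reply above concrete, a rough, hypothetical Java sketch (PhantomColumnTracker and its method names are invented for illustration, not Drill internals): it records whether any scanned batch ever supplied real data for a projected column, so the never-populated ones can be reported in a warning at the end of the query.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Drill's actual API: track which projected columns
// only ever received materialized ("phantom") nulls during a scan.
public class PhantomColumnTracker {
    // Column name -> whether real data was ever seen for it.
    private final Map<String, Boolean> seenRealData = new LinkedHashMap<>();

    // Called when a project/filter materializes nulls for a missing column.
    public void registerPhantom(String column) {
        seenRealData.putIfAbsent(column, false);
    }

    // Called when a later batch from the scan actually contains the column.
    public void markSeen(String column) {
        seenRealData.put(column, true);
    }

    // Columns that never saw real data: candidates for a warning at query end.
    public List<String> neverSeen() {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Boolean> e : seenRealData.entrySet()) {
            if (!e.getValue()) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        PhantomColumnTracker tracker = new PhantomColumnTracker();
        tracker.registerPhantom("ss_store_sk"); // later batches do have it
        tracker.registerPhantom("ss_stoe_sk");  // typo: never present anywhere
        tracker.markSeen("ss_store_sk");
        // Would feed a warning like: none of the input files had data in these columns
        System.out.println("Never saw data: " + tracker.neverSeen());
        // prints: Never saw data: [ss_stoe_sk]
    }
}
```

For the union-all question above, one tracker per input stream would let us choose separately whether a per-stream or a whole-query warning is the right user experience.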
