[
https://issues.apache.org/jira/browse/DRILL-5822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235100#comment-16235100
]
Paul Rogers commented on DRILL-5822:
------------------------------------
[~vitalii], thanks for the explanation. I was hoping for a bit more of a
conceptual overview: what is our overall approach to handling column order? The
answer referred to specific implementations. Without a design, we just
ping-pong back and forth among various code solutions.
In Drill, column order does not matter. At least, that is what I've been told
by the veterans. That is, (a, b) and (b, a) are the same schema. Code
generation uses names to find columns, not column indexes as in most DB systems.
Given this, we need a policy. One policy would be that we preserve project list
ordering created by the planner. That works, except in the case of
{{SELECT *}}. The standard for SQL is that a {{SELECT *}} query preserves the
column order in the table. Fair enough.
But, in a distributed system, each table may have a different column order;
especially in files such as JSON that use key/value pairs. So, there is no
"right" order. Then what do we do?
We can make up an order (sort columns alphabetically, as the prior code did).
But, this will be surprising to users if their CSV file, say, has columns ID,
Address, Block and we produce an output of Address, Block, ID.
Another rule might be to preserve column order where possible, but when a
conflict occurs, choose a "first" batch and coerce others to match that. If the
merging receiver (which sees n batches in no real order) gets batch 3 first,
then 3 becomes the template and other batches are coerced to match.
If the Sort, say, gets batches sequentially, then the first one is the template
and others are coerced to match.
Makes sense? Good. Now, what about RecordBatchLoader? By itself, it can't do
the job. It needs help.
On the first batch, it can save an ordering. On the second batch, it can:
* Pick out columns that match its existing schema.
* If a column from the prior schema is missing from the new batch, it can fill
in nulls (as long as the prior column was nullable or an array).
* If the missing column was required, or a new column appears, a hard schema
change must occur.
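The rules above can be sketched in Java roughly as follows. This is an illustration under stated assumptions, not the actual RecordBatchLoader code: the {{SchemaSmoother}} class, its {{Mode}} and {{Outcome}} enums, and the choice to adopt the new schema as the template after a hard change are all hypothetical. A change in a column's mode is also treated as a hard change here, which the rules above do not address.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and behavior on hard change are assumptions.
final class SchemaSmoother {
  enum Mode { REQUIRED, OPTIONAL, REPEATED }
  enum Outcome { PRIMED, SMOOTHED, HARD_CHANGE }

  /** Column order and modes saved from the first batch; null until primed. */
  private LinkedHashMap<String, Mode> template;

  /** Pass an insertion-ordered map so the first batch's column order is kept. */
  Outcome load(Map<String, Mode> incoming) {
    if (template == null) {
      template = new LinkedHashMap<>(incoming);
      return Outcome.PRIMED;
    }
    if (!canSmooth(incoming)) {
      template = new LinkedHashMap<>(incoming); // assumption: restart from the new schema
      return Outcome.HARD_CHANGE;
    }
    return Outcome.SMOOTHED; // output keeps the saved template order
  }

  private boolean canSmooth(Map<String, Mode> incoming) {
    for (String name : incoming.keySet()) {
      if (!template.containsKey(name)) {
        return false; // a new column appeared
      }
    }
    for (Map.Entry<String, Mode> prior : template.entrySet()) {
      Mode seen = incoming.get(prior.getKey());
      if (seen == null) {
        if (prior.getValue() == Mode.REQUIRED) {
          return false; // a required column cannot be filled with nulls
        }
        // nullable or array column: would be filled with nulls or empty arrays
      } else if (seen != prior.getValue()) {
        return false; // assumption: a mode change is a hard change
      }
    }
    return true;
  }

  /** The column order the surrounding operator would see. */
  List<String> columnOrder() {
    return new ArrayList<>(template.keySet());
  }
}
```

A batch that only reorders columns, or that omits a nullable column, returns {{SMOOTHED}} and leaves {{columnOrder()}} unchanged.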
The result is that the batch loader absorbs trivial schema changes. I call this
"schema smoothing." But, it alerts the surrounding operator to larger issues.
Now, what about the merging receiver? The algorithm might be this:
* Start with the first batch. This "primes" the batch loader.
* Visit the second batch. The batch loader "smooths" the schema as described
above.
* Continue with the third, and so on.
* If, for any batch, a hard schema change occurs, we fail the query.
(Note that we could actually handle the schema change as described below; but
we'd end up with a very large number of very small batches and a very large
number of schema changes. Until Drill has a design for schema change, there is
really no point in adding this complexity.)
If we do the above, then we don't need the check in the new code that compares
all batches up front. Instead, we do the comparison one by one as we convert
them from wire to in-memory format using the record batch loader.
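A simplified Java sketch of that one-by-one approach follows. It is an assumption-laden illustration, not the merging receiver's code: {{MergingReceiverSketch}} is a hypothetical name, a batch is modeled as an ordered map from column name to an {{int[]}} of values, and the null-filling rule is left out, so any difference in the set of columns counts as a hard change.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; batches are modeled as ordered maps of column name -> values.
final class MergingReceiverSketch {

  /** Loads batches in arrival order, coercing each to the first batch's column order. */
  static List<Map<String, int[]>> loadAll(List<Map<String, int[]>> arrivals) {
    List<String> template = null;
    List<Map<String, int[]>> loaded = new ArrayList<>();
    for (Map<String, int[]> batch : arrivals) {
      if (template == null) {
        // Whichever batch arrives first "primes" the column order.
        template = new ArrayList<>(batch.keySet());
      } else if (batch.size() != template.size()
          || !batch.keySet().containsAll(template)) {
        // Checked per batch, as it is converted; no up-front comparison is needed.
        throw new IllegalStateException(
            "Hard schema change in batch " + loaded.size() + "; failing the query");
      }
      Map<String, int[]> coerced = new LinkedHashMap<>();
      for (String name : template) {
        coerced.put(name, batch.get(name));
      }
      loaded.add(coerced);
    }
    return loaded;
  }
}
```

If the arrivals are (a, b) followed by (b, a), both come out in (a, b) order; a later batch with columns (a, c) makes the call throw.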
OK, that's the merging receiver. What about the union receiver? It can just
send a schema change event each time the schema changes. This could be noisy;
we might get a schema change on every batch. If we have three senders, X, Y and
Z, and each has a different schema, then we would get a stream something like
X1, Y1, Z1, X2, Y2, Z2 and we'd have a schema change with each one. Oh well.
But, if the batches simply differ in column order, the schema "smoothing"
described above will kick in and we get no schema changes.
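The difference can be sketched by counting schema change events over an interleaved stream. This Java example is an illustration only: {{UnionReceiverSketch}} is a hypothetical name, a schema is modeled as a plain list of column names, and only changes after the first batch are counted.

```java
import java.util.HashSet;
import java.util.List;

// Hypothetical sketch; a schema is modeled as a list of column names.
final class UnionReceiverSketch {

  /** Counts schema changes after the first batch in an interleaved stream. */
  static int countSchemaChanges(List<List<String>> stream, boolean ignoreColumnOrder) {
    int changes = 0;
    Object previous = null;
    for (List<String> columns : stream) {
      // With smoothing, (a, b) and (b, a) compare equal; without it, they differ.
      Object current = ignoreColumnOrder ? new HashSet<>(columns) : columns;
      if (previous != null && !previous.equals(current)) {
        changes++;
      }
      previous = current;
    }
    return changes;
  }
}
```

For a stream X1, Y1, Z1, X2, Y2, Z2 where X = (a, b, c), Y = (b, a, c) and Z = (c, b, a), the order-sensitive count is 5 and the order-insensitive count is 0. If the three senders have genuinely different columns, the count is 5 either way.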
This same logic can be applied to each operator where we might have a problem.
Note that the schema smoothing algorithm described above is not just a theory.
It is actually implemented, tested, and working in the "batch size control"
project. Here we are just reimplementing it in multiple places due to the
extraordinary delays in getting large code changes approved in Drill.
> The query with "SELECT *" with "ORDER BY" clause and `planner.slice_target`=1
> doesn't preserve column order
> -----------------------------------------------------------------------------------------------------------
>
> Key: DRILL-5822
> URL: https://issues.apache.org/jira/browse/DRILL-5822
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.11.0
> Reporter: Prasad Nagaraj Subramanya
> Assignee: Vitalii Diravka
> Priority: Major
> Fix For: 1.12.0
>
>
> Column order is not preserved for a star query with sorting when the query
> is planned into multiple fragments.
> Repro steps:
> 1) {code}alter session set `planner.slice_target`=1;{code}
> 2) ORDER BY clause in the query.
> Scenarios:
> {code}
> 0: jdbc:drill:zk=local> alter session reset `planner.slice_target`;
> +-------+--------------------------------+
> | ok | summary |
> +-------+--------------------------------+
> | true | planner.slice_target updated. |
> +-------+--------------------------------+
> 1 row selected (0.082 seconds)
> 0: jdbc:drill:zk=local> select * from cp.`tpch/nation.parquet` order by n_name limit 1;
> +--------------+----------+--------------+------------------------------------------------------+
> | n_nationkey | n_name | n_regionkey | n_comment |
> +--------------+----------+--------------+------------------------------------------------------+
> | 0 | ALGERIA | 0 | haggle. carefully final deposits detect slyly agai |
> +--------------+----------+--------------+------------------------------------------------------+
> 1 row selected (0.141 seconds)
> 0: jdbc:drill:zk=local> alter session set `planner.slice_target`=1;
> +-------+--------------------------------+
> | ok | summary |
> +-------+--------------------------------+
> | true | planner.slice_target updated. |
> +-------+--------------------------------+
> 1 row selected (0.091 seconds)
> 0: jdbc:drill:zk=local> select * from cp.`tpch/nation.parquet` order by n_name limit 1;
> +------------------------------------------------------+----------+--------------+--------------+
> | n_comment | n_name | n_nationkey | n_regionkey |
> +------------------------------------------------------+----------+--------------+--------------+
> | haggle. carefully final deposits detect slyly agai | ALGERIA | 0 | 0 |
> +------------------------------------------------------+----------+--------------+--------------+
> 1 row selected (0.201 seconds)
> {code}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)