[
https://issues.apache.org/jira/browse/IMPALA-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16783867#comment-16783867
]
Tim Armstrong commented on IMPALA-8265:
---------------------------------------
Yeah, I think I understand the problem now. It's definitely an unexpected
interaction between multiple decisions that made sense in isolation. We
definitely don't want any surprising behaviour, but we also don't want to break
any existing workflows, so ideally we would have a solution that avoids both
kinds of issues. I think we'd have to treat making this a hard failure as a
breaking change, so it's not valid in a minor release. Here are some ideas:
# We could do nothing, which ensures no existing workflows are broken, and try
to improve documentation. Potentially we could add a flag and/or switch the
behaviour later in a major version. This leaves the potential for confusion
among users.
# We could change it to a hard error immediately, maybe overridable by an
option. I think this is unacceptable because of the potential for breakage.
# We could change the behaviour so that the ORDER BY is honoured, as users
expect. This solves the confusion and doesn't break existing workflows (aside
from weirdly-written queries getting slower because of the sort). There are
two possible variants:
## Ordering is enforced only between rows with the same primary key. I.e. we
can still partition rows by the primary key and insert in parallel. This would
mean that the side-effects of inserts are not strictly ordered.
## Ordering is enforced among all rows. This would force us to send all rows
through the same node.
To me, options 1 and 3.1 seem viable. 3.1 requires some real work but avoids
the biggest downsides and makes some new workloads possible. We already insert
sorts before Kudu inserts/upserts, but this changes the semantics a bit.
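
To make the semantics of idea 3.1 concrete, here is a rough sketch in Python.
It is only a simulation of the idea, not Impala or Kudu code: the function
name, the dict-based "table" and the row layout are all hypothetical. Rows are
hash-partitioned by primary key, each partition is sorted by the ORDER BY
column, and the upserts are applied per partition, so the last row in sort
order wins for each key even though partitions are independent of each other.

```python
from collections import defaultdict


def upsert_ordered_per_key(rows, key, order_by, num_partitions=3):
    """Simulate idea 3.1 (hypothetical helper, not an Impala API).

    Ordering is enforced only between rows that share a primary key.
    """
    # Partition by primary key: all rows with the same key land in the
    # same partition, so partitions could be processed in parallel.
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)

    table = {}  # stands in for the Kudu table: primary key -> row
    for part in partitions.values():
        # Sort within the partition, then apply the upserts in that order.
        for row in sorted(part, key=lambda r: r[order_by]):
            table[row[key]] = row  # upsert: a later row overwrites an earlier one
    return table


rows = [
    {"id": 1, "ts": 3, "val": "c"},
    {"id": 2, "ts": 1, "val": "x"},
    {"id": 1, "ts": 1, "val": "a"},
    {"id": 1, "ts": 2, "val": "b"},
    {"id": 2, "ts": 5, "val": "y"},
]
table = upsert_ordered_per_key(rows, key="id", order_by="ts")
```

With this input, the row kept for each id is the one with the largest ts,
regardless of how many partitions are used. The relative order of writes to
different keys is not defined, which is the "not strictly ordered" caveat in
3.1.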
> Reject INSERT/UPSERT queries with ORDER BY and no OFFSET/LIMIT
> ---------------------------------------------------------------
>
> Key: IMPALA-8265
> URL: https://issues.apache.org/jira/browse/IMPALA-8265
> Project: IMPALA
> Issue Type: Improvement
> Reporter: Andy Stadtler
> Priority: Critical
>
> Currently Impala doesn't honor an ORDER BY without a LIMIT or OFFSET in an
> INSERT ... SELECT operation. While Impala currently throws a warning, it seems
> like this query should be rejected with the same message. Especially now, with
> the UPSERT ability and Kudu, it's obvious logic to take a table of duplicate
> rows and use the following query.
> {code:java}
> UPSERT INTO kudu_table SELECT col1, col2, col3 FROM duplicate_row_table ORDER
> BY timestamp_column ASC;{code}
> Impala will happily take this query and write incorrect data. The same query
> works fine as a SELECT only query and it's easy to see where users would make
> the mistake of reusing it in an INSERT/UPSERT.
>
> Rejecting the query with the warning message would make sure the user knew
> the ORDER BY would not be honored and make sure they added a limit, changed
> their query logic or removed the order by.
>
> {quote}*Sorting considerations:* Although you can specify an {{ORDER BY}}
> clause in an {{INSERT ... SELECT}} statement, any {{ORDER BY}} clause is
> ignored and the results are not necessarily sorted. An {{INSERT ... SELECT}}
> operation potentially creates many different data files, prepared on
> different data nodes, and therefore the notion of the data being stored in
> sorted order is impractical.
> {quote}