[ 
https://issues.apache.org/jira/browse/IMPALA-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16783867#comment-16783867
 ] 

Tim Armstrong commented on IMPALA-8265:
---------------------------------------

Yeah, I think I understand the problem now. It's definitely an unexpected 
interaction between multiple decisions that each made sense in isolation. We 
definitely don't want any surprising behaviour, but we also don't want to break 
any existing workflows, so ideally we would have a solution that avoids both 
kinds of issues. I think we'd have to consider making this a hard failure to be 
a breaking change, so it would not be valid in a minor release. Here are some ideas:

# We could do nothing, which ensures no existing workflows are broken, and try 
to improve documentation. Potentially we could add a flag and/or switch the 
behaviour later in a major version. This leaves the potential for confusion 
among users.
# We could change it to a hard error immediately, maybe overridable by an 
option. I think this is unacceptable because of the potential for breakage.
# We could change the behaviour so that the ORDER BY is actually honoured, which 
is what users expect. This solves the confusion and doesn't break existing 
workflows (aside from weirdly-written queries getting slower because of the 
sort). There are two variants:
## Ordering is enforced only between rows with the same primary key. I.e. we 
can still partition rows by the primary key and insert in parallel. This would 
mean that the side-effects of inserts are not strictly ordered.
## Ordering is enforced among all rows. This would force us to send all rows 
through the same node.

To me, options 1 and 3.1 seem viable. Option 3.1 requires some real work but 
avoids the biggest downsides and makes some new workloads possible. We already 
insert sorts before Kudu inserts/upserts, but this changes the semantics a bit.

> Reject INSERT/UPSERT  queries with ORDER BY and no OFFSET/LIMIT
> ---------------------------------------------------------------
>
>                 Key: IMPALA-8265
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8265
>             Project: IMPALA
>          Issue Type: Improvement
>            Reporter: Andy Stadtler
>            Priority: Critical
>
> Currently Impala doesn't honor an ORDER BY without a LIMIT or OFFSET in an 
> INSERT ... SELECT operation. While Impala currently throws a warning, it seems 
> like this query should be rejected with the same message. Especially now, with 
> the UPSERT ability and Kudu, it's obvious logic to take a table of duplicate 
> rows and use the following query.
> {code:java}
> UPSERT INTO kudu_table SELECT col1, col2, col3 FROM duplicate_row_table ORDER 
> BY timestamp_column ASC;{code}
> Impala will happily take this query and write incorrect data. The same query 
> works fine as a SELECT only query and it's easy to see where users would make 
> the mistake of reusing it in an INSERT/UPSERT.
>  
> Rejecting the query with the warning message would make sure the user knew 
> the ORDER BY would not be honored and make sure they added a limit, changed 
> their query logic or removed the order by.
>  
> {quote}*Sorting considerations:* Although you can specify an {{ORDER BY}} 
> clause in an {{INSERT ... SELECT}} statement, any {{ORDER BY}} clause is 
> ignored and the results are not necessarily sorted. An {{INSERT ... SELECT}} 
> operation potentially creates many different data files, prepared on 
> different data nodes, and therefore the notion of the data being stored in 
> sorted order is impractical.
> {quote}


