[ 
https://issues.apache.org/jira/browse/IMPALA-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781193#comment-16781193
 ] 

Tim Armstrong commented on IMPALA-8265:
---------------------------------------

Do you mean ORDER BY instead of SORT BY in the first example? I get a syntax 
error when running the first statement.

To make sure I understand, you're saying that this is a correctness issue 
because a user might think that the ORDER BY will guarantee that conflicts 
will be resolved in a particular way?

I checked and a warning is generated with the ORDER BY version.
{noformat}
[localhost:21000] functional_kudu> explain UPSERT INTO test SELECT * FROM test ORDER BY timestamp_col ASC;
Query: explain UPSERT INTO test SELECT * FROM test ORDER BY timestamp_col ASC
+---------------------------------------------------------------------------------------------------+
| Explain String                                                                                    |
+---------------------------------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=4.00MB Threads=3                                        |
| Per-Host Resource Estimates: Memory=171MB                                                         |
| WARNING: The following tables are missing relevant table and/or column statistics.                |
| functional_kudu.test                                                                              |
|                                                                                                   |
| UPSERT INTO KUDU [functional_kudu.test]                                                           |
| |                                                                                                 |
| 02:PARTIAL SORT                                                                                   |
| |  order by: KuduPartition(functional_kudu.test.kudu_idx) ASC NULLS LAST, kudu_idx ASC NULLS LAST |
| |  row-size=94B cardinality=unavailable                                                           |
| |                                                                                                 |
| 01:EXCHANGE [KUDU(KuduPartition(functional_kudu.test.kudu_idx))]                                  |
| |                                                                                                 |
| 00:SCAN KUDU [functional_kudu.test]                                                               |
|    row-size=98B cardinality=unavailable                                                           |
+---------------------------------------------------------------------------------------------------+
WARNINGS: Ignoring ORDER BY clause without LIMIT or OFFSET: ORDER BY timestamp_col ASC.
An ORDER BY appearing in a view, subquery, union operand, or an insert/ctas statement has no effect on the query result unless a LIMIT and/or OFFSET is used in conjunction with the ORDER BY.
{noformat}

We could probably debate whether the warning or a hard failure is better 
behaviour in an ideal world, but it's hard to justify changing it at this point 
and breaking working queries.
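
For anyone who lands here wanting deterministic conflict resolution: since the 
ORDER BY is ignored, one option is to reduce the source to a single row per key 
before the write, so the result no longer depends on row order. The sketch below 
illustrates the idea with Python's sqlite3 module as a stand-in engine; it has 
not been run against Impala or Kudu. The table and column names (target, id, 
val, ts) are hypothetical, and SQLite's INSERT OR REPLACE stands in for UPSERT. 
Impala does support ROW_NUMBER() OVER (...), so the inner SELECT should carry 
over with little change, but treat that as an assumption to verify.

```python
import sqlite3

# Hypothetical schema, chosen only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE duplicate_row_table (id INTEGER, val TEXT, ts INTEGER);
    CREATE TABLE target (id INTEGER PRIMARY KEY, val TEXT, ts INTEGER);
    INSERT INTO duplicate_row_table VALUES
        (1, 'old', 10), (1, 'new', 20),
        (2, 'new', 7),  (2, 'old', 5);
""")

# Keep exactly one row per key (the one with the largest ts), so the write
# does not depend on the order in which rows reach the sink.
DEDUP_SELECT = """
    SELECT id, val, ts FROM (
        SELECT id, val, ts,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC) AS rn
        FROM duplicate_row_table
    ) AS ranked
    WHERE rn = 1
"""

# INSERT OR REPLACE is SQLite's stand-in for UPSERT in this sketch.
conn.execute("INSERT OR REPLACE INTO target " + DEDUP_SELECT)
rows = conn.execute("SELECT id, val, ts FROM target ORDER BY id").fetchall()
print(rows)
```

With the duplicates removed up front, each key is written once, so there is no 
conflict left for the sink to resolve.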

> Reject INSERT/UPSERT  queries with ORDER BY and no OFFSET/LIMIT
> ---------------------------------------------------------------
>
>                 Key: IMPALA-8265
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8265
>             Project: IMPALA
>          Issue Type: Improvement
>            Reporter: Andy Stadtler
>            Priority: Critical
>
> Currently Impala doesn't honor a sort by without a limit or offset in an 
> insert ... select operation. While Impala currently throws a warning, it 
> seems like this query should be rejected with the same message. Especially 
> now with the UPSERT ability and Kudu, it's an obvious approach to take a 
> table of duplicate rows and use the following query.
> {code:java}
> UPSERT INTO kudu_table SELECT col1, col2, col3 FROM duplicate_row_table SORT 
> BY timestamp_column ASC;{code}
> Impala will happily take this query and write incorrect data. The same query 
> works fine as a SELECT only query and it's easy to see where users would make 
> the mistake of reusing it in an INSERT/UPSERT.
>  
> Rejecting the query with the warning message would make sure the user knew 
> the ORDER BY would not be honored and make sure they added a limit, changed 
> their query logic or removed the order by.
>  
> {quote}*Sorting considerations:* Although you can specify an {{ORDER BY}} 
> clause in an {{INSERT ... SELECT}} statement, any {{ORDER BY}} clause is 
> ignored and the results are not necessarily sorted. An {{INSERT ... SELECT}} 
> operation potentially creates many different data files, prepared on 
> different data nodes, and therefore the notion of the data being stored in 
> sorted order is impractical.
> {quote}


