[ 
https://issues.apache.org/jira/browse/HIVE-8367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164332#comment-14164332
 ] 

Alan Gates commented on HIVE-8367:
----------------------------------

bq. What was the original query where the issue showed up?
{code}
create table concur_orc_tab(name varchar(50), age int, gpa decimal(3, 2))
clustered by (age) into 2 buckets stored as orc
TBLPROPERTIES ('transactional'='true');
-- loads 10k records into the table
insert into table concur_orc_tab select * from texttab;
delete from concur_orc_tab where age >= 20 and age < 30;
{code}
This resulted in only some rows being deleted (~300 of the 1700 that should 
have been deleted).
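
A quick way to check for the symptom, as a sketch only: it assumes the table 
from the repro above and a session already configured for transactions.  After 
a correct delete the count should be 0; with the bug, the rows whose delete 
records were written out of order are still visible.
{code}
-- Sketch, not part of the patch: count rows that the delete should have removed.
select count(*) from concur_orc_tab where age >= 20 and age < 30;
{code}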

bq. What precisely was the problem and how does the RS deduplication change help?
The problem was that the code turned off RS deduplication, which produced a 
plan with two MR jobs.  The sort by ROW__ID was done in job 1 and the 
bucketing in job 2, so the bucketing in job 2 partially undid the sorting of 
job 1.  Since records have to be written to the delta file in proper order, 
only some of the records showed up as deleted.  The patch sets the minimum 
number of reducers on which to apply RS deduplication to 1, so that the 
optimization is used even for small queries.
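
To look at the plan shape by hand, something like the following may help.  It 
is a sketch, not what the patch does: the two properties are Hive's existing 
RS deduplication settings, and it is an assumption that setting them in the 
session gives the same plan the patch forces for delete.  It also assumes the 
session is already set up for transactions.
{code}
-- Assumption: these session settings mirror what the patch applies for delete.
set hive.optimize.reducededuplication=true;
set hive.optimize.reducededuplication.min.reducer=1;
-- With deduplication applied, the sort by ROW__ID and the bucketing should
-- share one reduce stage instead of being split across two MR jobs.
explain delete from concur_orc_tab where age >= 20 and age < 30;
{code}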

bq. How are the changes to the sort order of ROW__ID related?
That should never have been set to descending in the first place.  ROW__ID 
needs to be stored ascending to work properly.  I suspect it was a fluke that 
most of the qfile tests worked with it set to descending.  (Actually, Thejas 
asked at the time why this was necessary, and rather than fixing it, which I 
should have done, I just said I didn't know.  Oops.)

bq.  ReduceSinkDeDuplication.java change is not needed
What change?  I don't see any changes to that file in the patch.

> delete writes records in wrong order in some cases
> --------------------------------------------------
>
>                 Key: HIVE-8367
>                 URL: https://issues.apache.org/jira/browse/HIVE-8367
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.14.0
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>            Priority: Blocker
>             Fix For: 0.14.0
>
>         Attachments: HIVE-8367.patch
>
>
> I have found one query with 10k records where you do:
> create table
> insert into table -- 10k records
> delete from table -- just some records
> The records in the delete delta are not ordered properly by rowid.
> I assume this applies to updates as well, but I haven't tested it yet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
