[jira] [Commented] (HIVE-16177) non Acid to acid conversion doesn't handle _copy_N files

Sergey Shelukhin (JIRA) Fri, 10 Mar 2017 16:09:46 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-16177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905909#comment-15905909
 ]


Sergey Shelukhin commented on HIVE-16177:
-----------------------------------------

Bucket handling in Hive in general is completely screwed up, and inconsistent 
in different places (e.g. sampling and, IIRC, some other code would just take 
files in order, regardless of names, even if there are fewer or more files than 
needed).

Maybe there needs to be some work to enforce it better via some central utility 
or manager class that would get all files for a bucket and validate buckets 
more strictly.
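
A minimal sketch of what such a utility could look like, assuming bucket files 
are named like the ones in the issue below (000001_0, 000001_0_copy_1). The 
class BucketFileGrouper, its method names and the exact naming pattern are 
hypothetical and not existing Hive code; it only shows grouping files by the 
bucket id in the name and rejecting names that do not fit, instead of taking 
files in listing order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not an existing Hive class.
final class BucketFileGrouper {
    // Assumed naming: <bucket digits>_<attempt>[_copy_<N>], e.g. 000001_0_copy_1.
    private static final Pattern BUCKET_FILE =
        Pattern.compile("^(\\d+)_(\\d+)(?:_copy_(\\d+))?$");

    private static Matcher parse(String name) {
        Matcher m = BUCKET_FILE.matcher(name);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a bucket file name: " + name);
        }
        return m;
    }

    // 0 for the original file, N for a _copy_N file.
    static int copyIndex(String name) {
        String copy = parse(name).group(3);
        return copy == null ? 0 : Integer.parseInt(copy);
    }

    // Groups file names by bucket id and orders each group by copy index.
    // Fails on malformed names and on bucket ids outside [0, numBuckets).
    static Map<Integer, List<String>> groupByBucket(List<String> fileNames, int numBuckets) {
        Map<Integer, List<String>> byBucket = new TreeMap<>();
        for (String name : fileNames) {
            int bucket = Integer.parseInt(parse(name).group(1));
            if (bucket >= numBuckets) {
                throw new IllegalArgumentException(
                    "Bucket " + bucket + " out of range for " + numBuckets + " buckets: " + name);
            }
            byBucket.computeIfAbsent(bucket, k -> new ArrayList<>()).add(name);
        }
        for (List<String> files : byBucket.values()) {
            files.sort(Comparator.comparingInt(BucketFileGrouper::copyIndex));
        }
        return byBucket;
    }
}
```

Callers would then look up the files of a bucket by id rather than by 
position in a directory listing, so a missing or extra file shows up as an 
explicit error or an empty group.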

> non Acid to acid conversion doesn't handle _copy_N files
> --------------------------------------------------------
>
>                 Key: HIVE-16177
>                 URL: https://issues.apache.org/jira/browse/HIVE-16177
>             Project: Hive
>          Issue Type: Bug
>          Components: Transactions
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>            Priority: Critical
>         Attachments: HIVE-16177.01.patch, HIVE-16177.02.patch
>
>
> {noformat}
> create table T(a int, b int) clustered by (a) into 2 buckets stored as orc 
> TBLPROPERTIES('transactional'='false');
> insert into T(a,b) values(1,2);
> insert into T(a,b) values(1,3);
> alter table T SET TBLPROPERTIES ('transactional'='true');
> {noformat}
>     //we should now have bucket files 000001_0 and 000001_0_copy_1
> but OrcRawRecordMerger.OriginalReaderPair.next() doesn't know that there can 
> be copy_N files and numbers rows in each bucket from 0 thus generating 
> duplicate IDs
> {noformat}
> select ROW__ID, INPUT__FILE__NAME, a, b from T
> {noformat}
> produces 
> {noformat}
> {"transactionid":0,"bucketid":1,"rowid":0},file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands.../warehouse/nonacidorctbl/000001_0,1,2
> {"transactionid":0,"bucketid":1,"rowid":0},file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands.../warehouse/nonacidorctbl/000001_0_copy_1,1,3
> {noformat}
> [~owen.omalley], do you have any thoughts on a good way to handle this?
> The attached patch has a few changes to make Acid even recognize copy_N 
> files, but this is just a prerequisite. The new unit test demonstrates the 
> issue.
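
To make the duplicate-ID problem in the quoted description concrete, here is a 
small sketch of one conceivable way to keep row IDs unique within a bucket: 
visit the bucket's files in a fixed order and start each file's numbering 
where the previous file ended. This is only an illustration under that 
assumption; the class CopyFileRowIds and its method are hypothetical, and 
nothing here is taken from OrcRawRecordMerger or from the attached patches.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not code from OrcRawRecordMerger.
final class CopyFileRowIds {
    // Given the files of one bucket in a fixed order and the row count of each,
    // returns the first row ID to assign in every file. Numbering continues
    // across files instead of restarting at 0 for each one.
    static Map<String, Long> startingRowIds(List<String> orderedFiles,
                                            Map<String, Long> rowCounts) {
        Map<String, Long> start = new LinkedHashMap<>();
        long next = 0;
        for (String file : orderedFiles) {
            Long count = rowCounts.get(file);
            if (count == null) {
                throw new IllegalArgumentException("No row count for " + file);
            }
            start.put(file, next);
            next += count;
        }
        return start;
    }
}
```

With the two single-row files from the example, 000001_0 would start at row ID 
0 and 000001_0_copy_1 at row ID 1, so the two rows no longer share 
{"transactionid":0,"bucketid":1,"rowid":0}.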



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
