[jira] [Comment Edited] (HIVE-16177) non Acid to acid conversion doesn't handle _copy_N files

Eugene Koifman (JIRA) Wed, 15 Mar 2017 17:39:07 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-16177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15927248#comment-15927248
 ]


Eugene Koifman edited comment on HIVE-16177 at 3/16/17 12:38 AM:
-----------------------------------------------------------------

For the Compactor we should create an org.apache.hadoop.hive.ql.io.orc.Reader 
that can wrap the individual Readers for all copy_N files for a given bucket 
and return rows in order.
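
To illustrate the ordering idea, here is a minimal sketch that is not taken from any patch on this issue. It assumes file names of the form seen here (000001_0, 000001_0_copy_1), that all files passed in belong to one bucket, and that each file's row count is known. The class CopyFileRowIds and its methods are hypothetical names. It orders the files by copy index and gives each one a starting row id, so numbering continues across files instead of restarting at 0.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hive.
public class CopyFileRowIds {
    // Assumed naming pattern: <bucket>_<attempt> with an optional _copy_<N> suffix.
    private static final Pattern NAME =
        Pattern.compile("(\\d+)_(\\d+)(?:_copy_(\\d+))?");

    /** Returns 0 for the original file and N for a _copy_N file. */
    static int copyIndex(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized file name: " + fileName);
        }
        return m.group(3) == null ? 0 : Integer.parseInt(m.group(3));
    }

    /**
     * Given the row count of each file in one bucket, returns the first
     * row id to use for each file, in read order (original first, then
     * copies by increasing N, ties broken by name).
     */
    static Map<String, Long> firstRowIds(Map<String, Long> rowCounts) {
        List<String> files = new ArrayList<>(rowCounts.keySet());
        files.sort((a, b) -> {
            int c = Integer.compare(copyIndex(a), copyIndex(b));
            return c != 0 ? c : a.compareTo(b);
        });
        Map<String, Long> result = new LinkedHashMap<>();
        long next = 0;
        for (String f : files) {
            result.put(f, next);      // this file's rows start at 'next'
            next += rowCounts.get(f); // the following file continues from here
        }
        return result;
    }
}
```

With one row in each of 000001_0 and 000001_0_copy_1, this assigns starting row ids 0 and 1 respectively, so the two rows no longer share rowid 0.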

Also, the AcidInputFormat should throw if it finds a directory layout it doesn't 
understand.  This should never happen for data written after the table is made 
acid (Tez + CTAS + Union ?)  but can happen for non-acid tables converted to 
acid (before major compaction):

[5:17 PM] Sergey Shelukhin: 1) list bucketing
[5:18 PM] Sergey Shelukhin: 2) any time, if the MR recursive-whatever setting 
is enabled
[5:18 PM] Sergey Shelukhin: 3) iirc Hive can produce it sometimes from unions 
but I'm not sure

Probably _alter table T SET TBLPROPERTIES ('transactional'='true')_ should 
check the table to make sure it has a directory structure Acid can handle 
and fail if not.  This may be expensive for a table with lots of partitions.
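
A rough sketch of what such a check could look like follows. It is an illustration only and does not use Hive or Hadoop APIs. It assumes the caller has already listed every file under a partition directory as a path relative to that directory, and that the only acceptable layout is flat files named like 000001_0 or 000001_0_copy_1. LayoutCheck, unrecognized and requireConvertible are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical pre-conversion check, not part of Hive.
public class LayoutCheck {
    // Assumed pattern for bucket files, including _copy_N variants.
    private static final Pattern BUCKET_FILE =
        Pattern.compile("\\d+_\\d+(_copy_\\d+)?");

    /** Returns the relative paths that do not look like flat bucket files. */
    static List<String> unrecognized(List<String> relativePaths) {
        List<String> bad = new ArrayList<>();
        for (String p : relativePaths) {
            // A '/' means the file sits in a nested subdirectory.
            if (p.contains("/") || !BUCKET_FILE.matcher(p).matches()) {
                bad.add(p);
            }
        }
        return bad;
    }

    /** Fails if any path is not a flat bucket file. */
    static void requireConvertible(List<String> relativePaths) {
        List<String> bad = unrecognized(relativePaths);
        if (!bad.isEmpty()) {
            throw new IllegalStateException("Unsupported layout: " + bad);
        }
    }
}
```

The cost concern above still applies: a check like this needs a full listing of every partition directory.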


TezCompiler.java has
{noformat}
// We require the use of recursive input dirs for union processing
    conf.setBoolean("mapred.input.dir.recursive", true);
{noformat}



was (Author: ekoifman):
For the Compactor we should create an org.apache.hadoop.hive.ql.io.orc.Reader 
that can wrap the individual Readers for all copy_N files for a given bucket 
and return rows in order.

Also, the AcidInputFormat should throw if it finds a directory layout it doesn't 
understand.  This should never happen for data written after the table is made 
acid (Tez + CTAS + Union ?)  but can happen for non-acid tables converted to 
acid (before major compaction):

[5:17 PM] Sergey Shelukhin: 1) list bucketing
[5:18 PM] Sergey Shelukhin: 2) any time, if the MR recursive-whatever setting 
is enabled
[5:18 PM] Sergey Shelukhin: 3) iirc Hive can produce it sometimes from unions 
but I'm not sure

Probably _alter table T SET TBLPROPERTIES ('transactional'='true')_ should 
check the table to make sure it has a directory structure Acid can handle 
and fail if not.  This may be expensive for a table with lots of partitions.



> non Acid to acid conversion doesn't handle _copy_N files
> --------------------------------------------------------
>
>                 Key: HIVE-16177
>                 URL: https://issues.apache.org/jira/browse/HIVE-16177
>             Project: Hive
>          Issue Type: Bug
>          Components: Transactions
>    Affects Versions: 0.14.0
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>            Priority: Blocker
>         Attachments: HIVE-16177.01.patch, HIVE-16177.02.patch, 
> HIVE-16177.04.patch
>
>
> {noformat}
> create table T(a int, b int) clustered by (a)  into 2 buckets stored as orc 
> TBLPROPERTIES('transactional'='false')
> insert into T(a,b) values(1,2)
> insert into T(a,b) values(1,3)
> alter table T SET TBLPROPERTIES ('transactional'='true')
> {noformat}
>     //we should now have bucket files 000001_0 and 000001_0_copy_1
> but OrcRawRecordMerger.OriginalReaderPair.next() doesn't know that there can 
> be copy_N files and numbers rows in each bucket from 0 thus generating 
> duplicate IDs
> {noformat}
> select ROW__ID, INPUT__FILE__NAME, a, b from T
> {noformat}
> produces 
> {noformat}
> {"transactionid":0,"bucketid":1,"rowid":0},file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands.../warehouse/nonacidorctbl/000001_0,1,2
> {"transactionid":0,"bucketid":1,"rowid":0},file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands.../warehouse/nonacidorctbl/000001_0_copy_1,1,3
> {noformat}
> [~owen.omalley], do you have any thoughts on a good way to handle this?
> attached patch has a few changes to make Acid even recognize copy_N but this 
> is just a pre-requisite.  The new UT demonstrates the issue.
> Furthermore,
> {noformat}
> alter table T compact 'major'
> select ROW__ID, INPUT__FILE__NAME, a, b from T order by b
> {noformat}
> produces 
> {noformat}
> {"transactionid":0,"bucketid":1,"rowid":0}    
> file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands....warehouse/nonacidorctbl/base_-9223372036854775808/bucket_00001
>     1       2
> {noformat}
> HIVE-16177.04.patch has TestTxnCommands.testNonAcidToAcidConversion0() 
> demonstrating this
> This is because compactor doesn't handle copy_N files either (skips them)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
