[
https://issues.apache.org/jira/browse/HIVE-16177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15927248#comment-15927248
]
Eugene Koifman commented on HIVE-16177:
---------------------------------------
For Compactor we should create an org.apache.hadoop.hive.ql.io.orc.Reader that
can wrap individual Readers for all copy_N files for a given bucket and return
rows in order.
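A minimal sketch of the ordering idea, written in Python rather than Hive's Java
to keep it short. It is not the real org.apache.hadoop.hive.ql.io.orc.Reader
API: copy_index, chained_rows and the in-memory "files" are hypothetical
stand-ins. It assumes the original files of a bucket are named like 000001_0 and
000001_0_copy_N, reads the base file first and the copies in order of N, and
numbers rows consecutively across all of them so that no two rows in the bucket
get the same rowid.

```python
import re

# Hypothetical pattern: "<bucket file>" optionally followed by "_copy_N".
_COPY_RE = re.compile(r"^(?P<base>\d+_\d+)(?:_copy_(?P<n>\d+))?$")


def copy_index(file_name):
    """Return (base name, copy number); the base file counts as copy 0."""
    m = _COPY_RE.match(file_name)
    if m is None:
        raise ValueError("unrecognized original file name: " + file_name)
    return m.group("base"), int(m.group("n") or 0)


def chained_rows(files):
    """Yield (rowid, file_name, row) over all files of a single bucket.

    `files` maps a file name to its list of rows; the lists stand in for
    the per-file Readers that a wrapping Reader would delegate to.
    """
    bases = {copy_index(name)[0] for name in files}
    if len(bases) > 1:
        raise ValueError("files from more than one bucket: %s" % sorted(bases))
    row_id = 0
    # Sort numerically on N so that copy_10 comes after copy_2.
    for name in sorted(files, key=copy_index):
        for row in files[name]:
            yield row_id, name, row
            row_id += 1  # keep counting across files instead of restarting at 0
```

With the two files from the issue description, the row from 000001_0 would get
rowid 0 and the row from 000001_0_copy_1 would get rowid 1, instead of both
getting 0.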
Also, the AcidInputFormat should throw if it finds a directory layout it doesn't
understand. This should never happen for data written after the table is made
acid (Tez + CTAS + Union ?) but can happen for non-acid tables converted to
acid (before major compaction).
[5:17 PM] Sergey Shelukhin: 1) list bucketing
[5:18 PM] Sergey Shelukhin: 2) any time, if the MR recursive-whatever setting
is enabled
[5:18 PM] Sergey Shelukhin: 3) iirc Hive can produce it sometimes from unions
but I'm not sure
Probably _alter table T SET TBLPROPERTIES ('transactional'='true')_ should
check the table to make sure it has a directory structure Acid can handle
and fail if not. This may be expensive for a table with lots of partitions.
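A rough sketch of what such a check could look like for one partition
directory, in Python for brevity. Everything here is an assumption for
illustration: the thread does not spell out which layouts Acid can handle, so
the sketch simply accepts a flat directory of files named like 000001_0 or
000001_0_copy_1 and rejects subdirectories (as list bucketing or unions might
produce) and any other file name. find_layout_problems and check_convertible are
hypothetical names, not Hive APIs, and the sketch reads a local directory rather
than HDFS.

```python
import os
import re

# Assumption: only flat files named like 000001_0 or 000001_0_copy_1 are OK.
# The real rule set is not given in this thread.
_ORIGINAL_FILE_RE = re.compile(r"^\d+_\d+(_copy_\d+)?$")


def find_layout_problems(partition_dir):
    """Return a list of problems found; an empty list means the check passed."""
    problems = []
    for entry in sorted(os.scandir(partition_dir), key=lambda e: e.name):
        if entry.name.startswith((".", "_")):
            continue  # assumption: hidden and marker files are ignored
        if entry.is_dir():
            problems.append("unexpected subdirectory: " + entry.name)
        elif not _ORIGINAL_FILE_RE.match(entry.name):
            problems.append("unexpected file name: " + entry.name)
    return problems


def check_convertible(partition_dir):
    """Fail the conversion if the directory has a layout we do not recognize."""
    problems = find_layout_problems(partition_dir)
    if problems:
        raise ValueError("cannot convert to acid: " + "; ".join(problems))
```

The check lists one directory per call, so it would have to run once per
partition, which is where the cost for heavily partitioned tables comes from.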
> non Acid to acid conversion doesn't handle _copy_N files
> --------------------------------------------------------
>
> Key: HIVE-16177
> URL: https://issues.apache.org/jira/browse/HIVE-16177
> Project: Hive
> Issue Type: Bug
> Components: Transactions
> Affects Versions: 0.14.0
> Reporter: Eugene Koifman
> Assignee: Eugene Koifman
> Priority: Blocker
> Attachments: HIVE-16177.01.patch, HIVE-16177.02.patch,
> HIVE-16177.04.patch
>
>
> {noformat}
> create table T(a int, b int) clustered by (a) into 2 buckets stored as orc
> TBLPROPERTIES('transactional'='false')
> insert into T(a,b) values(1,2)
> insert into T(a,b) values(1,3)
> alter table T SET TBLPROPERTIES ('transactional'='true')
> {noformat}
> //we should now have bucket files 000001_0 and 000001_0_copy_1
> but OrcRawRecordMerger.OriginalReaderPair.next() doesn't know that there can
> be copy_N files and numbers rows in each bucket from 0 thus generating
> duplicate IDs
> {noformat}
> select ROW__ID, INPUT__FILE__NAME, a, b from T
> {noformat}
> produces
> {noformat}
> {"transactionid":0,"bucketid":1,"rowid":0},file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands.../warehouse/nonacidorctbl/000001_0,1,2
> {"transactionid":0,"bucketid":1,"rowid":0},file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands.../warehouse/nonacidorctbl/000001_0_copy_1,1,3
> {noformat}
> [~owen.omalley], do you have any thoughts on a good way to handle this?
> attached patch has a few changes to make Acid even recognize copy_N but this
> is just a pre-requisite. The new UT demonstrates the issue.
> Furthermore,
> {noformat}
> alter table T compact 'major'
> select ROW__ID, INPUT__FILE__NAME, a, b from T order by b
> {noformat}
> produces
> {noformat}
> {"transactionid":0,"bucketid":1,"rowid":0}
> file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands....warehouse/nonacidorctbl/base_-9223372036854775808/bucket_00001
> 1 2
> {noformat}
> HIVE-16177.04.patch has TestTxnCommands.testNonAcidToAcidConversion0()
> demonstrating this
> This is because compactor doesn't handle copy_N files either (skips them)