[ 
https://issues.apache.org/jira/browse/HIVE-21451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vaibhav Gumashta updated HIVE-21451:
------------------------------------
    Description: 
The transactional files written by Hive have each row decorated with a 
{{ROW__ID}} column. However, files brought into a transactional table with the 
{{LOAD DATA...}} command do not have these metadata columns (in Hive ACID 
parlance, these are called original files). Original files are decorated with 
an inferred {{ROW__ID}} that is generated at read time. After they are 
compacted, however, the {{ROW__ID}} metadata column becomes part of the file 
itself.

To determine whether a file is original, we currently check for the presence 
of {{hive.acid.key.index}}. Query-based compaction currently does not write 
{{hive.acid.key.index}} (HIVE-21165). This means that, even after compaction, 
such files may still be treated as original files.

Irrespective of HIVE-21165, we should avoid using {{hive.acid.key.index}} to 
decide whether a file is original, and instead look for the presence of 
{{ROW__ID}}. {{hive.acid.key.index}} should be treated as a performance 
optimization, as it was seemingly meant to be.
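
A minimal sketch of the proposed check, not the actual Hive code. It assumes 
that a file carrying {{ROW__ID}} exposes the top-level ORC fields 
originalTransaction, bucket and rowId; these names should be verified against 
the Hive version in use. The class and method names are hypothetical, and the 
check works on a plain list of field names rather than a real ORC reader.

```java
import java.util.Arrays;
import java.util.List;

class AcidFileCheck {
    // Assumption: the ROW__ID components are stored as these top-level fields.
    static final List<String> ROW_ID_FIELDS =
        Arrays.asList("originalTransaction", "bucket", "rowId");

    // A file is "original" when its schema lacks the ROW__ID columns.
    // The presence of hive.acid.key.index is deliberately not consulted.
    static boolean isOriginal(List<String> topLevelFieldNames) {
        return !topLevelFieldNames.containsAll(ROW_ID_FIELDS);
    }

    public static void main(String[] args) {
        // Hypothetical schema of a file brought in with LOAD DATA.
        List<String> loaded = Arrays.asList("id", "name");
        // Hypothetical schema of a compacted file without hive.acid.key.index.
        List<String> compacted = Arrays.asList(
            "operation", "originalTransaction", "bucket", "rowId",
            "currentTransaction", "row");

        System.out.println(isOriginal(loaded));    // prints "true"
        System.out.println(isOriginal(compacted)); // prints "false"
    }
}
```

With a schema-based check like this, a file produced by query-based compaction 
would be recognized as non-original even though it lacks 
{{hive.acid.key.index}}.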

  was:
The transactional files written in hive have each row decorated with ROW__ID 
column. However, when we bring in files using LOAD DATA... command to the 
transactional tables, they do not have these metadata columns (in Hive ACID 
parlance, these are called original files). These original files are decorated 
with an inferred ROW__ID generated while reading these. However, after these 
are compacted, the ROW__ID metadata column, becomes part of the file itself.

To determine if a file is original or not, currently we use check for the 
presence of hive.acid.key.index. For query based compaction, currently we do 
not write hive.acid.key.index (HIVE-21165). This means, there is a possibility 
that that even after compaction, they get treated as original files.

Irrespective of HIVE-21165, we should avoid hive.acid.key.index to decide 
whether the file is original or not, and instead look for the presence of 
ROW__ID to do that. hive.acid.key.index should be treated as a performance 
optimization, as it was seemingly meant to be.


> ACID: Avoid using hive.acid.key.index to determine if the file is original or 
> not
> ---------------------------------------------------------------------------------
>
>                 Key: HIVE-21451
>                 URL: https://issues.apache.org/jira/browse/HIVE-21451
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Transactions
>    Affects Versions: 3.1.1
>            Reporter: Vaibhav Gumashta
>            Priority: Major
>
> The transactional files written by Hive have each row decorated with a 
> {{ROW__ID}} column. However, files brought into a transactional table with 
> the {{LOAD DATA...}} command do not have these metadata columns (in Hive ACID 
> parlance, these are called original files). Original files are decorated 
> with an inferred {{ROW__ID}} that is generated at read time. After they are 
> compacted, however, the {{ROW__ID}} metadata column becomes part of the file 
> itself.
> To determine whether a file is original, we currently check for the presence 
> of {{hive.acid.key.index}}. Query-based compaction currently does not write 
> {{hive.acid.key.index}} (HIVE-21165). This means that, even after 
> compaction, such files may still be treated as original files.
> Irrespective of HIVE-21165, we should avoid using {{hive.acid.key.index}} to 
> decide whether a file is original, and instead look for the presence of 
> {{ROW__ID}}. {{hive.acid.key.index}} should be treated as a performance 
> optimization, as it was seemingly meant to be.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
