[jira] [Updated] (HIVE-18814) Support Add Partition For Acid tables

Eugene Koifman (JIRA) Wed, 07 Mar 2018 14:43:01 -0800

     [ 
https://issues.apache.org/jira/browse/HIVE-18814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Eugene Koifman updated HIVE-18814:
----------------------------------
    Description: 
[https://cwiki.apache.org/confluence/display/Hive/LanguageManual%2BDDL#LanguageManualDDL-AddPartitions]

Add Partition command creates a {{Partition}} metadata object and sets the 
location to the directory containing data files.

In current master (Hive 3.0), Add Partition on an acid table doesn't fail and 
at read time the data is decorated with ROW__ID but the original transaction is 
0.  I suspect in earlier Hive versions this will throw or return no data.
Since this new partition didn't have data before, assigning txnid:0 isn't going 
to generate duplicate IDs, but it could violate Snapshot Isolation in 
multi-statement transactions.  Suppose txnid:7 runs {{select * from T}}.  Then 
txnid:8 adds a partition to T.  Now if txnid:7 runs the same query again, it 
will see the data in the new partition, because rows stamped with txnid:0 are 
visible to every snapshot.
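
The scenario above can be sketched with a toy visibility model.  This is only 
an illustration, not Hive code: {{ToyTable}}, {{select_all}} and the single 
"high-water mark" snapshot are assumptions made for the example (real Hive 
snapshots also track open and aborted transactions).  It shows why data 
stamped with id 0 leaks into an existing snapshot while data stamped with a 
freshly allocated id does not.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Hypothetical stand-in for an acid table: partition name -> [(id, row)]."""
    partitions: dict = field(default_factory=dict)

    def add_partition(self, name, rows, stamped_id):
        # Every row in the partition is treated as written by `stamped_id`.
        self.partitions[name] = [(stamped_id, row) for row in rows]

    def select_all(self, snapshot_hwm):
        # A reader sees a row only if its id is at or below the snapshot's
        # high-water mark (a deliberate simplification of snapshot isolation).
        return [
            row
            for part in self.partitions.values()
            for (stamped_id, row) in part
            if stamped_id <= snapshot_hwm
        ]


def run_scenario(new_partition_id):
    """txn 7 reads, txn 8 adds a partition, txn 7 reads again."""
    table = ToyTable()
    table.add_partition("p=1", ["a", "b"], stamped_id=5)  # committed earlier
    snapshot_hwm = 7                                       # snapshot of txn 7
    first_read = table.select_all(snapshot_hwm)
    table.add_partition("p=2", ["c"], stamped_id=new_partition_id)
    second_read = table.select_all(snapshot_hwm)
    return first_read, second_read
```

With {{run_scenario(0)}} the second read returns the extra row from the new 
partition; with {{run_scenario(8)}} both reads return the same rows.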

 

One option is to follow the Load Data approach: create a new delta_x_x/ 
directory and move/copy the data there.
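
A minimal sketch of that option, using plain local files rather than Hive or 
HDFS APIs.  The helper names {{delta_dir_name}} and {{load_into_delta}} are 
hypothetical, and the zero-padded width in the directory name is an assumption 
for illustration; the ticket itself only says delta_x_x/.

```python
import shutil
from pathlib import Path


def delta_dir_name(write_id: int) -> str:
    # delta_x_x where both ends are the same id; the 7-digit padding is assumed.
    return f"delta_{write_id:07d}_{write_id:07d}"


def load_into_delta(source_dir: Path, partition_dir: Path, write_id: int) -> Path:
    """Copy every file from source_dir into <partition_dir>/delta_x_x/."""
    delta = partition_dir / delta_dir_name(write_id)
    # Fail loudly if the delta already exists instead of mixing files into it.
    delta.mkdir(parents=True, exist_ok=False)
    for src in sorted(source_dir.iterdir()):
        if src.is_file():
            shutil.copy2(src, delta / src.name)  # copy; the source is left alone
    return delta
```

Copying keeps the external directory intact; a move would avoid the duplicate 
storage but changes the caller's data layout.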

 

Another is to allocate a new writeid and save it in the Partition metadata.  
This could then be used to decorate data with ROW__IDs.  This avoids the 
move/copy but keeps the data "outside" of the table tree, which makes it more 
likely that the data will be modified in some way.  That can really break 
things if it happens after any SQL update/delete on this data.
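
As a rough sketch of this second option: the writeid lives in the partition 
metadata and is attached to each row only at read time.  {{PartitionMeta}}, 
{{decorate}} and the (writeid, bucket, row index) triple are simplified, 
hypothetical stand-ins for illustration, not Hive's actual classes or its 
exact ROW__ID encoding.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class PartitionMeta:
    """Hypothetical partition metadata: data stays at `location`, untouched."""
    location: str
    write_id: int  # allocated once, when the partition is added


def decorate(
    rows: Iterable[Any], meta: PartitionMeta, bucket_id: int = 0
) -> Iterator[Tuple[Tuple[int, int, int], Any]]:
    # Synthesize an id per row from the stored writeid plus the row's position.
    # The ids stay stable only as long as the underlying files are not changed.
    for row_index, row in enumerate(rows):
        yield (meta.write_id, bucket_id, row_index), row
```

Because the ids depend on row position, any out-of-band change to the files 
shifts them, which is the fragility described above.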

 

Add Partition performs no validation on add (except for the partition spec), 
so any file in any format can be added.  It also allows adding to bucketed 
tables.

Seems like a very dangerous command.  Maybe a better option is to block it and 
advise using Load Data.  Alternatively, make this do an Add Partition metadata 
op followed by Load Data. 

 

 

  was:
[https://cwiki.apache.org/confluence/display/Hive/LanguageManual%2BDDL#LanguageManualDDL-AddPartitions]

Add Partition command creates a {{Partition}} metadata object and sets the 
location to the directory containing data files.

In current master (Hive 3.0), Add partition on an acid table doesn't fail and 
at read time the data is decorated with row__id but the original transaction is 
0.  I suspect in earlier Hive versions this will throw or return no data.

 

One option is follow Load Data approach and create a new delta_x_x/ and 
move/copy the data there.

 

Another is to allocate a new writeid and save it in Partition metadata.  This 
could then be used to decorate data with ROW__IDs.  This avoids move/copy but 
retains data "outside" of the table tree which make it more likely that this 
data will be modified in some way which can really break things if done after 
and SQL update/delete on this data have happened. 

 

It performs no validations on add (except for partition spec) so any file with 
any format can be added.  It allows add to bucketed tables as well.

Seems like a very dangerous command.  Maybe a better option is to block it and 
advise using Load Data.  Alternatively, make this do Add partition metadata op 
followed by Load Data. 

 

 


> Support Add Partition For Acid tables
> -------------------------------------
>
>                 Key: HIVE-18814
>                 URL: https://issues.apache.org/jira/browse/HIVE-18814
>             Project: Hive
>          Issue Type: New Feature
>          Components: Transactions
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>            Priority: Major
>         Attachments: HIVE-18814.wip.patch
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
