[ 
https://issues.apache.org/jira/browse/HIVE-13479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-13479:
----------------------------------
    Remaining Estimate: 160h  (was: 672h)
     Original Estimate: 160h  (was: 672h)

> Relax sorting requirement in ACID tables
> ----------------------------------------
>
>                 Key: HIVE-13479
>                 URL: https://issues.apache.org/jira/browse/HIVE-13479
>             Project: Hive
>          Issue Type: New Feature
>          Components: Transactions
>    Affects Versions: 1.2.0
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>   Original Estimate: 160h
>  Remaining Estimate: 160h
>
> Currently ACID tables require data to be sorted according to an internal
> primary key.  This is so that base + delta files can be efficiently
> sort/merged to produce the snapshot for the current transaction.
> This prevents the user from sorting the table by any other criteria, which
> can be useful.  One example is dynamic partition insert (which also occurs
> for update/delete SQL).  This may create lots of writers
> (buckets*partitions) and tax cluster resources.
> The usual solution is hive.optimize.sort.dynamic.partition=true, which is
> not honored for ACID tables.
> We could rely on a hash-table-based algorithm to merge delta files and then
> not require any particular sort order on ACID tables.  One way to do that is
> to treat each update event as an insert (new internal PK) + delete (old PK).
> Delete events are very small since they only need to contain PKs.  So the
> hash table would only need to contain delete events and would be reasonably
> memory efficient.
> This is a significant amount of work but worth doing.
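
The hash-based merge proposed in the description could look roughly like the
sketch below. It is only an illustration of the idea, not Hive code: the names
RowKey, Event, update_as_events and merge_snapshot are hypothetical, the key
fields are assumed for the example, and transaction visibility (which deltas
belong to the current snapshot) is ignored.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional, Sequence, Tuple


class RowKey(NamedTuple):
    """Hypothetical internal primary key of a row (fields are illustrative)."""
    txn_id: int
    bucket: int
    row_id: int


class Event(NamedTuple):
    """A delta event. op is 'insert' or 'delete'; deletes carry only the key."""
    op: str
    key: RowKey
    payload: Optional[str] = None


def update_as_events(old_key: RowKey, new_key: RowKey, new_payload: str) -> List[Event]:
    """Model an update as delete(old PK) + insert(new internal PK)."""
    return [Event("delete", old_key), Event("insert", new_key, new_payload)]


def merge_snapshot(
    base: Iterable[Tuple[RowKey, str]],
    deltas: Sequence[Event],
) -> Iterator[Tuple[RowKey, str]]:
    """Merge base rows and delta events without requiring any sort order."""
    # Pass 1: only delete events go into the hash table, and they hold keys only.
    deleted = {event.key for event in deltas if event.op == "delete"}

    # Pass 2: stream base rows and inserted rows in whatever order they arrive,
    # dropping every row whose key has a matching delete event.
    for key, payload in base:
        if key not in deleted:
            yield key, payload
    for event in deltas:
        if event.op == "insert" and event.key not in deleted:
            yield event.key, event.payload
```

Because every update produces a new key, each key is inserted at most once and
deleted at most once, so the order in which base and delta rows are read does
not affect the result. Memory use is proportional to the number of delete
events rather than to the size of the table.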



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
