[ https://issues.apache.org/jira/browse/HIVE-13479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eugene Koifman updated HIVE-13479:
----------------------------------
Remaining Estimate: 160h (was: 672h)
Original Estimate: 160h (was: 672h)
> Relax sorting requirement in ACID tables
> ----------------------------------------
>
> Key: HIVE-13479
> URL: https://issues.apache.org/jira/browse/HIVE-13479
> Project: Hive
> Issue Type: New Feature
> Components: Transactions
> Affects Versions: 1.2.0
> Reporter: Eugene Koifman
> Assignee: Eugene Koifman
> Original Estimate: 160h
> Remaining Estimate: 160h
>
> Currently ACID tables require data to be sorted according to the internal
> primary key. This is so that base + delta files can be efficiently
> sort/merged to produce the snapshot for the current transaction.
> This prevents the user from sorting the table on any other criteria, which
> can be useful. One example is dynamic partition insert (which also occurs
> for update/delete SQL). This may create lots of writers
> (buckets*partitions) and tax cluster resources.
> The usual solution is hive.optimize.sort.dynamic.partition=true, which is
> not honored for ACID tables.
> We could rely on a hash-table-based algorithm to merge delta files and then
> not require any particular sort order on ACID tables. One way to do that is
> to treat each update event as an insert (new internal PK) + delete (old PK).
> Delete events are very small since they only need to contain PKs, so the
> hash table would only need to hold delete events and would be reasonably
> memory efficient.
> This is a significant amount of work but worth doing.
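
To make the proposal concrete, here is a minimal sketch of the hash-based merge described above. It is not Hive code: the class HashMergeSketch, the RowKey fields (txnId, bucket, rowId) and the snapshot method are hypothetical names chosen for illustration, and the sketch ignores transaction visibility (it assumes every event it is given is already committed and visible to the reader).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

public class HashMergeSketch {
    /** Hypothetical stand-in for the internal primary key of an ACID row. */
    static final class RowKey {
        final long txnId;
        final int bucket;
        final long rowId;

        RowKey(long txnId, int bucket, long rowId) {
            this.txnId = txnId;
            this.bucket = bucket;
            this.rowId = rowId;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof RowKey)) {
                return false;
            }
            RowKey k = (RowKey) o;
            return txnId == k.txnId && bucket == k.bucket && rowId == k.rowId;
        }

        @Override
        public int hashCode() {
            return Objects.hash(txnId, bucket, rowId);
        }
    }

    /** An insert event: the key plus the row payload. */
    static final class Row {
        final RowKey key;
        final String payload;

        Row(RowKey key, String payload) {
            this.key = key;
            this.payload = payload;
        }
    }

    /**
     * Builds the visible rows without requiring any sort order on the inserts.
     * Only the delete keys are held in memory; insert events are streamed.
     */
    static List<Row> snapshot(Iterable<Row> inserts, Iterable<RowKey> deletes) {
        Set<RowKey> deleted = new HashSet<>();
        for (RowKey k : deletes) {
            deleted.add(k);
        }
        List<Row> visible = new ArrayList<>();
        for (Row r : inserts) {
            if (!deleted.contains(r.key)) {
                visible.add(r);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        List<Row> inserts = new ArrayList<>();
        // Base rows, deliberately not sorted by key.
        inserts.add(new Row(new RowKey(1, 0, 1), "b"));
        inserts.add(new Row(new RowKey(1, 0, 0), "a"));
        // An update of row (1,0,0) in txn 2 becomes: insert new key + delete old key.
        inserts.add(new Row(new RowKey(2, 0, 0), "a2"));
        List<RowKey> deletes = new ArrayList<>();
        deletes.add(new RowKey(1, 0, 0));

        for (Row r : snapshot(inserts, deletes)) {
            System.out.println(r.payload); // prints "b" then "a2"
        }
    }
}
```

Memory use in this sketch grows with the number of delete events rather than with table size, which is the property the description relies on; a real reader would additionally have to filter events by transaction state.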
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)