[ 
https://issues.apache.org/jira/browse/HIVE-13479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795933#comment-16795933
 ] 

Abhishek Somani commented on HIVE-13479:
----------------------------------------

What I meant is: now that ACID v2 has been implemented, do we plan to work on 
relaxing the sorting requirement? As far as I know, we still enforce that the 
rows be sorted on the ACID columns (row id), so that the reader can sort-merge 
the delete events with the insert events while reading. Isn't that right?
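
To make the question concrete, here is a minimal sketch of the kind of 
sort-merge I have in mind. It is not Hive's reader code: the key layout 
(write id, bucket, row id) is a simplified stand-in for the ACID columns, and 
the function name is made up for illustration. It only shows why both streams 
need to be sorted on the same key.

```python
from typing import Any, Iterable, Iterator, Tuple

# Hypothetical, simplified stand-in for the ACID row identifier.
RowKey = Tuple[int, int, int]  # (write_id, bucket, row_id)


def merge_apply_deletes(
    inserts: Iterable[Tuple[RowKey, Any]],
    deletes: Iterable[RowKey],
) -> Iterator[Tuple[RowKey, Any]]:
    """Yield insert rows that have no matching delete event.

    Both inputs must already be sorted ascending on RowKey; that is the
    sorting requirement under discussion.
    """
    delete_iter = iter(deletes)
    current_delete = next(delete_iter, None)
    for key, payload in inserts:
        # Advance the delete cursor past keys smaller than this insert key.
        while current_delete is not None and current_delete < key:
            current_delete = next(delete_iter, None)
        if current_delete is not None and current_delete == key:
            continue  # row was deleted, skip it
        yield key, payload
```

Each stream is read once, front to back, and nothing but the two cursors is 
held in memory.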

If so, the only way to have data sorted on another, user-specified column seems 
to be to initially insert the data ordered on that column, so that it is sorted 
on BOTH the ACID columns and the user-specified column.

If, however, we were able to relax the requirement that data HAS to be sorted 
on the ACID columns, we could use something like compaction to sort the data on 
user-desired columns in the background. Theoretically one could do such sorting 
in compaction even today. But as long as the sorting requirement is not 
relaxed, the data would have to be sorted on both the row ids and the user 
column. For that, compaction would need to behave like an insert overwrite and 
generate new row ids, so that the data is sorted on both the (new) row id 
columns and the user-specified column. It would be good to avoid that.
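
A rough sketch of what that insert-overwrite style rewrite would amount to, 
purely for illustration. This is not how Hive compaction works today; the 
function name, its parameters, and the (write id, bucket, row id) key layout 
are all assumptions.

```python
from typing import Any, Callable, List, Tuple

# Hypothetical, simplified stand-in for the ACID row identifier.
RowKey = Tuple[int, int, int]  # (write_id, bucket, row_id)


def rewrite_sorted(
    rows: List[Tuple[RowKey, Any]],
    user_key: Callable[[Any], Any],
    new_write_id: int,
    bucket: int = 0,
) -> List[Tuple[RowKey, Any]]:
    """Sort rows on a user column and hand out fresh, ascending row ids.

    The old row ids are discarded, so the output is ordered on both the
    new row id and the user column.
    """
    ordered = sorted(rows, key=lambda row: user_key(row[1]))
    return [
        ((new_write_id, bucket, new_row_id), payload)
        for new_row_id, (_old_key, payload) in enumerate(ordered)
    ]
```

The cost is visible in the sketch: every row gets a new identity, so anything 
that still refers to the old row ids no longer matches.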

Have I understood this correctly?

> Relax sorting requirement in ACID tables
> ----------------------------------------
>
>                 Key: HIVE-13479
>                 URL: https://issues.apache.org/jira/browse/HIVE-13479
>             Project: Hive
>          Issue Type: New Feature
>          Components: Transactions
>    Affects Versions: 1.2.0
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>            Priority: Major
>   Original Estimate: 160h
>  Remaining Estimate: 160h
>
> Currently ACID tables require data to be sorted according to internal primary 
> key.  This is so that base + delta files can be efficiently sort/merged to 
> produce the snapshot for current transaction.
> This prevents the user from sorting the table on any other criteria, 
> which can be useful.  One example is using dynamic partition insert (which 
> also occurs for update/delete SQL).  This may create lots of writers 
> (buckets*partitions) and tax cluster resources.
> The usual solution is hive.optimize.sort.dynamic.partition=true which won't 
> be honored for ACID tables.
> We could rely on hash table based algorithm to merge delta files and then not 
> require any particular sort on Acid tables.  One way to do that is to treat 
> each update event as an Insert (new internal PK) + delete (old PK).  Delete 
> events are very small since they just need to contain PKs.  So the hash table 
> would just need to contain Delete events and be reasonably memory efficient.
> This is a significant amount of work but worth doing.
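
For comparison, a minimal sketch of the hash-table approach described in the 
issue, as I understand it. Again this is only an illustration under 
assumptions: the function names and the (write id, bucket, row id) key layout 
are hypothetical, not Hive APIs.

```python
from typing import Any, Iterable, List, Tuple

# Hypothetical, simplified stand-in for the ACID row identifier.
RowKey = Tuple[int, int, int]  # (write_id, bucket, row_id)


def apply_deletes_hashed(
    inserts: Iterable[Tuple[RowKey, Any]],
    deletes: Iterable[RowKey],
) -> List[Tuple[RowKey, Any]]:
    """Drop deleted rows using a hash set of delete keys.

    Neither input needs to be sorted; only the delete keys are held in
    memory, and the insert rows keep whatever order they arrive in.
    """
    deleted = set(deletes)
    return [(key, payload) for key, payload in inserts if key not in deleted]


def update_as_insert_plus_delete(
    old_key: RowKey, new_key: RowKey, new_payload: Any
) -> Tuple[Tuple[RowKey, Any], RowKey]:
    """Model an update as an insert under a new key plus a delete of the old key."""
    return (new_key, new_payload), old_key
```

Because the insert rows pass through in arrival order, the files could be 
sorted on a user column without affecting the merge.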



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
