[ 
https://issues.apache.org/jira/browse/HIVE-17328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman reassigned HIVE-17328:
-------------------------------------


> Remove special handling for Acid tables wherever possible
> ---------------------------------------------------------
>
>                 Key: HIVE-17328
>                 URL: https://issues.apache.org/jira/browse/HIVE-17328
>             Project: Hive
>          Issue Type: Improvement
>          Components: Transactions
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>
> There are various places in the code that do something like:
> if (acid update or delete) {
>   do something
> } else {
>   do something else
> }
> This complicates the code and means that the acid code path is not properly 
> tested in many new non-acid features or bug fixes.
> Some work to simplify this was done in HIVE-15844.
> SortedDynPartitionOptimizer has some special logic.
> ReduceSinkOperator relies on the partitioning columns for update/delete being 
> UDFToInteger(RecordIdentifier), which is set up in SemanticAnalyzer.  
> Consequently SemanticAnalyzer has special logic to set it up.
> FileSinkOperator has some specialization.
> AbstractCorrelationProcCtx makes changes specific to acid writes, setting 
> hive.optimize.reducededuplication.min.reducer=1.
> With acid 2.0 (HIVE-17089) a lot more of it can be simplified/removed.
> Generally, Acid Insert follows the same code path as regular insert except 
> that the writer in FileSinkOperator is Acid specific.
> So all the specialization is to route Update/Delete events to the right place.
> We can do the U=D+I (Update = Delete + Insert) split early in the operator 
> pipeline so that an Update is a Hive multi-insert with one leg being the 
> Insert leg and the other being the Delete leg (like the Merge statement).
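To make the U=D+I idea concrete, here is a minimal Java sketch of splitting one update into a delete leg and an insert leg. Every name in it (AcidEventSketch, UpdateSplitSketch, splitUpdate) is hypothetical and not a Hive class; a plain long stands in for RecordIdentifier, and the -1 placeholder for the new row's id is an assumption of the sketch, not Hive behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical event type for illustration only; not a Hive class.
final class AcidEventSketch {
    enum Kind { INSERT, DELETE }

    final Kind kind;
    final long rowId;      // stand-in for RecordIdentifier; -1 = "assigned later" (assumption)
    final String payload;  // new row data for INSERT, null for DELETE

    AcidEventSketch(Kind kind, long rowId, String payload) {
        this.kind = kind;
        this.rowId = rowId;
        this.payload = payload;
    }
}

final class UpdateSplitSketch {
    /** Rewrites an update of the row with id rowId into a delete leg plus an insert leg. */
    static List<AcidEventSketch> splitUpdate(long rowId, String newPayload) {
        List<AcidEventSketch> legs = new ArrayList<>();
        // Delete leg: only needs the identity of the old row.
        legs.add(new AcidEventSketch(AcidEventSketch.Kind.DELETE, rowId, null));
        // Insert leg: carries the new data and is handled like any other insert.
        legs.add(new AcidEventSketch(AcidEventSketch.Kind.INSERT, -1L, newPayload));
        return legs;
    }

    public static void main(String[] args) {
        for (AcidEventSketch e : splitUpdate(42L, "new value")) {
            System.out.println(e.kind + " rowId=" + e.rowId + " payload=" + e.payload);
        }
    }
}
```

Once the split happens this early, nothing downstream has to know that the two events came from an Update, which is the point of treating it as a multi-insert.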
> The Delete events themselves don't need to be routed in any particular way if 
> we always ship all delete_delta files for each split.  This is ok since 
> delete events are very small and highly compressible.  What is shipped is 
> independent of what needs to be loaded into memory.
> This would allow removing almost all special code paths.
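A rough Java sketch of the "ship everything, load only what is needed" idea follows. It assumes row ids can be compared as plain longs and that a split knows the minimum and maximum row id it covers; both are simplifications made for this sketch, and ShippedDeletesSketch and its methods are hypothetical names, not Hive APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; row ids are plain longs standing in for RecordIdentifier.
final class ShippedDeletesSketch {

    /** Of all shipped delete events, keeps in memory only those inside the split's row id range. */
    static Set<Long> loadRelevantDeletes(Collection<Long> shippedDeletes, long minRowId, long maxRowId) {
        Set<Long> inMemory = new HashSet<>();
        for (long id : shippedDeletes) {
            if (id >= minRowId && id <= maxRowId) {
                inMemory.add(id);
            }
        }
        return inMemory;
    }

    /** Returns the rows of the split that have no matching delete event. */
    static List<Long> applyDeletes(List<Long> splitRowIds, Set<Long> deletes) {
        List<Long> visible = new ArrayList<>();
        for (Long id : splitRowIds) {
            if (!deletes.contains(id)) {
                visible.add(id);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        List<Long> shipped = Arrays.asList(3L, 12L, 57L, 900L); // every delete event, unrouted
        List<Long> splitRows = Arrays.asList(10L, 11L, 12L, 13L, 14L);
        Set<Long> relevant = loadRelevantDeletes(shipped, 10L, 14L);
        System.out.println(applyDeletes(splitRows, relevant));
    }
}
```

The filter is what keeps shipping cost and memory cost separate: every split receives the full list, but only the events that can affect its rows are retained.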
> If need be, we can also have the compactor rewrite the delete files so that 
> the name of each file matches its contents, as if the files were bucketed 
> properly, and use that to reduce what needs to be shipped for each split.  
> This may help with some extreme cases where someone updates 1B rows.
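The compactor idea could look roughly like the Java sketch below: group delete events by the bucket of the row they delete, so that a split reading one bucket needs only the matching delete file. The class names and the "bucket_%05d" file-name pattern are illustrative assumptions of this sketch, not a statement of how the compactor names or writes files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical delete event: the bucket and row id of the deleted row.
final class DeleteEventSketch {
    final int bucket;
    final long rowId;

    DeleteEventSketch(int bucket, long rowId) {
        this.bucket = bucket;
        this.rowId = rowId;
    }
}

final class DeleteRebucketSketch {
    /** Illustrative file name for a bucket; the pattern is an assumption of this sketch. */
    static String fileNameFor(int bucket) {
        return String.format("bucket_%05d", bucket);
    }

    /** Groups delete events so that each output file holds only events for its own bucket. */
    static Map<String, List<DeleteEventSketch>> rebucket(List<DeleteEventSketch> events) {
        Map<String, List<DeleteEventSketch>> files = new TreeMap<>();
        for (DeleteEventSketch e : events) {
            files.computeIfAbsent(fileNameFor(e.bucket), k -> new ArrayList<>()).add(e);
        }
        return files;
    }

    public static void main(String[] args) {
        List<DeleteEventSketch> events = new ArrayList<>();
        events.add(new DeleteEventSketch(0, 1L));
        events.add(new DeleteEventSketch(1, 2L));
        events.add(new DeleteEventSketch(0, 3L));
        for (Map.Entry<String, List<DeleteEventSketch>> f : rebucket(events).entrySet()) {
            System.out.println(f.getKey() + " holds " + f.getValue().size() + " delete event(s)");
        }
    }
}
```

After such a rewrite, a split over bucket N would only need the delete file whose name maps to bucket N, which is what shrinks the shipped data in the 1B-row case.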



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
