[ https://issues.apache.org/jira/browse/HIVE-17328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eugene Koifman reassigned HIVE-17328:
-------------------------------------
> Remove special handling for Acid tables wherever possible
> ---------------------------------------------------------
>
> Key: HIVE-17328
> URL: https://issues.apache.org/jira/browse/HIVE-17328
> Project: Hive
> Issue Type: Improvement
> Components: Transactions
> Reporter: Eugene Koifman
> Assignee: Eugene Koifman
>
> There are various places in the code that do something like
> if (acid update or delete) {
>   do something
> } else {
>   do something else
> }
> This complicates the code and means that the Acid code path is not properly
> tested when new non-Acid features or bug fixes are added.
> Some work to simplify this was done in HIVE-15844.
> SortedDynPartitionOptimizer has some special logic.
> ReduceSinkOperator relies on the partitioning columns for update/delete being
> UDFToInteger(RecordIdentifier), which is set up in SemanticAnalyzer.
> Consequently, SemanticAnalyzer has special logic to set it up.
> FileSinkOperator has some specialization.
> AbstractCorrelationProcCtx makes changes specific to Acid writes, setting
> hive.optimize.reducededuplication.min.reducer=1
> With Acid 2.0 (HIVE-17089) a lot more of it can be simplified/removed.
> Generally, Acid Insert follows the same code path as regular insert except
> that the writer in FileSinkOperator is Acid specific.
> So all the specialization exists to route Update/Delete events to the right
> place. We can do the U=D+I split (Update = Delete + Insert) early in the
> operator pipeline so that an Update is a Hive multi-insert with one leg being
> the Insert leg and the other being the Delete leg (like a Merge statement).
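
The U=D+I idea above can be pictured with a minimal Java sketch. Nothing here is Hive code: `UpdateSplitSketch`, `Event`, `Kind` and `splitUpdate` are hypothetical names, and a plain `long` stands in for RecordIdentifier. It only shows one Update turning into a Delete event for the old row plus an Insert event for the new row, so that each can follow its own leg.

```java
import java.util.ArrayList;
import java.util.List;

public class UpdateSplitSketch {

    enum Kind { INSERT, DELETE }

    /** Hypothetical row-level event; not a Hive class. */
    static final class Event {
        final Kind kind;
        final long rowId;     // assumption: a long stands in for RecordIdentifier
        final String payload; // new column values; null for a delete event

        Event(Kind kind, long rowId, String payload) {
            this.kind = kind;
            this.rowId = rowId;
            this.payload = payload;
        }
    }

    /**
     * U = D + I: one update becomes a delete of the old row version
     * followed by an insert of the new row version.
     */
    static List<Event> splitUpdate(long oldRowId, long newRowId, String newPayload) {
        List<Event> events = new ArrayList<>(2);
        events.add(new Event(Kind.DELETE, oldRowId, null));       // Delete leg
        events.add(new Event(Kind.INSERT, newRowId, newPayload)); // Insert leg
        return events;
    }
}
```

Once the split has happened, downstream operators would only ever see plain inserts and plain deletes, which is what lets the update-specific branches go away.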
> The Delete events themselves don't need to be routed in any particular way if
> we always ship all delete_delta files for each split. This is ok since
> delete events are very small and highly compressible. What is shipped is
> independent of what needs to be loaded into memory.
> This would allow removing almost all special code paths.
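
The distinction between what is shipped and what is loaded can be sketched as below. This is an illustration under stated assumptions, not Hive's reader logic: the class and method names are hypothetical, directories are matched only by a `delete_delta_` name prefix, and a single row-id range stands in for the real record keys of a split.

```java
import java.util.ArrayList;
import java.util.List;

public class DeleteDeltaShippingSketch {

    /** Every split receives every delete_delta directory; no routing decision is made. */
    static List<String> deleteDeltasToShip(List<String> partitionDirs) {
        List<String> shipped = new ArrayList<>();
        for (String dir : partitionDirs) {
            if (dir.startsWith("delete_delta_")) {
                shipped.add(dir);
            }
        }
        return shipped;
    }

    /**
     * Of the shipped delete events, keep in memory only those that fall
     * inside the row-id range covered by the split (a simplification).
     */
    static List<Long> deletesToLoad(List<Long> shippedDeleteRowIds, long minRowId, long maxRowId) {
        List<Long> loaded = new ArrayList<>();
        for (long id : shippedDeleteRowIds) {
            if (id >= minRowId && id <= maxRowId) {
                loaded.add(id);
            }
        }
        return loaded;
    }
}
```

The first method depends only on the directory listing, while the second depends on the split, which is the sense in which shipping is independent of loading.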
> If need be, we can also have the compactor rewrite the delete files so that
> the name of each file matches its contents, as if the files were bucketed
> properly, and use that to reduce what needs to be shipped for each split.
> This may help with some extreme cases where someone updates 1B rows.
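
A rough sketch of the compactor idea above, assuming each delete event carries the bucket id of the row it deletes. `DeleteRebucketSketch`, `DeleteEvent` and `rebucket` are hypothetical names, and the `bucket_%05d` file name pattern is used only for illustration. Events are regrouped so that each output file name matches the bucket of its contents, and a split for bucket N would then need only that one file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DeleteRebucketSketch {

    /** Hypothetical delete event; not a Hive class. */
    static final class DeleteEvent {
        final int bucketId; // assumption: bucket of the row being deleted
        final long rowId;

        DeleteEvent(int bucketId, long rowId) {
            this.bucketId = bucketId;
            this.rowId = rowId;
        }
    }

    /** Groups delete events into files named after the bucket they belong to. */
    static Map<String, List<DeleteEvent>> rebucket(List<DeleteEvent> events) {
        Map<String, List<DeleteEvent>> files = new TreeMap<>();
        for (DeleteEvent e : events) {
            String fileName = String.format("bucket_%05d", e.bucketId);
            files.computeIfAbsent(fileName, k -> new ArrayList<>()).add(e);
        }
        return files;
    }
}
```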