[
https://issues.apache.org/jira/browse/HIVE-22977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052398#comment-17052398
]
Gopal Vijayaraghavan edited comment on HIVE-22977 at 3/5/20, 6:15 PM:
----------------------------------------------------------------------
This is most likely not an optimization & might make read queries worse.
{code}
HIVE_ORC_BASE_DELTA_RATIO("hive.exec.orc.base.delta.ratio", 8,
    "The ratio of base writer and\n" +
    "delta writer in terms of STRIPE_SIZE and BUFFER_SIZE."),
HIVE_ORC_DELTA_STREAMING_OPTIMIZATIONS_ENABLED("hive.exec.orc.delta.streaming.optimizations.enabled", false,
    "Whether to enable streaming optimizations for ORC delta files. This will disable ORC's internal indexes,\n" +
    "disable compression, enable fast encoding and disable dictionary encoding."),
{code}
https://github.com/apache/hive/blob/master/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L2043
The stripe sizing for the deltas is 8x smaller than for the regular base files,
on the assumption that a compactor will fix it after the inserts are done.
Merging the deltas as-is would make the bad striping permanent.
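To put rough numbers on the sizing, here is a minimal sketch of the arithmetic. It assumes ORC base-file defaults of a 64 MB stripe and a 256 KB buffer, which are not stated in this comment, and it assumes the delta sizes are simply the base sizes divided by hive.exec.orc.base.delta.ratio. DeltaSizingSketch and deltaSize are hypothetical names for illustration, not Hive code.

```java
// Hypothetical illustration only; this is not Hive's writer code.
final class DeltaSizingSketch {
    // Assumed ORC defaults for base files (not taken from this thread).
    static final long BASE_STRIPE_SIZE = 64L * 1024 * 1024; // 64 MB
    static final long BASE_BUFFER_SIZE = 256L * 1024;       // 256 KB

    // Scale a base-writer size down by hive.exec.orc.base.delta.ratio.
    static long deltaSize(long baseSize, int baseDeltaRatio) {
        if (baseDeltaRatio <= 0) {
            throw new IllegalArgumentException("ratio must be positive");
        }
        return baseSize / baseDeltaRatio;
    }

    public static void main(String[] args) {
        int ratio = 8; // default of hive.exec.orc.base.delta.ratio quoted above
        System.out.println("delta stripe size (bytes): "
                + deltaSize(BASE_STRIPE_SIZE, ratio));
        System.out.println("delta buffer size (bytes): "
                + deltaSize(BASE_BUFFER_SIZE, ratio));
    }
}
```

Under those assumptions a delta stripe is about 8 MB, and concatenating such stripes without rewriting them leaves the merged file with many small stripes.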
The streaming inserts do not write any ORC indexes for the same reason: it makes
streaming faster, on the assumption that a compactor will rebuild the
min/max/bloom indexes when it runs asynchronously in the background. Merging
stripes without rebuilding the indexes will leave the compacted data with no
ability to do predicate push-down.
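Why the missing indexes matter can be sketched as follows. This is a simplified model of min/max based skipping for an equality predicate, not the ORC reader's actual API. StripeSkipSketch, MinMax and stripesToRead are hypothetical names, and a null entry stands in for a stripe that was written without statistics.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not the ORC reader's API.
final class StripeSkipSketch {
    // Min/max statistics for one column within one stripe.
    static final class MinMax {
        final long min;
        final long max;

        MinMax(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    // Returns the positions of the stripes that must be read to evaluate
    // "col = value". A stripe can be skipped only if statistics exist and
    // the value falls outside [min, max]; a null entry (no statistics)
    // always has to be read.
    static List<Integer> stripesToRead(List<MinMax> stats, long value) {
        List<Integer> toRead = new ArrayList<>();
        for (int i = 0; i < stats.size(); i++) {
            MinMax s = stats.get(i);
            boolean canSkip = s != null && (value < s.min || value > s.max);
            if (!canSkip) {
                toRead.add(i);
            }
        }
        return toRead;
    }
}
```

With statistics present, a selective predicate touches only the stripes whose range covers the value. With every entry null, as in a merged file whose indexes were never rebuilt, every stripe is read.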
The 10% of data in deltas can behave below par for read throughput, but making
these two problems (small stripes and missing indexes) permanent by running
MergeTask instead is probably going to make the compactor faster and everything
else slower.
> Merge delta files instead of running a query in major/minor compaction
> ----------------------------------------------------------------------
>
> Key: HIVE-22977
> URL: https://issues.apache.org/jira/browse/HIVE-22977
> Project: Hive
> Issue Type: Improvement
> Reporter: László Pintér
> Assignee: László Pintér
> Priority: Major
> Attachments: HIVE-22977.01.patch, HIVE-22977.02.patch
>
>
> [Compaction Optimization]
> We should analyse the possibility of moving a delta file instead of running a
> major/minor compaction query.
> Please consider the following use cases:
> - full acid table but only insert queries were run. This means that no
> delete delta directories were created. Is it possible to merge the delta
> directory contents without running a compaction query?
> - full acid table, initiating queries through the streaming API. If there
> are no aborted transactions during the streaming, is it possible to merge the
> delta directory contents without running a compaction query?