[
https://issues.apache.org/jira/browse/HIVE-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13805650#comment-13805650
]
Prasanth J commented on HIVE-5632:
----------------------------------
[~ehans] Thanks for taking a look at this patch. I will address your review
comments in the next patch.
Regarding your question above,
ORC already stores hierarchical min/max metadata. At the lowest level, ORC
stores min/max values for every 10,000 rows by default (called a rowgroup). The
rowgroup size can be configured with the table property "orc.row.index.stride".
At a higher level, HIVE-5562 adds min/max metadata at the stripe level.
File-level min/max values are also stored in the file footer.
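
To make the three levels concrete, here is a rough sketch of how min/max
values roll up from rowgroups to stripes to the file. It is not the ORC writer
code: it assumes a single integer column, the function names are made up for
illustration, and the stride is 3 instead of 10,000 to keep the data small.

```python
from typing import List, Tuple

MinMax = Tuple[int, int]  # (minimum, maximum) of one column over some rows


def rowgroup_stats(values: List[int], stride: int) -> List[MinMax]:
    """Min/max for each consecutive run of `stride` values (one rowgroup each)."""
    return [
        (min(values[i:i + stride]), max(values[i:i + stride]))
        for i in range(0, len(values), stride)
    ]


def merge_stats(stats: List[MinMax]) -> MinMax:
    """Roll lower-level min/max entries up into a single higher-level entry."""
    return (min(lo for lo, _ in stats), max(hi for _, hi in stats))


# Toy column: 2 stripes of 6 rows each (hypothetical data).
stripes = [[5, 1, 9, 20, 14, 11], [30, 42, 37, 58, 50, 61]]

per_rowgroup = [rowgroup_stats(s, stride=3) for s in stripes]  # lowest level
per_stripe = [merge_stats(rg) for rg in per_rowgroup]          # stripe level
per_file = merge_stats(per_stripe)                             # file level
```

Each higher level is just the min of the mins and the max of the maxes below
it, so the coarser levels cost very little extra space.
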
Stripe-level stats are stored in the file footer, so stripes that do not satisfy
the predicates can be skipped while computing the splits. Skipping at the
rowgroup level, however, requires each stripe to be read and kept in memory.
Since we read the entire stripe into memory, I am not sure that adding another
level of min/max metadata (per 1 million rows) would be beneficial, as the
skipping happens in memory.
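
The two phases can be sketched roughly as follows. This is an illustration
only, not the code in the patch: it assumes a single integer column and a
range predicate lo <= col <= hi, and the names may_match, plan_reads and
rg_stats_by_stripe are hypothetical.

```python
from typing import Dict, List, Tuple

MinMax = Tuple[int, int]  # (minimum, maximum) of one column over some rows


def may_match(stats: MinMax, lo: int, hi: int) -> bool:
    """True if some value in [lo, hi] could lie within the [min, max] range."""
    smin, smax = stats
    return smax >= lo and smin <= hi


def plan_reads(
    stripe_stats: List[MinMax],
    rg_stats_by_stripe: List[List[MinMax]],
    lo: int,
    hi: int,
) -> Dict[int, List[int]]:
    """Map each surviving stripe index to the rowgroup indexes worth scanning."""
    plan: Dict[int, List[int]] = {}
    for s, stats in enumerate(stripe_stats):
        # Phase 1: decided from footer stats alone, before the stripe is read.
        if not may_match(stats, lo, hi):
            continue
        # Phase 2: rowgroup stats are only available once the stripe is read.
        plan[s] = [
            r for r, rg in enumerate(rg_stats_by_stripe[s])
            if may_match(rg, lo, hi)
        ]
    return plan


# Hypothetical stats for a file with 2 stripes of 2 rowgroups each.
stripe_stats = [(1, 20), (30, 61)]
rg_stats_by_stripe = [[(1, 9), (11, 20)], [(30, 42), (50, 61)]]

plan = plan_reads(stripe_stats, rg_stats_by_stripe, lo=35, hi=40)
```

With these numbers, stripe 0 is dropped in phase 1 and only the first rowgroup
of stripe 1 is kept in phase 2. Min/max stats can only rule rows out, so a
surviving rowgroup still has to be filtered row by row.
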
Both rowgroup elimination and stripe elimination are turned on with the Hive
config "SET hive.optimize.index.filter=true;".
> Eliminate splits based on SARGs using stripe statistics in ORC
> --------------------------------------------------------------
>
> Key: HIVE-5632
> URL: https://issues.apache.org/jira/browse/HIVE-5632
> Project: Hive
> Issue Type: Improvement
> Affects Versions: 0.13.0
> Reporter: Prasanth J
> Assignee: Prasanth J
> Labels: orcfile
> Attachments: HIVE-5632.1.patch.txt, HIVE-5632.2.patch.txt,
> orc_split_elim.orc
>
>
> HIVE-5562 provides stripe level statistics in ORC. Stripe level statistics
> combined with predicate pushdown in ORC (HIVE-4246) can be used to eliminate
> the stripes (and thereby the splits) that do not satisfy the predicate condition.
> This can greatly reduce unnecessary reads.