[ https://issues.apache.org/jira/browse/HIVE-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799459#comment-13799459 ]
Prasanth J commented on HIVE-5483:
----------------------------------
[~ashutoshc] In scenarios where metastore column stats are not available, do
you think we can fall back to the file format and see if it exposes column-level
statistics? The ORC reader provides an interface for column statistics. To make it
more generic, I think we can add a new interface like StatsProvidingRecordReader,
implementations of which would expose file/column statistics. We can fall back
to this record reader if the metastore stats are not available or are
stale. Since there are two sources of truth (file and metastore), there are two
possibilities:
1) Check the metastore; if column stats are not available, fall back to the
file format.
2) Keep the metastore as the only source of truth and make sure it is always
consistent with the underlying file format (currently we don't ensure this).
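Option 1 could look roughly like the sketch below. Only the name
StatsProvidingRecordReader comes from the proposal above; its method, the
ColumnStats and MetastoreStats types, and StatsResolver are hypothetical names
made up for illustration and are not existing Hive APIs.

```java
import java.util.Optional;

/** Hypothetical column-level statistics; fields are illustrative only. */
final class ColumnStats {
    final long min;
    final long max;
    final long numValues;

    ColumnStats(long min, long max, long numValues) {
        this.min = min;
        this.max = max;
        this.numValues = numValues;
    }
}

/** Assumed shape of the proposed interface; the method signature is a guess. */
interface StatsProvidingRecordReader {
    Optional<ColumnStats> getColumnStats(String column);
}

/** Hypothetical stand-in for a metastore column stats lookup. */
interface MetastoreStats {
    Optional<ColumnStats> getColumnStats(String column);

    boolean isStale(String column);
}

final class StatsResolver {
    /**
     * Option 1: prefer metastore stats; fall back to the file format
     * when they are missing or stale.
     */
    static Optional<ColumnStats> resolve(MetastoreStats metastore,
                                         StatsProvidingRecordReader reader,
                                         String column) {
        Optional<ColumnStats> fromMetastore = metastore.getColumnStats(column);
        if (fromMetastore.isPresent() && !metastore.isStale(column)) {
            return fromMetastore;
        }
        // May still be empty if the file format exposes no statistics.
        return reader.getColumnStats(column);
    }
}
```

With option 2 the fallback branch would go away, but every write path would
have to update the metastore stats.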
Another thing that can be fixed: there is some redundancy between the stats
gathered by the file format and those computed by the analyze command. If the
file format already gathers file-level and column-level statistics, then the
analyze command should read them from the file format, which is much cheaper
than recomputing them.
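As a rough illustration of why this is cheaper: if each file already carries
min/max/count for a column, analyze only needs to merge those per-file values
rather than scan rows. The sketch below assumes a long-typed column and that
every entry covers at least one row; FileColumnStats and StatsMerger are
hypothetical names, not Hive or ORC classes.

```java
import java.util.List;

/** Hypothetical per-file statistics for one long-typed column. */
final class FileColumnStats {
    final long min;
    final long max;
    final long count;

    FileColumnStats(long min, long max, long count) {
        this.min = min;
        this.max = max;
        this.count = count;
    }
}

final class StatsMerger {
    /** Merges per-file stats into table-level stats without reading any rows. */
    static FileColumnStats merge(List<FileColumnStats> perFile) {
        if (perFile.isEmpty()) {
            throw new IllegalArgumentException("no file statistics to merge");
        }
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        long count = 0;
        for (FileColumnStats s : perFile) {
            min = Math.min(min, s.min);
            max = Math.max(max, s.max);
            count += s.count;
        }
        return new FileColumnStats(min, max, count);
    }
}
```

The cost is proportional to the number of files, not the number of rows.
Statistics that are not mergeable this way (for example, number of distinct
values) would still need a different approach.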
> use metastore statistics to optimize max/min/etc. queries
> ---------------------------------------------------------
>
> Key: HIVE-5483
> URL: https://issues.apache.org/jira/browse/HIVE-5483
> Project: Hive
> Issue Type: Improvement
> Reporter: Sergey Shelukhin
> Assignee: Ashutosh Chauhan
> Attachments: HIVE-5483.patch
>
>
> We have discussed this a little bit.
> Hive can answer queries such as select max(c1) from t purely from the metastore
> using partition statistics, provided that we know the statistics are up to
> date.
> All data changes (e.g. adding new partitions) currently go through the
> metastore, so we can track whether statistics are up to date. If they are not,
> the queries will have to read data (at least for outdated partitions) until
> someone runs analyze table. We can also analyze new partitions after they are
> added, if that is configured/specified in the command.
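
The idea in the description can be sketched as follows. This is a simplified
illustration, not the attached patch: PartitionStats, its upToDate flag, and
MaxFromStats are hypothetical names, and the sketch gives up entirely when any
partition is outdated instead of reading only the outdated partitions.

```java
import java.util.List;
import java.util.OptionalLong;

/** Hypothetical per-partition statistics for column c1. */
final class PartitionStats {
    final String partitionName;
    final long maxC1;
    final boolean upToDate;

    PartitionStats(String partitionName, long maxC1, boolean upToDate) {
        this.partitionName = partitionName;
        this.maxC1 = maxC1;
        this.upToDate = upToDate;
    }
}

final class MaxFromStats {
    /**
     * Returns max(c1) computed from partition statistics alone, or empty
     * if any partition's statistics are outdated (the caller must then
     * read data) or if there are no partitions.
     */
    static OptionalLong tryMax(List<PartitionStats> partitions) {
        if (partitions.isEmpty()) {
            return OptionalLong.empty();
        }
        long max = Long.MIN_VALUE;
        for (PartitionStats p : partitions) {
            if (!p.upToDate) {
                return OptionalLong.empty();
            }
            max = Math.max(max, p.maxC1);
        }
        return OptionalLong.of(max);
    }
}
```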