[ 
https://issues.apache.org/jira/browse/HIVE-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799459#comment-13799459
 ] 

Prasanth J commented on HIVE-5483:
----------------------------------

[~ashutoshc] In scenarios where metastore column stats are not available, do 
you think we can fall back to the file format and see whether it exposes 
column-level statistics? The ORC reader provides an interface for column 
statistics. To make this more generic, I think we can add a new interface, 
such as StatsProvidingRecordReader, whose implementations expose file/column 
statistics. We can fall back to this record reader when the metastore stats 
are unavailable or stale. Since there are two sources of truth (file and 
metastore), there are two possibilities:
1) Check the metastore; if column stats are not available, fall back to the 
file format.
2) Keep the metastore as the only source of truth and make sure it is always 
consistent with the underlying file format (currently we do not ensure this). 
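
To make option 1 concrete, here is a rough sketch of what I have in mind. Only 
the name StatsProvidingRecordReader comes from the proposal above; ColumnStats, 
MetastoreStatsSource, StatsResolver and the getColumnStats signatures are 
hypothetical names made up for illustration, not existing Hive or ORC APIs.

```java
import java.util.Optional;

/** Hypothetical value type: min/max of one numeric column. */
final class ColumnStats {
    final long min;
    final long max;

    ColumnStats(long min, long max) {
        this.min = min;
        this.max = max;
    }
}

/** Interface name from the proposal; the method signature is an assumption. */
interface StatsProvidingRecordReader {
    /** Returns stats kept by the file format, or empty if it has none. */
    Optional<ColumnStats> getColumnStats(String column);
}

/** Hypothetical stand-in for a metastore lookup. */
interface MetastoreStatsSource {
    /** Assumed to return empty when stats are missing or known to be stale. */
    Optional<ColumnStats> getColumnStats(String table, String column);
}

/** Option 1: ask the metastore first, then fall back to the file format. */
final class StatsResolver {
    private final MetastoreStatsSource metastore;

    StatsResolver(MetastoreStatsSource metastore) {
        this.metastore = metastore;
    }

    Optional<ColumnStats> resolve(String table, String column,
                                  StatsProvidingRecordReader reader) {
        Optional<ColumnStats> fromMetastore =
            metastore.getColumnStats(table, column);
        if (fromMetastore.isPresent()) {
            return fromMetastore;
        }
        // Metastore had nothing usable, so try the file-level statistics.
        return reader.getColumnStats(column);
    }
}
```

If both sources come back empty, the caller would still have to scan the data 
as it does today.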

Another thing that can be fixed: there is some redundancy between the stats 
gathered by the file format and those computed by the analyze command. If the 
file format already gathers file-level and column-level statistics, then the 
analyze command should read them from the file format instead of recomputing 
them, since reading them is much cheaper than scanning the data again. 
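
As an illustration of why this is cheaper, the analyze command could combine 
per-file stats without reading any rows. The sketch below assumes each file 
exposes a row count, min and max for a numeric column; FileColumnStats and 
merge are hypothetical names, not part of Hive or ORC.

```java
import java.util.List;

/** Hypothetical per-file statistics for one numeric column. */
final class FileColumnStats {
    final long rowCount;
    final long min;
    final long max;

    FileColumnStats(long rowCount, long min, long max) {
        this.rowCount = rowCount;
        this.min = min;
        this.max = max;
    }

    /** Combines per-file stats into table-level stats without scanning rows. */
    static FileColumnStats merge(List<FileColumnStats> files) {
        if (files.isEmpty()) {
            throw new IllegalArgumentException("no file statistics to merge");
        }
        long rows = 0;
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (FileColumnStats f : files) {
            rows += f.rowCount;          // row counts add up
            min = Math.min(min, f.min);  // table min is the smallest file min
            max = Math.max(max, f.max);  // table max is the largest file max
        }
        return new FileColumnStats(rows, min, max);
    }
}
```

Row counts, min and max merge cleanly like this; a distinct-value count would 
not, because the same value can appear in several files.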

> use metastore statistics to optimize max/min/etc. queries
> ---------------------------------------------------------
>
>                 Key: HIVE-5483
>                 URL: https://issues.apache.org/jira/browse/HIVE-5483
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Sergey Shelukhin
>            Assignee: Ashutosh Chauhan
>         Attachments: HIVE-5483.patch
>
>
> We have discussed this a little bit.
> Hive can answer queries such as select max(c1) from t purely from metastore 
> using partition statistics, provided that we know the statistics are up to 
> date.
> All data changes (e.g. adding new partitions) currently go through metastore so 
> we can track up-to-date-ness. If they are not up-to-date, the queries will 
> have to read data (at least for outdated partitions) until someone runs 
> analyze table. We can also analyze new partitions after add, if that is 
> configured/specified in the command.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
