[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372482#comment-16372482
 ] 

Zoltan Haindrich commented on HIVE-18743:
-----------------------------------------

probably there was a time when this was more relevant...but it seems like this 
thing causes more problem than it fixes - and it just keeps coming back :)
so it might be easier to address the original problem differently ...if there 
is any....I think the original intention was that something wanted to skip the 
stats update....but afaik currently hive also sets explicitly this flag to 
prevent the metastore from interfering....so I guess that should leave the 
mostly unindended codepath-s ending up triggering this feature :D

> CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround 
> is buggy.
> ---------------------------------------------------------------------------------------
>
>                 Key: HIVE-18743
>                 URL: https://issues.apache.org/jira/browse/HIVE-18743
>             Project: Hive
>          Issue Type: Improvement
>          Components: Metastore
>    Affects Versions: 1.2.0, 1.1.0, 2.0.2, 3.0.0
>            Reporter: Alexander Behm
>            Assignee: Alexander Kolbasov
>            Priority: Major
>         Attachments: HIVE-18743.01.patch, HIVE-18743.02.patch, 
> HIVE-18743.03.patch
>
>
> When hive.stats.autogather=true then the Metastore lists all files under the 
> table directory to populate basic stats like file counts and sizes. This file 
> listing operation can be very expensive particularly on filesystems like S3.
> One way to address this issue is to reconfigure hive.stats.autogather=false.
> *Here's the bug*
> It is my understanding that the DO_NOT_UPDATE_STATS table property is 
> intended to selectively prevent this stats collection. Unfortunately, this 
> table property is checked *after* the expensive file listing operation, so 
> the DO_NOT_UPDATE_STATS does not seem to work as intended. See:
> https://github.com/apache/hive/blob/master/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/utils/MetaStoreUtils.java#L633
> Relevant code snippet:
> {code}
>   public static boolean updateTableStatsFast(Database db, Table tbl, 
> Warehouse wh,
>                                              boolean madeDir, boolean 
> forceRecompute, EnvironmentContext environmentContext) throws MetaException {
>     if (tbl.getPartitionKeysSize() == 0) {
>       // Update stats only when unpartitioned
>       FileStatus[] fileStatuses = wh.getFileStatusesForUnpartitionedTable(db, 
> tbl);
>       return updateTableStatsFast(tbl, fileStatuses, madeDir, forceRecompute, 
> environmentContext); <--- DO_NOT_UPDATE_STATS is checked in here after 
> wh.getFileStatusesForUnpartitionedTable() has already been called
>     } else {
>       return false;
>     }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to