[ 
https://issues.apache.org/jira/browse/HIVE-10631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531173#comment-14531173
 ] 

Dongwook Kwon commented on HIVE-10631:
--------------------------------------

This is just a test case to illustrate the impact of this bug, especially
in combination with HIVE-6727.

Let's say I have two sets of existing dummy locations for table
partitions, as follows:

s3://bucket/table/dummy/partition=1/
s3://bucket/table/dummy/partition=2/
s3://bucket/table/dummy/partition=3/

s3://bucket/warehouse/dummy/partition=1/
s3://bucket/warehouse/dummy/partition=2/
s3://bucket/warehouse/dummy/partition=3/

I then create an external table named "dummy" with table location
"s3://bucket/table/dummy/", with
"hive.metastore.warehouse.dir=s3://bucket/warehouse/" and
"hive.stats.autogather=true" set.

When this dummy table is created, HiveMetaStore recursively scans the
warehouse directory, not the table location, and then does nothing with
the data it collects.
So in edge cases where an external table is created under the above
conditions, especially with large existing partitions, table creation
takes considerably longer than with "hive.stats.autogather" turned off.
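
To make the wasted work concrete, here is a minimal Java sketch of the
flow as described in this issue. It is a simplified model, not the actual
Hive code: the class FastStatsSketch, the methods listFiles and
updateStatsFast, the in-memory "file system" map and the file names are
all hypothetical stand-ins, and the real method bodies in MetaStoreUtils
and Warehouse may differ. It only assumes what the report states: the
listing runs, and its result is ignored when newDir is true.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of the behavior reported in HIVE-10631 (NOT real Hive code).
class FastStatsSketch {

    // Counts how many files the (hypothetical) listing touched.
    static int filesListed = 0;

    // Hypothetical stand-in for Warehouse.getFileStatusesForUnpartitionedTable:
    // returns every file stored under the given location prefix.
    static List<String> listFiles(Map<String, List<String>> fs, String location) {
        List<String> found = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : fs.entrySet()) {
            if (e.getKey().startsWith(location)) {
                found.addAll(e.getValue());
            }
        }
        filesListed += found.size();
        return found;
    }

    // Models the reported flow: the listing always runs, but its result is
    // only turned into stats when newDir is false.
    static Map<String, String> updateStatsFast(Map<String, List<String>> fs,
                                               String location, boolean newDir) {
        Map<String, String> params = new HashMap<>();
        List<String> files = listFiles(fs, location);
        if (!newDir) {
            params.put("numFiles", Integer.toString(files.size()));
        }
        return params;
    }

    public static void main(String[] args) {
        // In-memory stand-in for the S3 layout used in this comment.
        Map<String, List<String>> fs = new HashMap<>();
        for (int p = 1; p <= 3; p++) {
            fs.put("s3://bucket/warehouse/dummy/partition=" + p + "/",
                   Arrays.asList("file-a", "file-b"));
        }
        // The partitioned-table branch always passes newDir = true.
        Map<String, String> stats =
            updateStatsFast(fs, "s3://bucket/warehouse/dummy/", true);
        System.out.println("files listed: " + filesListed + ", stats: " + stats);
    }
}
```

In this model every file under the warehouse path is listed while the
returned stats map stays empty, which is the pattern that becomes
expensive when the listing is a recursive S3 scan.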

> create_table_core method has invalid update for Fast Stats
> ----------------------------------------------------------
>
>                 Key: HIVE-10631
>                 URL: https://issues.apache.org/jira/browse/HIVE-10631
>             Project: Hive
>          Issue Type: Bug
>          Components: Metastore
>    Affects Versions: 0.13.0, 1.0.0
>            Reporter: Dongwook Kwon
>            Priority: Minor
>
> The HiveMetaStore.create_table_core method calls
> MetaStoreUtils.updateUnpartitionedTableStatsFast when hive.stats.autogather
> is on; however, for a partitioned table, this
> updateUnpartitionedTableStatsFast call scans the warehouse dir and doesn't
> seem to use the result.
> "Fast Stats" was implemented by HIVE-3959
> https://github.com/apache/hive/blob/branch-1.0/metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java#L1363
> From create_table_core method
> {code}
>         if (HiveConf.getBoolVar(hiveConf, HiveConf.ConfVars.HIVESTATSAUTOGATHER) &&
>             !MetaStoreUtils.isView(tbl)) {
>           if (tbl.getPartitionKeysSize() == 0)  { // Unpartitioned table
>             MetaStoreUtils.updateUnpartitionedTableStatsFast(db, tbl, wh, madeDir);
>           } else { // Partitioned table with no partitions.
>             MetaStoreUtils.updateUnpartitionedTableStatsFast(db, tbl, wh, true);
>           }
>         }
> {code}
> Particularly Line 1363: // Partitioned table with no partitions.
> {code}
> MetaStoreUtils.updateUnpartitionedTableStatsFast(db, tbl, wh, true);
> {code}
> This call ends up calling Warehouse.getFileStatusesForUnpartitionedTable,
> and the result is then ignored in the
> MetaStoreUtils.updateUnpartitionedTableStatsFast method because the newDir
> flag is always true.
> The impact of this bug is minor with an HDFS warehouse location
> (hive.metastore.warehouse.dir), but it could be big with an S3 warehouse
> location, especially for large existing partitions.
> The impact is also heightened by HIVE-6727 when the warehouse location is
> S3: basically it could recursively scan the wrong S3 directory and do
> nothing with the result. I will add more detail on these cases in the
> comments.



