[
https://issues.apache.org/jira/browse/HIVE-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302891#comment-14302891
]
Xin Hao commented on HIVE-9560:
-------------------------------
For example, we have an ORC table named 'item'.
(a) Before running 'analyze table item compute statistics;',
the 'rawDataSize' was '884720592'.
The result of 'describe extended item':
Detailed Table Information Table(tableName:item, dbName:bigbenchorc,
owner:root, createTime:1421984899, lastAccessTime:0, retention:0,
sd:StorageDescriptor(cols:[FieldSchema(name:i_item_sk, type:bigint,
comment:null), FieldSchema(name:i_item_id, type:string, comment:null),
FieldSchema(name:i_rec_start_date, type:string, comment:null),
FieldSchema(name:i_rec_end_date, type:string, comment:null),
FieldSchema(name:i_item_desc, type:string, comment:null),
FieldSchema(name:i_current_price, type:double, comment:null),
FieldSchema(name:i_wholesale_cost, type:double, comment:null),
FieldSchema(name:i_brand_id, type:int, comment:null), FieldSchema(name:i_brand,
type:string, comment:null), FieldSchema(name:i_class_id, type:int,
comment:null), FieldSchema(name:i_class, type:string, comment:null),
FieldSchema(name:i_category_id, type:int, comment:null),
FieldSchema(name:i_category, type:string, comment:null),
FieldSchema(name:i_manufact_id, type:int, comment:null),
FieldSchema(name:i_manufact, type:string, comment:null),
FieldSchema(name:i_size, type:string, comment:null),
FieldSchema(name:i_formulation, type:string, comment:null),
FieldSchema(name:i_color, type:string, comment:null), FieldSchema(name:i_units,
type:string, comment:null), FieldSchema(name:i_container, type:string,
comment:null), FieldSchema(name:i_manager_id, type:int, comment:null),
FieldSchema(name:i_product_name, type:string, comment:null)],
location:hdfs://bhx1:8020/user/hive/warehouse/bigbenchorc.db/item,
inputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
serializationLib:org.apache.hadoop.hive.ql.io.orc.OrcSerde,
parameters:{serialization.format=1}), bucketCols:[], sortCols:[],
parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[],
skewedColValueLocationMaps:{}), storedAsSubDirectories:false),
partitionKeys:[], parameters:{numFiles=4, transient_lastDdlTime=1421984899,
COLUMN_STATS_ACCURATE=true, totalSize=83267548, numRows=563518,
rawDataSize=884720592}, viewOriginalText:null, viewExpandedText:null,
tableType:MANAGED_TABLE)
Time taken: 0.527 seconds, Fetched: 24 row(s)
(b) After running 'analyze table item compute statistics;',
the 'rawDataSize' is changed to '0'.
The result of 'describe extended item':
Detailed Table Information Table(tableName:item, dbName:bigbenchorc,
owner:root, createTime:1421984899, lastAccessTime:0, retention:0,
sd:StorageDescriptor(cols:[FieldSchema(name:i_item_sk, type:bigint,
comment:null), FieldSchema(name:i_item_id, type:string, comment:null),
FieldSchema(name:i_rec_start_date, type:string, comment:null),
FieldSchema(name:i_rec_end_date, type:string, comment:null),
FieldSchema(name:i_item_desc, type:string, comment:null),
FieldSchema(name:i_current_price, type:double, comment:null),
FieldSchema(name:i_wholesale_cost, type:double, comment:null),
FieldSchema(name:i_brand_id, type:int, comment:null), FieldSchema(name:i_brand,
type:string, comment:null), FieldSchema(name:i_class_id, type:int,
comment:null), FieldSchema(name:i_class, type:string, comment:null),
FieldSchema(name:i_category_id, type:int, comment:null),
FieldSchema(name:i_category, type:string, comment:null),
FieldSchema(name:i_manufact_id, type:int, comment:null),
FieldSchema(name:i_manufact, type:string, comment:null),
FieldSchema(name:i_size, type:string, comment:null),
FieldSchema(name:i_formulation, type:string, comment:null),
FieldSchema(name:i_color, type:string, comment:null), FieldSchema(name:i_units,
type:string, comment:null), FieldSchema(name:i_container, type:string,
comment:null), FieldSchema(name:i_manager_id, type:int, comment:null),
FieldSchema(name:i_product_name, type:string, comment:null)],
location:hdfs://bhx1:8020/user/hive/warehouse/bigbenchorc.db/item,
inputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
serializationLib:org.apache.hadoop.hive.ql.io.orc.OrcSerde,
parameters:{serialization.format=1}), bucketCols:[], sortCols:[],
parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[],
skewedColValueLocationMaps:{}), storedAsSubDirectories:false),
partitionKeys:[], parameters:{numFiles=4, transient_lastDdlTime=1421984899,
COLUMN_STATS_ACCURATE=true, totalSize=83267548, numRows=563518,
rawDataSize=884720592}, viewOriginalText:null, viewExpandedText:null,
tableType:MANAGED_TABLE)
Time taken: 0.527 seconds, Fetched: 24 row(s)
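The value of interest is buried in the table-level parameters map near the end of each output, so it is easy to misread when comparing by eye. As a minimal sketch (not part of Hive; the function name raw_data_size is hypothetical), the following Python helper pulls 'rawDataSize' out of pasted 'describe extended' text so the before and after values can be compared directly. It assumes the key appears as 'rawDataSize=<digits>', as in the output above.

```python
import re


def raw_data_size(describe_output):
    """Return rawDataSize from 'describe extended' text, or None if absent.

    Hypothetical helper: it only searches the pasted text and does not
    talk to Hive or the metastore.
    """
    # The parameters map is printed as key=value pairs, e.g.
    # "numRows=563518, rawDataSize=884720592}".
    match = re.search(r"\brawDataSize=(\d+)", describe_output)
    return int(match.group(1)) if match else None


# Fragment copied from the tail of the output in (a).
sample = (
    "COLUMN_STATS_ACCURATE=true, totalSize=83267548, numRows=563518,\n"
    "rawDataSize=884720592}, viewOriginalText:null, viewExpandedText:null,"
)
before = raw_data_size(sample)
```

Running the same function over the output captured after 'analyze table' gives the second value; a result of 0 there, with a non-zero value before, matches the behaviour reported in this issue.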
> When hive.stats.collect.rawdatasize=true, 'rawDataSize' for an ORC table will
> result in value '0' after running 'analyze table TABLE_NAME compute
> statistics;'
> --------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-9560
> URL: https://issues.apache.org/jira/browse/HIVE-9560
> Project: Hive
> Issue Type: Bug
> Reporter: Xin Hao
>
> When hive.stats.collect.rawdatasize=true, 'rawDataSize' for an ORC table will
> result in value '0' after running 'analyze table TABLE_NAME compute
> statistics;'
> Steps to reproduce:
> (1) set hive.stats.collect.rawdatasize=true;
> (2) Generate an ORC table in Hive whose 'rawDataSize' is NOT zero.
> (3) Confirm the value of 'rawDataSize' (NOT zero) by executing 'describe
> extended TABLE_NAME;'
> (4) Execute 'analyze table TABLE_NAME compute statistics;'
> (5) Execute 'describe extended TABLE_NAME;' again; the value of
> 'rawDataSize' has now changed to '0'.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)