Aihua Xu created HIVE-20079:
-------------------------------

             Summary: Populate more accurate rawDataSize for parquet format
                 Key: HIVE-20079
                 URL: https://issues.apache.org/jira/browse/HIVE-20079
             Project: Hive
          Issue Type: Improvement
          Components: File Formats
    Affects Versions: 2.0.0
            Reporter: Aihua Xu
            Assignee: Aihua Xu

Run the following queries and you will see that rawDataSize for the table is incorrectly reported as 4, which is just the number of fields (2 rows x 2 columns) rather than a data size. We need to populate the correct data size so that the data can be split properly.

{noformat}
SET hive.stats.autogather=true;
CREATE TABLE parquet_stats (id int, str string) STORED AS PARQUET;
INSERT INTO parquet_stats VALUES (0, 'this is string 0'), (1, 'string 1');
DESC FORMATTED parquet_stats;
{noformat}

{noformat}
Table Parameters:
    COLUMN_STATS_ACCURATE   true
    numFiles                1
    numRows                 2
    rawDataSize             4
    totalSize               373
    transient_lastDdlTime   1530660523
{noformat}
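
To make the gap concrete, here is a rough sketch comparing the field count with a byte-based estimate for the two rows inserted above. The sizing rule (4 bytes per int, UTF-8 length per string) and the helper name `estimate_raw_data_size` are assumptions chosen for illustration only. They are not the formula Hive uses or the one a fix for this issue would necessarily adopt.

```python
# Illustrative only: the per-value sizes below are assumptions,
# not Hive's actual rawDataSize computation.
INT_BYTES = 4  # assumed width of a Hive int value

# The two rows from the INSERT statement in the report.
rows = [(0, "this is string 0"), (1, "string 1")]


def estimate_raw_data_size(rows):
    """Sum an assumed size per value: 4 bytes per int, UTF-8 length per string."""
    total = 0
    for row in rows:
        for value in row:
            if isinstance(value, int):
                total += INT_BYTES
            elif isinstance(value, str):
                total += len(value.encode("utf-8"))
            else:
                raise TypeError(f"unsupported value type: {type(value).__name__}")
    return total


# What the report shows being stored today: one unit per field value.
field_count = sum(len(row) for row in rows)

print("field count:", field_count)
print("estimated bytes:", estimate_raw_data_size(rows))
```

Under these assumptions the estimate comes to 32 bytes, against the reported value of 4, so a consumer that sizes splits from rawDataSize would underestimate the data by a wide margin even for this tiny table.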