[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16776925#comment-16776925 ]
BELUGA BEHR commented on HIVE-20079:
------------------------------------

This patch is still incorrect. It's actually producing the same wrong numbers as before, though perhaps a bit more efficiently.

{code}
totalSize += block.getTotalByteSize();
{code}

{{getTotalByteSize()}} is not the same as "rawDataSize".

bq. rawDataSize—Approximate size of data in memory
https://www.cloudera.com/documentation/enterprise/5-15-x/topics/admin_hos_tuning.html

That means that for a single table row with 4 INTs (values: 1,2,3,4) I would expect a rawDataSize of (4 bytes x 4 Java ints) = 16 bytes. However, Parquet would report this as 4 bytes because of the way that Parquet packs these numbers internal to its implementation.

Hive should take the row count and multiply it by the sizes of the row's data types. The {{AbstractSerDe}} class should have code to facilitate all of this, like {{readNumber()}}, {{readString(int numBytes)}}, etc., that can be called as each row is read.

> Populate more accurate rawDataSize for parquet format
> -----------------------------------------------------
>
>                 Key: HIVE-20079
>                 URL: https://issues.apache.org/jira/browse/HIVE-20079
>             Project: Hive
>          Issue Type: Improvement
>          Components: File Formats
>    Affects Versions: 2.0.0
>            Reporter: Aihua Xu
>            Assignee: Antal Sinkovits
>            Priority: Major
>         Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch,
> HIVE-20079.3.patch, HIVE-20079.4.patch, HIVE-20079.5.patch, HIVE-20079.6.patch
>
>
> Run the following queries and you will see that the rawDataSize for the table is
> incorrectly reported as 4 (that is, the number of fields). We need to populate the
> correct data size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles                1
>   numRows                 2
>   rawDataSize             4
>   totalSize               373
>   transient_lastDdlTime   1530660523
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
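
For illustration only, the per-row accounting suggested in the comment above could look roughly like the following. This is a minimal sketch, not Hive code: the class {{RawDataSizeSketch}} and its methods are hypothetical names loosely modelled on the {{readNumber()}} / {{readString(int numBytes)}} helpers the comment proposes, and it assumes that fixed-width numbers count at their Java primitive width and strings at their encoded byte length.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: none of these names exist in Hive or in AbstractSerDe.
// It accumulates an in-memory ("raw") size as values are read, one call per value.
final class RawDataSizeSketch {
    private long rawDataSize = 0L;

    // Record one fixed-width numeric value, e.g. Integer.BYTES for an INT column.
    void readNumber(int widthBytes) {
        rawDataSize += widthBytes;
    }

    // Record one string value by its length in bytes.
    void readString(int numBytes) {
        rawDataSize += numBytes;
    }

    // Convenience overload (an assumption of this sketch): size a string by its UTF-8 bytes.
    void readString(String value) {
        readString(value.getBytes(StandardCharsets.UTF_8).length);
    }

    long getRawDataSize() {
        return rawDataSize;
    }
}
```

Under these assumptions, one row of four INTs accumulates 4 x 4 = 16 bytes, and the two rows inserted into {{parquet_stats}} above accumulate (2 x 4) + 16 + 8 = 32 bytes, independent of how Parquet encodes the values on disk. The numbers Hive actually reports depend on its own object-size accounting, which this sketch does not model.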