[ https://issues.apache.org/jira/browse/HIVE-23054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107366#comment-17107366 ]
David Mollitor commented on HIVE-23054:
---------------------------------------

[~jcamachorodriguez] [~ashutoshc]

> Capture Total Byte Size in Column Statistics
> --------------------------------------------
>
>                 Key: HIVE-23054
>                 URL: https://issues.apache.org/jira/browse/HIVE-23054
>             Project: Hive
>          Issue Type: Improvement
>          Components: CBO, Statistics
>            Reporter: David Mollitor
>            Priority: Major
>
> Store a counter in HMS column statistics for the total number of raw bytes in
> each column.
> Right now, there is no good way to merge the average column length when
> performing an INSERT into a table. The code simply selects the maximum of the
> two averages. As a result, inserting a single record with a long length
> (128 bytes) into a table that has millions of strings with an average length
> of 4 boosts the average length for the entire data set to 128.
> {code:java}
> aggregateData.setAvgColLen(Math.max(aggregateData.getAvgColLen(),
>     newData.getAvgColLen()));
> {code}
> https://github.com/apache/hive/blob/e182d9ce6c09136d13ee889ef069b202f60052ec/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/columnstats/merge/StringColumnStatsMerger.java#L34
> Store the total raw size of all the data in each column. Given the total raw
> size and the average length on each side, one can compute the real average
> length when merging the existing data and the newly inserted data.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
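
The arithmetic behind the proposal in the description can be sketched as follows. This is only an illustration, not Hive code: the class `AvgColLenMergeSketch` and its methods are hypothetical, and it assumes each side of the merge carries a total raw byte count alongside its average length, so that the row count can be recovered as total bytes divided by average length.

```java
// Hypothetical sketch; none of these names exist in the Hive metastore.
final class AvgColLenMergeSketch {

    // Mirrors the current behavior quoted in the description: keep the larger average.
    static double mergeWithMax(double existingAvgLen, double newAvgLen) {
        return Math.max(existingAvgLen, newAvgLen);
    }

    // Assumed alternative: recover each side's row count from its total bytes
    // and average length, then divide the combined bytes by the combined rows.
    static double mergeAvgColLen(long existingTotalBytes, double existingAvgLen,
                                 long newTotalBytes, double newAvgLen) {
        double existingRows = existingAvgLen > 0 ? existingTotalBytes / existingAvgLen : 0.0;
        double newRows = newAvgLen > 0 ? newTotalBytes / newAvgLen : 0.0;
        double totalRows = existingRows + newRows;
        if (totalRows == 0.0) {
            return 0.0; // no rows with a measurable length on either side
        }
        return (existingTotalBytes + newTotalBytes) / totalRows;
    }

    public static void main(String[] args) {
        // One million strings averaging 4 bytes, plus a single 128-byte record.
        System.out.println(mergeWithMax(4.0, 128.0));
        System.out.println(mergeAvgColLen(4_000_000L, 4.0, 128L, 128.0));
    }
}
```

For the scenario in the description, the max-based merge reports 128, while the byte-weighted merge stays just above 4.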