David Mollitor created HIVE-23054:
-------------------------------------

             Summary: Capture Total Byte Size in Column Statistics
                 Key: HIVE-23054
                 URL: https://issues.apache.org/jira/browse/HIVE-23054
             Project: Hive
          Issue Type: Improvement
          Components: CBO, Statistics
            Reporter: David Mollitor


Store a counter in the HMS column statistics for the total number of raw bytes
in each column.

Right now, there is no good way to merge the average column length when 
performing an INSERT statement into a table.  The code simply selects the 
maximum of the two values.  However, if a single record with a long length 
(128 bytes) is inserted into a table that has millions of strings with an 
average length of 4, the average length for the entire data set gets boosted 
to 128.

{code:java}
aggregateData.setAvgColLen(Math.max(aggregateData.getAvgColLen(), 
newData.getAvgColLen()));
{code}

https://github.com/apache/hive/blob/e182d9ce6c09136d13ee889ef069b202f60052ec/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/columnstats/merge/StringColumnStatsMerger.java#L34

Store the total raw size of all the data in each column.  Given the total raw 
size and the average length, one can recover the number of values (total size 
divided by average length) and so compute the real average length when merging 
the existing data and the newly inserted data.
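
As a rough illustration of the arithmetic, here is a minimal sketch.  The class 
and method names are hypothetical and are not part of the HMS API; it assumes 
each side of the merge carries a total byte size and an average length, with 
the number of values derived from the two.

```java
// Hypothetical sketch; none of these names exist in Hive or the metastore.
final class AvgColLenMergeSketch {

    // Current behaviour described above: keep the larger of the two averages.
    static double mergeByMax(double avgA, double avgB) {
        return Math.max(avgA, avgB);
    }

    // Proposed behaviour: weight each side by its derived value count.
    // count = totalBytes / avgLen, so the merged average is
    // (totalA + totalB) / (countA + countB).
    // Assumption: a side with avgLen == 0 contributes no values, because its
    // count cannot be recovered from a zero total and a zero average.
    static double mergedAvgColLen(long totalA, double avgA,
                                  long totalB, double avgB) {
        double countA = avgA > 0 ? totalA / avgA : 0.0;
        double countB = avgB > 0 ? totalB / avgB : 0.0;
        double count = countA + countB;
        return count > 0 ? (totalA + totalB) / count : 0.0;
    }

    public static void main(String[] args) {
        // One million existing strings averaging 4 bytes, plus one 128-byte row.
        long existingTotal = 4_000_000L;
        double existingAvg = 4.0;
        long insertedTotal = 128L;
        double insertedAvg = 128.0;

        // Max-based merge jumps to the inserted row's length.
        System.out.println(mergeByMax(existingAvg, insertedAvg));

        // Byte-size-based merge stays very close to 4.
        System.out.println(
            mergedAvgColLen(existingTotal, existingAvg, insertedTotal, insertedAvg));
    }
}
```

With these inputs the max-based merge reports 128, while the byte-size-based 
merge stays just above 4, which is the behaviour this issue asks for.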



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
