[ 
https://issues.apache.org/jira/browse/HIVE-23054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107366#comment-17107366
 ] 

David Mollitor commented on HIVE-23054:
---------------------------------------

[~jcamachorodriguez] [~ashutoshc]

> Capture Total Byte Size in Column Statistics
> --------------------------------------------
>
>                 Key: HIVE-23054
>                 URL: https://issues.apache.org/jira/browse/HIVE-23054
>             Project: Hive
>          Issue Type: Improvement
>          Components: CBO, Statistics
>            Reporter: David Mollitor
>            Priority: Major
>
> Store a counter in the HMS column statistics for the total number of raw 
> bytes in each column.
> Right now, there is no good way to merge the average column length when 
> performing an INSERT statement into a table.  The code just selects the 
> maximum value, so inserting a single record with a long length (128 bytes) 
> into a table that has millions of strings with an average length of 4 
> boosts the average length for the entire data set to 128.
> {code:java}
> aggregateData.setAvgColLen(Math.max(aggregateData.getAvgColLen(), 
> newData.getAvgColLen()));
> {code}
> https://github.com/apache/hive/blob/e182d9ce6c09136d13ee889ef069b202f60052ec/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/columnstats/merge/StringColumnStatsMerger.java#L34
> Instead, store the total raw size of all the data in each column.  From the 
> total raw size and the average length, one can recover the number of values 
> (total size divided by average length) and so compute the real average 
> length when merging the existing data and the newly inserted data.
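
The arithmetic behind the proposal can be sketched as follows. This is only an illustration, not code from Hive: the class and method names are hypothetical, the per-column total byte counter is the statistic the issue proposes (it is assumed here rather than read from any existing metastore API), and "millions of strings" is taken as exactly 1,000,000 for the example.

```java
// Hypothetical sketch of a size-weighted merge of average column length.
// None of these names exist in Hive; they only illustrate the arithmetic.
final class AvgColLenMergeSketch {

    // Recover the number of values from the total raw size and the average
    // length: count = totalBytes / avgColLen (rounded to the nearest long).
    static long estimateCount(long totalBytes, double avgColLen) {
        return avgColLen <= 0.0 ? 0L : Math.round(totalBytes / avgColLen);
    }

    // Merge two (totalBytes, avgColLen) pairs into one true average length.
    static double mergeAvgColLen(long totalBytesA, double avgA,
                                 long totalBytesB, double avgB) {
        long count = estimateCount(totalBytesA, avgA)
                   + estimateCount(totalBytesB, avgB);
        if (count == 0L) {
            return 0.0; // no values on either side
        }
        return (double) (totalBytesA + totalBytesB) / count;
    }

    public static void main(String[] args) {
        // Existing data: assumed 1,000,000 strings with an average length of 4.
        long existingBytes = 4_000_000L;
        double existingAvg = 4.0;
        // Newly inserted data: a single 128-byte record.
        long newBytes = 128L;
        double newAvg = 128.0;

        // Behavior described in the issue: take the maximum of the averages.
        System.out.println("max-based merge:      " + Math.max(existingAvg, newAvg));
        // Size-weighted merge using the proposed total byte counter.
        System.out.println("size-weighted merge:  "
                + mergeAvgColLen(existingBytes, existingAvg, newBytes, newAvg));
    }
}
```

With these assumed inputs the max-based merge reports 128.0, while the size-weighted merge stays just above 4, which is the behavior the issue asks for.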



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
