histogram() UDAF for a numerical column
---------------------------------------
Key: HIVE-1397
URL: https://issues.apache.org/jira/browse/HIVE-1397
Project: Hadoop Hive
Issue Type: New Feature
Components: Query Processor
Affects Versions: 0.6.0
Reporter: Mayank Lahiri
Assignee: Mayank Lahiri
Fix For: 0.6.0

A histogram() UDAF to generate an approximate histogram of a numerical (byte, short, double, long, etc.) column. The result is returned as a map of (x,y) histogram pairs and can be plotted, for example, in Gnuplot using impulses.

The algorithm is currently adapted from "A streaming parallel decision tree algorithm" by Ben-Haim and Tom-Tov, JMLR 11 (2010), and uses space proportional to the number of histogram bins specified. It has no approximation guarantees, but seems to work well when there is a lot of data and a large number (e.g. 50-100) of histogram bins specified.

A typical call might be:

SELECT histogram(val, 10) FROM some_table;

where the result would be a histogram with 10 bins, returned as a Hive map object.
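
To make the bin-merging idea concrete, here is a minimal Python sketch of a streaming histogram in the style of Ben-Haim and Tom-Tov. It is not the Hive implementation: the class name `StreamingHistogram` and the methods `add`, `merge`, and `as_map` are hypothetical names chosen for illustration. The sketch keeps at most `num_bins` (x, y) pairs; whenever that limit is exceeded, it replaces the two closest bins with their count-weighted average, which is why space stays proportional to the number of bins.

```python
import bisect


class StreamingHistogram:
    """Approximate histogram holding at most `num_bins` (x, y) pairs.

    Illustrative sketch only; names and structure are assumptions,
    not the API of the UDAF proposed in this issue.
    """

    def __init__(self, num_bins):
        if num_bins < 1:
            raise ValueError("num_bins must be at least 1")
        self.num_bins = num_bins
        self.bins = []  # list of [x, y], kept sorted by x

    def _insert(self, x, y):
        # Place the point in sorted order, or add to an existing bin at x.
        xs = [b[0] for b in self.bins]
        i = bisect.bisect_left(xs, x)
        if i < len(self.bins) and self.bins[i][0] == x:
            self.bins[i][1] += y
        else:
            self.bins.insert(i, [float(x), float(y)])

    def _trim(self):
        # Repeatedly merge the two adjacent bins with the smallest gap.
        while len(self.bins) > self.num_bins:
            i = min(range(len(self.bins) - 1),
                    key=lambda k: self.bins[k + 1][0] - self.bins[k][0])
            (x1, y1), (x2, y2) = self.bins[i], self.bins[i + 1]
            total = y1 + y2
            self.bins[i:i + 2] = [[(x1 * y1 + x2 * y2) / total, total]]

    def add(self, value):
        """Fold one column value into the histogram."""
        self._insert(value, 1.0)
        self._trim()

    def merge(self, other):
        """Combine a partial histogram, e.g. one built on another split."""
        for x, y in other.bins:
            self._insert(x, y)
        self._trim()

    def as_map(self):
        """Return the histogram as a dict of x -> y."""
        return {x: y for x, y in self.bins}


h = StreamingHistogram(3)
for v in [1, 2, 3, 10, 11]:
    h.add(v)
print(h.as_map())  # {1.5: 2.0, 3.0: 1.0, 10.5: 2.0}
```

The `merge` step is what would let partial histograms built on separate splits be combined into one result, under the assumption that the aggregate is computed in a map/combine/reduce fashion. The linear scan for the closest pair keeps the sketch short; a production version would likely use a more efficient structure.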