histogram() UDAF for a numerical column
---------------------------------------

                 Key: HIVE-1397
                 URL: https://issues.apache.org/jira/browse/HIVE-1397
             Project: Hadoop Hive
          Issue Type: New Feature
          Components: Query Processor
    Affects Versions: 0.6.0
            Reporter: Mayank Lahiri
            Assignee: Mayank Lahiri
             Fix For: 0.6.0


Adds a histogram() UDAF that generates an approximate histogram of a numerical
(byte, short, double, long, etc.) column. The result is returned as a map of
(x,y) histogram pairs, which can be plotted, for example, in Gnuplot using
impulses. The algorithm is currently adapted from "A streaming parallel
decision tree algorithm" by Ben-Haim and Tom-Tov, JMLR 11 (2010), and uses
space proportional to the number of histogram bins specified. It has no
approximation guarantees, but seems to work well when there is a lot of data
and a large number of histogram bins (e.g. 50-100) is specified.
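
To illustrate why the space used stays proportional to the number of bins, 
here is a minimal Python sketch of the streaming update and merge steps as 
commonly described for the Ben-Haim and Tom-Tov histogram. It is not the code 
in the patch: the function names (update, merge, approximate_histogram) and the 
list-of-(centroid, count) representation are assumptions made for this 
example only.

```python
from bisect import insort


def _merge_closest(bins):
    """Merge, in place, the two adjacent bins whose centroids are closest."""
    i = min(range(len(bins) - 1), key=lambda k: bins[k + 1][0] - bins[k][0])
    (x1, c1), (x2, c2) = bins[i], bins[i + 1]
    total = c1 + c2
    # The merged bin sits at the count-weighted average of the two centroids.
    bins[i:i + 2] = [((x1 * c1 + x2 * c2) / total, total)]


def update(bins, value, max_bins):
    """Add one value to a sorted list of (centroid, count) bins."""
    # Each new value starts as its own bin with count 1.
    insort(bins, (float(value), 1))
    # Shrink back to the bin budget, so memory stays O(max_bins).
    while len(bins) > max_bins:
        _merge_closest(bins)


def merge(a, b, max_bins):
    """Combine two partial histograms, e.g. built on separate workers."""
    bins = sorted(a + b)
    while len(bins) > max_bins:
        _merge_closest(bins)
    return bins


def approximate_histogram(values, max_bins):
    """Build a histogram of at most max_bins bins in one pass over values."""
    bins = []
    for v in values:
        update(bins, v, max_bins)
    return bins
```

In this sketch, approximate_histogram([1, 2, 3, 10, 11, 12], 2) yields two 
bins, one centred at 2.0 and one at 11.0, each with a count of 3. The bin 
positions depend on the order in which values arrive, which is consistent with 
the lack of approximation guarantees noted above.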

A typical call might be:

SELECT histogram(val, 10) FROM some_table;

where the result would be a histogram with 10 bins, returned as a Hive map 
object.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
