= Writing GenericUDAFs: A Tutorial =

User-Defined Aggregation Functions (UDAFs) are an excellent way to integrate advanced data processing into Hive. Hive allows two varieties of UDAFs: simple and generic. Simple UDAFs, as the name implies, are rather simple to write, but incur performance penalties because of their use of [[http://java.sun.com/docs/books/tutorial/reflect/index.html | Java Reflection]] and do not allow features such as variable-length argument lists. Generic UDAFs support all of these features, but are perhaps not quite as intuitive to write as Simple UDAFs.

This tutorial walks through the development of the `histogram()` UDAF, which 
computes a histogram with a fixed, user-specified number of bins, using a 
constant amount of memory and time linear in the input size. It demonstrates a 
number of features of Generic UDAFs, such as a complex return type (an array of 
structures) and type checking on the input. The assumption is that the reader wants to write a UDAF for eventual submission to the Hive open-source project, so steps such as modifying the function registry in Hive and writing `.q` tests are also included. If you just want to write a UDAF and debug and deploy it locally, see [[http://wiki.apache.org/hadoop/Hive/HivePlugins | this page]].

'''NOTE:''' In this tutorial, we walk through the creation of a `histogram()` function. In releases of Hive after July 2010, it will appear as the built-in function `histogram_numeric()`.

<<TableOfContents(2)>>

== Preliminaries ==

Make sure you have the latest Hive trunk by running `svn up` in your Hive directory. More detailed instructions on downloading and setting up Hive can be found at [[http://wiki.apache.org/hadoop/Hive/GettingStarted | Getting Started]]. You should be able to run your local copy of Hive with `build/dist/bin/hive` from the Hive root directory, and you should have some tables of data loaded into your local instance for testing whatever UDAF you have in mind. For this example, assume that a table called `normal` exists with a single `double` column called `val`, containing a large number of random numbers drawn from the standard normal distribution.
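
As a concrete target, here is the kind of query the finished UDAF should support on such a table. This is only a hedged sketch: the exact signature (a column plus a requested number of bins) and the array-of-structs result are worked out over the course of the tutorial, mirroring the built-in `histogram_numeric()` mentioned above.

{{{
-- Illustrative only: summarize the val column as 10 (x, y) histogram bins,
-- returned as an array of structs.
SELECT histogram(val, 10) FROM normal;
}}}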

The files we will be editing or creating are as follows, relative to the Hive 
root:

|| '''File''' || '''Purpose''' ||
|| `ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogram.java` || the main source file, to be created by you. ||
|| `ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java` || the function registry source file, to be edited by you to register our new `histogram()` UDAF into Hive's built-in function list. ||
|| `ql/src/test/queries/clientpositive/udaf_histogram.q` || a file of sample queries for testing `histogram()` on sample data, to be created by you. ||
|| `ql/src/test/results/clientpositive/udaf_histogram.q.out` || the expected output from your sample queries, to be created by `ant` in a later step. ||
|| `ql/src/test/results/clientpositive/show_functions.q.out` || the expected output from the `SHOW FUNCTIONS` Hive query. Since we're adding a new `histogram()` function, this expected output will change to reflect the new function. This file will be modified by `ant` in a later step. ||

== Writing the source ==

As stated above, create a new file called `ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogram.java`, relative to the Hive root directory. Please see `ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogramNumeric.java` for a detailed example of a complete Generic UDAF.
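
Purely for orientation, a hedged skeleton of the resolver/evaluator structure such a source file follows is sketched below. Every body here is a placeholder (the returned `ObjectInspector`s in particular must be real ones), and the actual type checking and histogram-estimation logic are exactly what `GenericUDAFHistogramNumeric.java` demonstrates.

{{{
package org.apache.hadoop.hive.ql.udf.generic;

import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.parse.SemanticException;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;

/**
 * Skeleton only: a resolver that type-checks the arguments and hands back an
 * evaluator implementing the map/combine/reduce aggregation lifecycle.
 */
public class GenericUDAFHistogram implements GenericUDAFResolver {

  public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters) throws SemanticException {
    if (parameters.length != 2) {
      throw new UDFArgumentTypeException(parameters.length - 1,
          "Exactly two arguments are expected.");
    }
    // Per-argument type checks (numeric column, integral bin count) go here.
    return new GenericUDAFHistogramEvaluator();
  }

  public static class GenericUDAFHistogramEvaluator extends GenericUDAFEvaluator {

    @Override
    public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {
      super.init(m, parameters);
      // Remember the input ObjectInspectors and return the output ObjectInspector
      // for this mode: PARTIAL1/PARTIAL2 emit the partial-aggregation type,
      // FINAL/COMPLETE emit the final array-of-structs type.
      return null; // placeholder
    }

    /** In-memory state of one aggregation: the histogram bins being built. */
    static class HistogramBuffer implements AggregationBuffer {
      // bin centers, bin heights, requested number of bins, ...
    }

    @Override
    public AggregationBuffer getNewAggregationBuffer() throws HiveException {
      return new HistogramBuffer();
    }

    @Override
    public void reset(AggregationBuffer agg) throws HiveException {
      // Clear the buffer so it can be reused for another group.
    }

    @Override
    public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException {
      // Fold one input row into the histogram estimate.
    }

    @Override
    public Object terminatePartial(AggregationBuffer agg) throws HiveException {
      // Serialize the partial histogram so it can be shipped to a reducer.
      return null; // placeholder
    }

    @Override
    public void merge(AggregationBuffer agg, Object partial) throws HiveException {
      // Merge a value produced by terminatePartial() into agg.
    }

    @Override
    public Object terminate(AggregationBuffer agg) throws HiveException {
      // Return the final list of (x, y) bin structs.
      return null; // placeholder
    }
  }
}
}}}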

== Modifying the function registry ==
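
As a hedged sketch, adding a new built-in generic UDAF to `FunctionRegistry.java` follows the same one-line pattern used for the existing built-ins; copy the exact method name from the neighbouring entries in your checkout.

{{{
// In ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java, next to
// the other registerGenericUDAF() calls (assumed to be the applicable pattern):
registerGenericUDAF("histogram", new GenericUDAFHistogram());
}}}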

== Creating the tests ==
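
Again only as a hedged illustration (a real test should follow the conventions of the existing `udaf_*.q` files and use the standard test tables), `ql/src/test/queries/clientpositive/udaf_histogram.q` might contain something like:

{{{
-- Illustrative contents only; the table, cast, and bin count are placeholders.
DESCRIBE FUNCTION histogram;
SELECT histogram(CAST(key AS DOUBLE), 5) FROM src;
}}}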

== Compiling, testing ==
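
In outline (hedged; check the flags against the developer documentation in your checkout), the usual sequence is:

{{{
# Build Hive together with the new UDAF.
ant package

# Run the new test; -Doverwrite=true regenerates the expected .q.out file,
# which should then be inspected and committed along with the .q test.
ant test -Dtestcase=TestCliDriver -Dqfile=udaf_histogram.q -Doverwrite=true
}}}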

= Checklist for open source submission =

 * Create an account on the [[https://issues.apache.org/jira/browse/HIVE | Hive JIRA]] and create an issue for your new patch under the `Query Processor` component. Solicit discussion and incorporate feedback.
 * Create your UDAF, integrate it into your local Hive copy.
 * Run `ant package` from the Hive root to compile Hive and your new UDAF.
 * Create `.q` tests and their corresponding `.q.out` output.
 * Modify the function registry if adding a new function.
 * Run `ant checkstyle`, ensure that your source files conform to the coding 
convention.
 * Run `ant test`, ensure that tests pass.
 * Run `svn up`, ensure no conflicts with the main repository.
 * Run `svn add` for whatever new files you have created.
 * Ensure that you have added `.q` and `.q.out` tests.
 * Ensure that you have run the `.q` tests for all new functionality.
 * If adding a new UDAF, ensure that `show_functions.q.out` has been updated.
 * Run `svn diff > HIVE-NNNN.1.patch` from the Hive root directory, where NNNN is the issue number JIRA has assigned to your issue.
 * Attach your file to the JIRA issue, describe your patch in the comments 
section.
 * Ask for a code review in the comments.
 * Click '''Submit patch''' on your issue after you have completed the steps 
above.
 * It is also advisable to '''watch''' your issue to monitor new comments.
