Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The "Hive/GenericUDAFCaseStudy" page has been changed by MayankLahiri.
The comment on this change is: finished the GenericUDAF writing case study.
http://wiki.apache.org/hadoop/Hive/GenericUDAFCaseStudy?action=diff&rev1=3&rev2=4

--------------------------------------------------

    } 
  }}}
  
+ What do all these functions do? The following is a brief summary of each function, in (roughly) the order in which they are called. It's '''very''' important to remember that the computation of your aggregation must be arbitrarily divisible over the data. Think of it like writing a divide-and-conquer algorithm where the partitioning of the data is completely out of your control and handled by Hive. More formally, given any subset of the input rows, you should be able to compute a partial result, and also be able to merge any pair of partial results into another partial result. This naturally makes it difficult to port over many existing algorithms, but should guarantee researchers jobs for quite some time.
+ 
+ || '''Function''' || '''Purpose''' ||
+ || init || Called by Hive to initialize an instance of your UDAF evaluator class. ||
+ || getNewAggregationBuffer || Return an object that will be used to store temporary aggregation results. ||
+ || iterate || Process a new row of data into the aggregation buffer. ||
+ || terminatePartial || Return the contents of the current aggregation in a serializable way. ||
+ || merge || Merge a partial aggregation returned by '''terminatePartial''' into the current aggregation. ||
+ || terminate || Return the final result of the aggregation to Hive. ||
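+ 
+ For orientation, these correspond to the methods of `GenericUDAFEvaluator` that your evaluator overrides (signatures as in the Hive source; the `reset()` method, which re-initializes an aggregation buffer for reuse, is not shown in the table):
+ 
+ {{{
+ public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException;
+ public AggregationBuffer getNewAggregationBuffer() throws HiveException;
+ public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException;
+ public Object terminatePartial(AggregationBuffer agg) throws HiveException;
+ public void merge(AggregationBuffer agg, Object partial) throws HiveException;
+ public Object terminate(AggregationBuffer agg) throws HiveException;
+ }}}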
+ 
+ For writing the `histogram()` function, the following strategy was adopted.
+ 
+ ==== getNewAggregationBuffer ====
+ 
+ The aggregation buffer for a histogram is a list of (x,y) pairs representing the histogram's bin centers and heights. In addition, the aggregation buffer stores two integers: the maximum number of bins (a user-specified parameter) and the number of bins currently in use. The aggregation buffer is initialized to a 'not ready' state with the number of bins set to 0, because Hive makes no distinction between a constant parameter supplied to a UDAF and a column from a table; thus, we have no way of knowing how many bins the user wants in their histogram until the first call to `iterate()`.
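+ 
+ A minimal sketch of such a buffer, with illustrative names (the actual class in `GenericUDAFHistogramNumeric.java` differs in detail):
+ 
+ {{{
+ static class HistogramAgg implements AggregationBuffer {
+   int maxBins = 0;       // user-specified bin count; 0 means 'not ready'
+   int nUsedBins = 0;     // number of bins currently in use
+   // (x,y) pairs: bin centers and heights, kept sorted by x
+   ArrayList<double[]> bins = new ArrayList<double[]>();
+ }
+ }}}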
+ 
+ ==== iterate ====
+ 
+ The first thing we do in `iterate()` is check whether the histogram object in our aggregation buffer has been initialized. If it has not, we parse out the second argument to `iterate()`, which is the number of histogram bins requested by the user. We do this exactly once and initialize the histogram object. Note that error checking is performed here -- if the user supplied zero or a negative number for the number of histogram bins, a `HiveException` is thrown at this point and computation terminates.
+ 
+ Next, we parse out the actual input data item (a number) and add it to our 
histogram estimation in the aggregation buffer. See the 
`GenericUDAFHistogramNumeric.java` file for details on the heuristic used to 
construct a histogram.
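+ 
+ A condensed sketch of this logic follows; `nbinsOI`, `inputOI`, and `addToHistogram()` are hypothetical names standing in for the evaluator's object inspector fields and the histogram update heuristic:
+ 
+ {{{
+ public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException {
+   HistogramAgg myagg = (HistogramAgg) agg;
+   if (myagg.maxBins == 0) {
+     // First call: parse the second argument (requested number of bins)
+     int nbins = PrimitiveObjectInspectorUtils.getInt(parameters[1], nbinsOI);
+     if (nbins < 1) {
+       throw new HiveException("histogram: number of bins must be at least 1.");
+     }
+     myagg.maxBins = nbins;
+   }
+   if (parameters[0] != null) {
+     double v = PrimitiveObjectInspectorUtils.getDouble(parameters[0], inputOI);
+     addToHistogram(myagg, v);  // hypothetical helper: the update heuristic
+   }
+ }
+ }}}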
+ 
+ ==== terminatePartial ====
+ 
+ The current histogram approximation is serialized as a list of `DoubleWritable` objects. The first two doubles in the list indicate the maximum number of histogram bins specified by the user and the number of bins currently in use. The remaining entries are the (x,y) pairs from the current histogram approximation.
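+ 
+ As a sketch, reusing the illustrative buffer class from above:
+ 
+ {{{
+ public Object terminatePartial(AggregationBuffer agg) throws HiveException {
+   HistogramAgg myagg = (HistogramAgg) agg;
+   // Flatten to [maxBins, nUsedBins, x1, y1, x2, y2, ...]
+   ArrayList<DoubleWritable> partial = new ArrayList<DoubleWritable>();
+   partial.add(new DoubleWritable(myagg.maxBins));
+   partial.add(new DoubleWritable(myagg.nUsedBins));
+   for (double[] bin : myagg.bins) {
+     partial.add(new DoubleWritable(bin[0]));  // bin center (x)
+     partial.add(new DoubleWritable(bin[1]));  // bin height (y)
+   }
+   return partial;
+ }
+ }}}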
+ 
+ ==== merge ====
+ 
+ At this point, we have a (possibly uninitialized) histogram estimation, and 
have been requested to merge it with another estimation performed on a separate 
subset of the rows. If '''N''' is the number of histogram bins specified by the 
user, the current heuristic first builds a histogram with all '''2N''' bins 
from both estimations, and then iteratively merges the closest pair of bins 
until only '''N''' bins remain.
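+ 
+ The bin-trimming step might look like the following sketch (a simplified version of the heuristic, assuming bins are kept sorted by center; names are illustrative):
+ 
+ {{{
+ // After appending the partial result's bins (up to 2N in total), merge
+ // the closest pair of bins until only maxBins remain.
+ private void trim(HistogramAgg h) {
+   while (h.nUsedBins > h.maxBins) {
+     // Find the adjacent pair of bins with the smallest gap between centers.
+     int best = 0;
+     double bestGap = Double.MAX_VALUE;
+     for (int i = 0; i < h.nUsedBins - 1; i++) {
+       double gap = h.bins.get(i + 1)[0] - h.bins.get(i)[0];
+       if (gap < bestGap) { bestGap = gap; best = i; }
+     }
+     // Replace the closest pair with one bin at their weighted mean center.
+     double[] a = h.bins.get(best), b = h.bins.get(best + 1);
+     double y = a[1] + b[1];
+     h.bins.set(best, new double[] { (a[0] * a[1] + b[0] * b[1]) / y, y });
+     h.bins.remove(best + 1);
+     h.nUsedBins--;
+   }
+ }
+ }}}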
+ 
+ ==== terminate ====
+ 
+ The final return type from the `histogram()` function is an array of (x,y) 
pairs representing histogram bin centers and heights. These can be 
`explode()`ed into a separate table, or parsed using a script and passed to 
Gnuplot (for example) to visualize the histogram.
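+ 
+ A condensed sketch of the final step, again with illustrative names:
+ 
+ {{{
+ public Object terminate(AggregationBuffer agg) throws HiveException {
+   HistogramAgg myagg = (HistogramAgg) agg;
+   if (myagg.nUsedBins < 1) {
+     return null;  // no rows were aggregated
+   }
+   // One (x,y) struct per bin.
+   ArrayList<DoubleWritable[]> result = new ArrayList<DoubleWritable[]>();
+   for (int i = 0; i < myagg.nUsedBins; i++) {
+     result.add(new DoubleWritable[] {
+         new DoubleWritable(myagg.bins.get(i)[0]),     // bin center (x)
+         new DoubleWritable(myagg.bins.get(i)[1]) });  // bin height (y)
+   }
+   return result;
+ }
+ }}}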
+ 
  == Modifying the function registry ==
  
+ Once the code for the UDAF has been written and the source file placed in `ql/src/java/org/apache/hadoop/hive/ql/udf/generic`, it's time to modify the function registry and incorporate the new function into Hive's list of functions. This simply involves editing `ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java` to import your UDAF class and register its name.
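+ 
+ For `histogram_numeric()`, the registration amounts to a single line along these lines, next to the existing registrations:
+ 
+ {{{
+ registerGenericUDAF("histogram_numeric", new GenericUDAFHistogramNumeric());
+ }}}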
+ 
+ Please note that you will have to run the following command to update the 
output of the `show functions` Hive call:
+ 
+ {{{
+ ant test -Dtestcase=TestCliDriver -Dqfile=show_functions.q -Doverwrite=true
+ }}}
+ 
+ == Compiling and running ==
+ 
+ To compile Hive together with your new UDAF and start the Hive CLI for a quick manual check:
+ 
+ {{{
+ ant package
+ build/dist/bin/hive
+ }}}
+ 
  == Creating the tests ==
  
- == Compiling, testing ==
+ System-level tests consist of writing sample queries that operate on sample data, generating the expected output for those queries, and ensuring that the queries keep producing that expected output in the future. Note that the expected output is compared against Hive's actual output using `diff`, so nondeterministic algorithms will have to compute some sort of statistic and then keep only the most significant digits (for example).
+ 
+ These are the simple steps needed for creating test cases for your new 
UDAF/UDF:
+ 
+  1. Create a file in `ql/src/test/queries/clientpositive/udaf_XXXXX.q` where 
`XXXXX` is your UDAF's name.
+  2. Put some queries in the `.q` file -- hopefully enough to cover the full 
range of functionality and special cases.
+  3. For sample data, put your own in `hive/data/files` and load it using 
`LOAD DATA LOCAL INPATH...`, or reuse one of the files already there (grep for 
LOAD in the queries directory to see table names).
+  4. `touch ql/src/test/results/clientpositive/udaf_XXXXX.q.out`
+  5. Run the following command to generate the expected output into the `.q.out` result file:
+ {{{
+ ant test -Dtestcase=TestCliDriver -Dqfile=udaf_XXXXX.q -Doverwrite=true
+ }}}
+  6. Run the following command to make sure your test runs fine.
+ {{{
+ ant test -Dtestcase=TestCliDriver -Dqfile=udaf_XXXXX.q
+ }}}
  
  = Checklist for open source submission =
  
@@ -178, +240 @@

   * Run `ant package` from the Hive root to compile Hive and your new UDAF.
   * Create `.q` tests and their corresponding `.q.out` output.
   * Modify the function registry if adding a new function.
-  * Run `ant checkstyle`, ensure that your source files conform to the coding 
convention.
+  * Run `ant checkstyle` and examine `build/checkstyle/checkstyle-errors.html` to ensure that your source files conform to the Sun Java coding convention (with the 100-character line length exception).
   * Run `ant test`, ensure that tests pass.
   * Run `svn up`, ensure no conflicts with the main repository.
   * Run `svn add` for whatever new files you have created.
   * Ensure that you have added `.q` and `.q.out` tests.
   * Ensure that you have run the `.q` tests for all new functionality.
-  * If adding a new UDAF, ensure that `show_functions.q.out` has been updated.
+  * If adding a new UDAF, ensure that `show_functions.q.out` has been updated. 
Run `ant test -Dtestcase=TestCliDriver -Dqfile=show_functions.q 
-Doverwrite=true` to do this.
   * Run `svn diff > HIVE-NNNN.1.patch` from the Hive root directory, where 
NNNN is the issue number the JIRA has assigned to you.
   * Attach your file to the JIRA issue, describe your patch in the comments 
section.
   * Ask for a code review in the comments.
   * Click '''Submit patch''' on your issue after you have completed the steps 
above.
   * It is also advisable to '''watch''' your issue to monitor new comments.
  
+ = Tips, Tricks, Best Practices =
+ 
+  * Hive can sometimes behave unexpectedly. If you see anything weird, ranging from unexplained exceptions to strings being incorrectly double-quoted, it is best to run `ant clean` first.
+  * When serializing the aggregation buffer in a `terminatePartial()` call, if your UDAF needs only a few variables to represent the buffer (as with average, for example), consider serializing them into a simple list of doubles instead of a complicated named structure.
+  * Strongly cast generics wherever you can.
+  * Abstract core functionality from multiple UDAFs into its own class. 
Examples are `histogram_numeric()` and `percentile_approx()`, which both use 
the same core histogram estimation functionality.
+  * If you're stuck looking for an algorithm to adapt to the 
terminatePartial/merge paradigm, divide-and-conquer and parallel algorithms are 
predictably good places to start.
+  * Remember that the tests do a `diff` on the expected and actual output, and 
fail if there is any difference at all. An example of where this can fail 
horribly is a UDAF like `ngrams()`, where the output is a list of sorted 
(word,count) pairs. In some cases, different sort implementations might place 
words with the same count at different positions in the output. Even though the 
output is correct, the test will fail. In these cases, it's better to output 
(for example) only the counts, or some appropriate statistic on the counts, 
like the sum.
+ 
