Add usage of each_top_k Project: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/commit/ae2307f2 Tree: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/tree/ae2307f2 Diff: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/diff/ae2307f2
Branch: refs/heads/master Commit: ae2307f260a637714e12c30a4a2c4b5b4afa584e Parents: e987711 Author: myui <[email protected]> Authored: Wed Nov 16 17:40:11 2016 +0900 Committer: myui <[email protected]> Committed: Wed Nov 16 17:40:11 2016 +0900 ---------------------------------------------------------------------- docs/gitbook/misc/topk.md | 185 ++++++++++++++++++++++++++++------------- 1 file changed, 127 insertions(+), 58 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/ae2307f2/docs/gitbook/misc/topk.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/misc/topk.md b/docs/gitbook/misc/topk.md index 3d072ed..d6e7b93 100644 --- a/docs/gitbook/misc/topk.md +++ b/docs/gitbook/misc/topk.md @@ -21,17 +21,86 @@ This function is particularly useful for applying a similarity/distance function where the computation complexity is **O(nm)**. -`each_top_k` is very fast when compared to other methods running top-k queries (e.g., [`rank/distributed by`](https://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/)) in Hive. +`each_top_k` is very fast when compared to other methods running top-k queries (e.g., [`rank/distribute by`](https://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/)) in Hive. ## Caution * `each_top_k` is supported from Hivemall v0.3.2-3 or later. -* This UDTF assumes that input records are sorted by `group`. Use `DISTRIBUTED BY group SORTED BY group` to ensure that. Or, you can use `LEFT OUTER JOIN` for certain cases. +* This UDTF assumes that input records are sorted by `group`. Use `DISTRIBUTE BY group SORT BY group` to ensure that. Or, you can use `LEFT OUTER JOIN` for certain cases. * It takes variable lengths arguments in `argN`. * The third argument `value` is used for the comparison. * `Any number types` or `timestamp` are accepted for the type of `value`. * If k is less than 0, reverse order is used and `tail-K` records are returned for each `group`. * Note that this function returns [a pseudo ranking](http://www.michaelpollmeier.com/selecting-top-k-items-from-a-list-efficiently-in-java-groovy/) for top-k. It always returns `at-most K` records for each group. The ranking scheme is similar to `dense_rank` but slightly different in certain cases. +# Efficient Top-k Query Processing using `each_top_k` + +Efficient processing of Top-k queries is a crucial requirement in many interactive environments that involve massive amounts of data. +Our Hive extension `each_top_k` helps running Top-k processing efficiently. + +- Suppose the following table as the input + +|student | class | score | +|:------:|:-----:|:-----:| +|1 | b | 70 | +|2 | a | 80 | +|3 | a | 90 | +|4 | b | 50 | +|5 | a | 70 | +|6 | b | 60 | + +- Then, list top-2 students for each class + +|student | class | score | rank | +|:------:|:-----:|:-----:|:----:| +|3 | a | 90 | 1 | +|2 | a | 80 | 2 | +|1 | b | 70 | 1 | +|6 | b | 60 | 2 | + +The standard way using SQL window function would be as follows: + +```sql +SELECT + student, class, score, rank +FROM ( + SELECT + student, class, score, + rank() over (PARTITION BY class ORDER BY score DESC) as rank + FROM + table +) t +WHRE rank <= 2 +``` + +An alternative and efficient way to compute top-k items using `each_top_k` is as follows: + +```sql +SELECT + each_top_k( + 2, class, score, + class, student -- output columns other in addition to rank and score + ) as (rank, score, class, student) +FROM ( + SELECT * FROM table + CLUSTER BY class -- Mandatory for `each_top_k` +) t +``` + +> #### Note +`CLUSTER BY x` is a synonym of `DISTRIBUTE BY x CLASS SORT BY x` and required when using `each_top_k`. + +The function signature of `each_top_k` is `each_top_k(int k, ANY group, double value, arg1, arg2, ..., argN)` and it returns a relation `(int rank, double value, arg1, arg2, .., argN)`. + +Any number types or timestamp are accepted for the type of `value` but it MUST be not NULL. +Do null hanlding like `if(value is null, -1, value)` to avoid null. + +If `k` is less than 0, reverse order is used and tail-K records are returned for each `group`. + +The ranking semantics of `each_top_k` follows SQL's `dense_rank` and then limits results by `k`. + +> #### Caution +`each_top_k` is benefical where the number of grouping keys are large. If the number of grouping keys are not so large (e.g., less than 100), consider using `rank() over` instead. + # Usage ## top-k clicks @@ -77,34 +146,34 @@ FROM LEFT OUTER JOIN train_hivemall t1; ``` -``` -1 0.8594650626182556 12 10514 0 -2 0.8585299849510193 12 11719 0 -3 0.856602132320404 12 21009 0 -4 0.8562054634094238 12 17582 0 -5 0.8516314029693604 12 22006 0 -6 0.8499397039413452 12 25364 0 -7 0.8467264771461487 12 900 0 -8 0.8463355302810669 12 8018 0 -9 0.8439178466796875 12 7041 0 -10 0.8438876867294312 12 21595 0 -1 0.8390793800354004 25 21125 0 -2 0.8344510793685913 25 14073 0 -3 0.8340602517127991 25 9008 0 -4 0.8328862190246582 25 6598 0 -5 0.8301891088485718 25 943 0 -6 0.8271955251693726 25 20400 0 -7 0.8255619406700134 25 10922 0 -8 0.8241575956344604 25 8477 0 -9 0.822281539440155 25 25977 0 -10 0.8205751180648804 25 21115 0 -1 0.9761330485343933 34 2513 0 -2 0.9536819458007812 34 8697 0 -3 0.9531533122062683 34 7326 0 -4 0.9493276476860046 34 15173 0 -5 0.9480557441711426 34 19468 0 -... -``` +| rank | similarity | base_id | neighbor_id | y | +|:----:|:----------:|:-------:|:-----------:|:-:| +| 1 | 0.8594650626182556 | 12 | 10514 | 0 | +| 2 | 0.8585299849510193 | 12 | 11719 | 0 | +| 3 | 0.856602132320404 | 12 | 21009 | 0 | +| 4 | 0.8562054634094238 | 12 | 17582 | 0 | +| 5 | 0.8516314029693604 | 12 | 22006 | 0 | +| 6 | 0.8499397039413452 | 12 | 25364 | 0 | +| 7 | 0.8467264771461487 | 12 | 900 | 0 | +| 8 | 0.8463355302810669 | 12 | 8018 | 0 | +| 9 | 0.8439178466796875 | 12 | 7041 | 0 | +| 10 | 0.8438876867294312 | 12 | 21595 | 0 | +| 1 | 0.8390793800354004 | 25 | 21125 | 0 | +| 2 | 0.8344510793685913 | 25 | 14073 | 0 | +| 3 | 0.8340602517127991 | 25 | 9008 | 0 | +| 4 | 0.8328862190246582 | 25 | 6598 | 0 | +| 5 | 0.8301891088485718 | 25 | 943 | 0 | +| 6 | 0.8271955251693726 | 25 | 20400 | 0 | +| 7 | 0.8255619406700134 | 25 | 10922 | 0 | +| 8 | 0.8241575956344604 | 25 | 8477 | 0 | +| 9 | 0.822281539440155 | 25 | 25977 | 0 | +| 10 | 0.8205751180648804 | 25 | 21115 | 0 | +| 1 | 0.9761330485343933 | 34 | 2513 | 0 | +| 2 | 0.9536819458007812 | 34 | 8697 | 0 | +| 3 | 0.9531533122062683 | 34 | 7326 | 0 | +| 4 | 0.9493276476860046 | 34 | 15173 | 0 | +| 5 | 0.9480557441711426 | 34 | 19468 | 0 | +| .. | .. | .. | .. | .. | ### Explicit grouping using `distribute by` and `sort by` @@ -277,31 +346,31 @@ FROM -- limit 25 ``` -``` -1 0.4383084177970886 1 7503 0 -2 0.44166821241378784 1 10143 0 -3 0.4424300789833069 1 11073 0 -4 0.44254064559936523 1 17782 0 -5 0.4442034363746643 1 18556 0 -6 0.45163780450820923 1 3786 0 -7 0.45244503021240234 1 10242 0 -8 0.4525672197341919 1 21657 0 -9 0.4527127146720886 1 17218 0 -10 0.45314133167266846 1 25141 0 -1 0.44030147790908813 2 3786 0 -2 0.4408798813819885 2 23386 0 -3 0.44112563133239746 2 11073 0 -4 0.4415401816368103 2 22853 0 -5 0.4422193765640259 2 21657 0 -6 0.4429032802581787 2 10143 0 -7 0.4435907006263733 2 24413 0 -8 0.44569307565689087 2 7503 0 -9 0.4460843801498413 2 25141 0 -10 0.4464914798736572 2 24289 0 -1 0.43862903118133545 3 23150 1 -2 0.4398220181465149 3 9881 1 -3 0.44283604621887207 3 27121 0 -4 0.4432108402252197 3 26220 1 -5 0.44323229789733887 3 18541 0 -... -``` \ No newline at end of file +| rank | similarity | base_id | neighbor_id | y | +|:----:|:----------:|:-------:|:-----------:|:-:| +| 1 | 0.4383084177970886 | 1 | 7503 | 0 | +| 2 | 0.44166821241378784 | 1 | 10143 | 0 | +| 3 | 0.4424300789833069 | 1 | 11073 | 0 | +| 4 | 0.44254064559936523 | 1 | 17782 | 0 | +| 5 | 0.4442034363746643 | 1 | 18556 | 0 | +| 6 | 0.45163780450820923 | 1 | 3786 | 0 | +| 7 | 0.45244503021240234 | 1 | 10242 | 0 | +| 8 | 0.4525672197341919 | 1 | 21657 | 0 | +| 9 | 0.4527127146720886 | 1 | 17218 | 0 | +| 10 | 0.45314133167266846 | 1 | 25141 | 0 | +| 1 | 0.44030147790908813 | 2 | 3786 | 0 | +| 2 | 0.4408798813819885 | 2 | 23386 | 0 | +| 3 | 0.44112563133239746 | 2 | 11073 | 0 | +| 4 | 0.4415401816368103 | 2 | 22853 | 0 | +| 5 | 0.4422193765640259 | 2 | 21657 | 0 | +| 6 | 0.4429032802581787 | 2 | 10143 | 0 | +| 7 | 0.4435907006263733 | 2 | 24413 | 0 | +| 8 | 0.44569307565689087 | 2 | 7503 | 0 | +| 9 | 0.4460843801498413 | 2 | 25141 | 0 | +| 10 | 0.4464914798736572 | 2 | 24289 | 0 | +| 1 | 0.43862903118133545 | 3 | 23150 | 1 | +| 2 | 0.4398220181465149 | 3 | 9881 | 1 | +| 3 | 0.44283604621887207 | 3 | 27121 | 0 | +| 4 | 0.4432108402252197 | 3 | 26220 | 1 | +| 5 | 0.44323229789733887 | 3 | 18541 | 0 | +| .. | .. | .. | .. | .. |
