We can use a combination of a UDAF and a LATERAL VIEW to implement what you want:
1. Define a UDAF like this: max_n(5, products_bought, customer_id), which
returns the top 5 values of products_bought together with their
customer_id, as an array<struct<col0:int,col1:int>>.

2. Use a LATERAL VIEW (with explode) to transform each array back into
multiple rows:

  SELECT t.product_id, t5.products_bought, t5.customer_id
  FROM (
    SELECT product_id, max_n(5, products_bought, customer_id) AS top5
    FROM temp
    GROUP BY product_id
  ) t
  LATERAL VIEW explode(t.top5) t5 AS products_bought, customer_id;
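
Putting the two steps together with your transactions table, the whole job
could look roughly like this. This is only a sketch: max_n is not a
built-in, so the jar path and class name below are placeholders for
wherever you implement the UDAF, and it assumes the prod_bought and
top_customers_by_product tables have already been created.

  -- Placeholders: point these at your own UDAF jar and class.
  ADD JAR /path/to/your_udfs.jar;
  CREATE TEMPORARY FUNCTION max_n AS 'com.example.hive.udaf.MaxN';

  -- Your base query, materialized into prod_bought.
  -- (count(1) is the safe spelling of count(*) on older Hive releases.)
  INSERT OVERWRITE TABLE prod_bought
  SELECT product_id, customer_id, count(1)
  FROM transactions
  GROUP BY product_id, customer_id;

  -- Collapse each product to its top-5 array, then explode the array back
  -- into one row per (product, customer). Aliasing the exploded struct and
  -- projecting its fields (col0 = products_bought, col1 = customer_id, per
  -- the struct max_n returns) avoids relying on multi-column aliases for
  -- explode; the column order matches your (product_id, customer_id,
  -- products_bought) table.
  INSERT OVERWRITE TABLE top_customers_by_product
  SELECT t.product_id, x.pair.col1, x.pair.col0
  FROM (
    SELECT product_id, max_n(5, products_bought, customer_id) AS top5
    FROM prod_bought
    GROUP BY product_id
  ) t
  LATERAL VIEW explode(t.top5) x AS pair;

The point is that the top-5 selection happens inside the GROUP BY, so this
is a fixed number of MapReduce jobs regardless of how many products you
have; there is no per-product loop to pay for.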

See http://wiki.apache.org/hadoop/Hive/LanguageManual/LateralView

Paul is the author of the UDTF and LATERAL VIEW support. He might be able
to give you more details.

Zheng

On Mon, Jan 25, 2010 at 10:47 PM, Mike Roberts <[email protected]> wrote:
> I'm trying to use Hive to solve a fairly common SQL scenario that I run
> into. I have boiled the problem down to its most basic form.
>
> You have a table of transactions defined like so:
>
>   CREATE TABLE transactions (product_id INT, customer_id INT)
>
>   |--------------------|
>   |--Transactions------|
>   |--product_id (INT)--|
>   |--customer_id (INT)-|
>   |--------------------|
>
> **The goal is simple: for each product, produce a list of the top 5
> largest customers.**
>
> So, the base query would look like this:
>
>   SELECT product_id, customer_id, count(*) AS products_bought
>   FROM transactions
>   GROUP BY product_id, customer_id
>
> You could insert that result into another table, prod_bought, defined as:
>
>   CREATE TABLE prod_bought
>   (product_id INT, customer_id INT, products_bought INT)
>
> Now you have an intermediate result that tells you how many times each
> customer bought each product. But, obviously, that doesn't completely
> solve the problem.
>
> At this point, in order to solve the problem, you'd have to use a cursor
> or a CROSS APPLY. Here's an example in T-SQL:
>
> -- THE CURSOR METHOD:
>
>   DECLARE @productId INT;
>
>   DECLARE product_cur CURSOR FOR
>   SELECT DISTINCT product_id
>   FROM transactions
>
>   OPEN product_cur
>
>   FETCH product_cur INTO @productId
>
>   WHILE (@@FETCH_STATUS = 0)
>   BEGIN
>     INSERT top_customers_by_product
>     SELECT TOP 5 product_id, customer_id, products_bought
>     FROM prod_bought
>     WHERE product_id = @productId
>     ORDER BY products_bought DESC
>
>     FETCH product_cur INTO @productId
>   END
>
>   CLOSE product_cur
>   DEALLOCATE product_cur
>
> -- THE CROSS APPLY METHOD:
>
>   -- First, create a user-defined function
>   CREATE FUNCTION dbo.fn_GetTopXCustomers(@productId INT)
>   RETURNS TABLE
>   AS
>   RETURN
>     SELECT TOP 5 product_id, customer_id, products_bought
>     FROM prod_bought
>     WHERE product_id = @productId
>     ORDER BY products_bought DESC
>   GO
>
>   -- Build a table of distinct product ids
>   SELECT DISTINCT product_id
>   INTO temp_distinct_product_ids
>   FROM transactions
>
>   -- Run the CROSS APPLY
>   SELECT A.product_id, A.customer_id, A.products_bought
>   INTO top_customers_by_product
>   FROM temp_distinct_product_ids T
>   CROSS APPLY dbo.fn_GetTopXCustomers(T.product_id) A
>
> Okay, so there are two ways I could solve the problem in SQL (CROSS APPLY
> is dramatically faster, for anyone who cares). How can I do the same
> thing in Hive? Here's the question restated: how can I implement a cursor
> in Hive? How can I do a for or while loop in Hive? Can I implement a
> CROSS APPLY in Hive?
>
> I realize that I can implement a cursor outside of Hive and just execute
> the same Hive script over and over again. And that's not a horrible
> solution, as long as it leverages the full power of Hadoop. My concern is
> that each of the individual queries run inside is fairly inexpensive, but
> the total number of products makes the total job *very* expensive.
>
> Also, the solution should be reusable -- I'd really prefer not to write a
> custom jar every time I run into this problem.
>
> Actually, I'm also not particularly religious about using Hive. If
> there's some other tech that does what I need, that's cool too.
>
> Thanks in advance.
>
> Mike Roberts

--
Yours,
Zheng
