I'm trying to use Hive to solve a fairly common SQL scenario that I run into.  
I have boiled the problem down into its most basic form:

You have a table of transactions defined as so:  CREATE TABLE transactions 
(product_id INT, customer_id INT)

|--------------------|
|--Transactions------|
|---product_id (INT)-|
|---customer_id(INT)-|
|--------------------|


****The goal is simple: For each product, produce a list of the top 5 largest 
customers.****

So, the base query would look like this:

SELECT product_id, customer_id, count(*) as products_bought
FROM transactions
GROUP BY product_id, customer_id

You could insert that value into another table called products_bought defined 
as:
CREATE TABLE prod_bought
(product_id INT, customer_id INT, products_bought INT)

Now you have an intermediate result that tells you how many times each customer 
bought each product.  But, obviously, that doesn't completely solve the problem.

At this point, in order to solve the problem, you'd have to use a cursor or a 
CROSS APPLY.  Here's an example in T-SQL:

--THE CURSOR METHOD:

DECLARE @productId int;
DECLARE product_cur CURSOR FOR
SELECT DISTINCT product_id
FROM transactions t

OPEN product_cur

FETCH product_cur into @productId

WHILE (@@FETCH_STATUS <> -1)
BEGIN

   FETCH product_cur into @productId
   INSERT top_customers_by_product
                SELECT TOP 5 product_id, customer_id, products_bought
                FROM prod_bought
                WHERE product_id = @productId
                ORDER BY products_bought desc


END
CLOSE Domains
DEALLOCATE Domains


--THE CROSS APPLY METHOD:

--First create a user defined function
CREATE FUNCTION dbo.fn_GetTopXCustomers(@ProductId INT)
RETURNS TABLE
AS
RETURN
                SELECT TOP 5 product_id, customer_id, products_bought
                FROM prod_bought
                WHERE product_id = @productId
                ORDER BY products_bought desc
GO

--Build a table of distinct product Ids
SELECT DISTINCT product_id INTO temp_distinct_product_ids FROM transactions

--Run the CROSS APPLY
SELECT A.product_id
, A.customer_id
, A.products_bought
INTO top_customers_by_product
FROM temp_distinct_product_ids T
CROSS APPLY dbo.fn_GetTopXCustomers(T.product_id) A


Okay, so there are two ways I could solve the problem in SQL (CROSS APPLY is 
dramatically faster for anyone that cares).  How can I do the same thing in 
Hive?  Here's the question restated: How can I implement a cursor in Hive?  How 
can I do a for or while loop in Hive?  Can I implement a CROSS APPLY in Hive?


I realize that I can implement a cursor outside of Hive and just execute the 
same Hive script over and over and over again.  And, that's not a horrible 
solution as long as it leverages the full power of Hadoop.  My concern is that 
each of the individual queries that is run inside are fairly inexpensive, but 
the total number of products makes the total job *very* expensive.

Also, the solution should be reusable -- I'd really prefer not to write a 
custom jar every time I run into this problem.

Actually, I'm also not particularly religious about using Hive.  If there's 
some other tech that does what I need, that's cool too.

Thanks in advance.

Mike Roberts

Reply via email to