I'm trying to use Hive to solve a fairly common SQL scenario that I run into.
I have boiled the problem down into its most basic form:
You have a table of transactions defined as so: CREATE TABLE transactions
(product_id INT, customer_id INT)
|--------------------|
|--Transactions------|
|---product_id (INT)-|
|---customer_id(INT)-|
|--------------------|
****The goal is simple: For each product, produce a list of the top 5 largest
customers.****
So, the base query would look like this:
SELECT product_id, customer_id, count(*) as products_bought
FROM transactions
GROUP BY product_id, customer_id
You could insert that value into another table called products_bought defined
as:
CREATE TABLE prod_bought
(product_id INT, customer_id INT, products_bought INT)
Now you have an intermediate result that tells you how many times each customer
bought each product. But, obviously, that doesn't completely solve the problem.
At this point, in order to solve the problem, you'd have to use a cursor or a
CROSS APPLY. Here's an example in T-SQL:
--THE CURSOR METHOD:
DECLARE @productId int;
DECLARE product_cur CURSOR FOR
SELECT DISTINCT product_id
FROM transactions t
OPEN product_cur
FETCH product_cur into @productId
WHILE (@@FETCH_STATUS <> -1)
BEGIN
FETCH product_cur into @productId
INSERT top_customers_by_product
SELECT TOP 5 product_id, customer_id, products_bought
FROM prod_bought
WHERE product_id = @productId
ORDER BY products_bought desc
END
CLOSE Domains
DEALLOCATE Domains
--THE CROSS APPLY METHOD:
--First create a user defined function
CREATE FUNCTION dbo.fn_GetTopXCustomers(@ProductId INT)
RETURNS TABLE
AS
RETURN
SELECT TOP 5 product_id, customer_id, products_bought
FROM prod_bought
WHERE product_id = @productId
ORDER BY products_bought desc
GO
--Build a table of distinct product Ids
SELECT DISTINCT product_id INTO temp_distinct_product_ids FROM transactions
--Run the CROSS APPLY
SELECT A.product_id
, A.customer_id
, A.products_bought
INTO top_customers_by_product
FROM temp_distinct_product_ids T
CROSS APPLY dbo.fn_GetTopXCustomers(T.product_id) A
Okay, so there are two ways I could solve the problem in SQL (CROSS APPLY is
dramatically faster for anyone that cares). How can I do the same thing in
Hive? Here's the question restated: How can I implement a cursor in Hive? How
can I do a for or while loop in Hive? Can I implement a CROSS APPLY in Hive?
I realize that I can implement a cursor outside of Hive and just execute the
same Hive script over and over and over again. And, that's not a horrible
solution as long as it leverages the full power of Hadoop. My concern is that
each of the individual queries that is run inside are fairly inexpensive, but
the total number of products makes the total job *very* expensive.
Also, the solution should be reusable -- I'd really prefer not to write a
custom jar every time I run into this problem.
Actually, I'm also not particularly religious about using Hive. If there's
some other tech that does what I need, that's cool too.
Thanks in advance.
Mike Roberts