Mike,

Pig has a FOREACH operator that is somewhat analogous to opening a
cursor on a relation.

So you can do something like:

transactions = LOAD '/foo' ...;
by_prod = GROUP transactions BY product_id;
top_prod_users = FOREACH by_prod GENERATE group AS product_id,
    Top(5, 1, transactions) AS top5;
STORE top_prod_users INTO ...;

You would have to write the Top UDF (or wait until I wrap it up; as it
happens, I am working on it right now). Alternatively, you can also do
this:

top_prod_users = FOREACH by_prod {
  ordered = ORDER transactions BY customer_id DESC;
  top5 = LIMIT ordered 5;
  GENERATE group AS product_id, top5;
}

If you are used to SQL, the result of this may look a little weird --
the top 5 users come back as a bag (essentially, an unordered
collection nested in a single column).  You can use the FLATTEN
operator to get that into a familiar row-by-row representation.
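To make the bag-versus-rows distinction concrete, here is a rough local
Python analogy of the whole pipeline (the sample data and variable names
are made up for illustration; Pig, of course, runs this distributed):

```python
import heapq
from collections import Counter, defaultdict

# Hypothetical sample data: (product_id, customer_id) transaction pairs.
transactions = [
    (1, 10), (1, 10), (1, 11), (1, 12),
    (2, 10), (2, 13), (2, 13), (2, 13),
]

# Count purchases per (product, customer), like the GROUP BY step in SQL.
counts = Counter(transactions)

# Collect (customer_id, products_bought) rows per product.
per_product = defaultdict(list)
for (product_id, customer_id), n in counts.items():
    per_product[product_id].append((customer_id, n))

# Top-5 customers per product, kept as a "bag" in a single column --
# this mirrors the output shape of the nested FOREACH ... LIMIT in Pig.
top_bags = {
    pid: heapq.nlargest(5, rows, key=lambda row: row[1])
    for pid, rows in per_product.items()
}

# FLATTEN-style output: one (product_id, customer_id, products_bought)
# row per bag element, instead of one bag per product.
flattened = [
    (pid, cust, n) for pid, bag in top_bags.items() for cust, n in bag
]
```

The point is just the last two shapes: `top_bags` is one record per
product with a nested list, while `flattened` is the familiar one row
per (product, customer) pair.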

-D



On Mon, Jan 25, 2010 at 10:50 PM, Mike Roberts <[email protected]> wrote:
> I'm trying to use Pig to solve a fairly common SQL scenario that I run into.  
> I have boiled the problem down into its most basic form:
>
> You have a table of transactions defined as so:  CREATE TABLE transactions 
> (product_id INT, customer_id INT)
>
> |--------------------|
> |--Transactions------|
> |---product_id (INT)-|
> |---customer_id(INT)-|
> |--------------------|
>
>
> ****The goal is simple: For each product, produce a list of the top 5 largest 
> customers.****
>
> So, the base query would look like this:
>
> SELECT product_id, customer_id, count(*) as products_bought
> FROM transactions
> GROUP BY product_id, customer_id
>
> You could insert that value into another table called products_bought defined 
> as:
> CREATE TABLE prod_bought
> (product_id INT, customer_id INT, products_bought INT)
>
> Now you have an intermediate result that tells you how many times each 
> customer bought each product.  But, obviously, that doesn't completely solve 
> the problem.
>
> At this point, in order to solve the problem, you'd have to use a cursor or a 
> CROSS APPLY.  Here's an example in T-SQL:
>
> --THE CURSOR METHOD:
>
> DECLARE @productId int;
> DECLARE product_cur CURSOR FOR
> SELECT DISTINCT product_id
> FROM transactions t
>
> OPEN product_cur
>
> FETCH product_cur into @productId
>
> WHILE (@@FETCH_STATUS = 0)
> BEGIN
>
>   INSERT top_customers_by_product
>                SELECT TOP 5 product_id, customer_id, products_bought
>                FROM prod_bought
>                WHERE product_id = @productId
>                ORDER BY products_bought DESC
>
>   FETCH product_cur into @productId
>
> END
> CLOSE product_cur
> DEALLOCATE product_cur
>
>
> --THE CROSS APPLY METHOD:
>
> --First create a user defined function
> CREATE FUNCTION dbo.fn_GetTopXCustomers(@ProductId INT)
> RETURNS TABLE
> AS
> RETURN
>                SELECT TOP 5 product_id, customer_id, products_bought
>                FROM prod_bought
>                WHERE product_id = @ProductId
>                ORDER BY products_bought DESC
> GO
>
> --Build a table of distinct product Ids
> SELECT DISTINCT product_id INTO temp_distinct_product_ids FROM transactions
>
> --Run the CROSS APPLY
> SELECT A.product_id
> , A.customer_id
> , A.products_bought
> INTO top_customers_by_product
> FROM temp_distinct_product_ids T
> CROSS APPLY dbo.fn_GetTopXCustomers(T.product_id) A
>
>
> Okay, so there are two ways I could solve the problem in SQL (CROSS APPLY is 
> dramatically faster for anyone that cares).  How can I do the same thing in 
> Pig?  Here's the question restated: How can I implement a cursor in Pig?  How 
> can I do a for or while loop in Pig?  Can I implement a CROSS APPLY in Pig?
>
>
> I realize that I can implement a cursor outside of Pig and just execute the 
> same Pig script over and over and over again.  And, that's not a horrible 
> solution as long as it leverages the full power of Hadoop.  My concern is 
> that each of the individual queries that is run inside are fairly 
> inexpensive, but the total number of products makes the total job *very* 
> expensive.
>
> Also, the solution should be reusable -- I'd really prefer not to write a 
> custom jar every time I run into this problem.
>
> Actually, I'm also not particularly religious about using Pig.  If there's 
> some other tech that does what I need, that's cool too.
>
> Thanks in advance.
>
> Mike Roberts
>
>
