Mike,
Pig has a FOREACH operator that is somewhat analogous to opening a
cursor on a relation.
So you can do something like:
transactions = LOAD '/foo' ...;
by_prod = GROUP transactions BY product_id;
top_prod_users = FOREACH by_prod GENERATE group as product_id, Top(5,
1, transactions) as top5;
STORE top_prod_users into ...
You would have to write the Top UDF (or wait till I wrap it up; as it
happens, I am working on one right now). Alternatively, you can also
do this:
top_prod_users = FOREACH by_prod {
ordered = ORDER transactions BY customer_id DESC;
top5 = LIMIT ordered 5;
GENERATE group as product_id, top5;
};
If you are used to SQL, the result of this may look a little odd --
the top 5 users come back as a bag (essentially, an unordered
collection nested in a single column). You can use the FLATTEN
operator to get that into a familiar row-by-row representation.
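For your original question specifically (top 5 customers per product
by number of purchases), you would count first, then take the top 5
by that count. Roughly like this (untested; relation names are made
up here):

pairs = GROUP transactions BY (product_id, customer_id);
counts = FOREACH pairs GENERATE
    FLATTEN(group) AS (product_id, customer_id),
    COUNT(transactions) AS products_bought;
by_prod = GROUP counts BY product_id;
top_prod_users = FOREACH by_prod {
    ordered = ORDER counts BY products_bought DESC;
    top5 = LIMIT ordered 5;
    GENERATE FLATTEN(top5);
};

Each output row is then (product_id, customer_id, products_bought),
the same shape as your top_customers_by_product table.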
-D
On Mon, Jan 25, 2010 at 10:50 PM, Mike Roberts <[email protected]> wrote:
> I'm trying to use Pig to solve a fairly common SQL scenario that I run into.
> I have boiled the problem down into its most basic form:
>
> You have a table of transactions defined as so: CREATE TABLE transactions
> (product_id INT, customer_id INT)
>
> |--------------------|
> |--Transactions------|
> |---product_id (INT)-|
> |---customer_id(INT)-|
> |--------------------|
>
>
> ****The goal is simple: For each product, produce a list of the top 5 largest
> customers.****
>
> So, the base query would look like this:
>
> SELECT product_id, customer_id, count(*) as products_bought
> FROM transactions
> GROUP BY product_id, customer_id
>
> You could insert those results into another table called prod_bought, defined
> as:
> CREATE TABLE prod_bought
> (product_id INT, customer_id INT, products_bought INT)
>
> Now you have an intermediate result that tells you how many times each
> customer bought each product. But, obviously, that doesn't completely solve
> the problem.
>
> At this point, in order to solve the problem, you'd have to use a cursor or a
> CROSS APPLY. Here's an example in T-SQL:
>
> --THE CURSOR METHOD:
>
> DECLARE @productId int;
> DECLARE product_cur CURSOR FOR
> SELECT DISTINCT product_id
> FROM transactions t
>
> OPEN product_cur
>
> FETCH product_cur INTO @productId
>
> WHILE (@@FETCH_STATUS <> -1)
> BEGIN
>
> INSERT top_customers_by_product
> SELECT TOP 5 product_id, customer_id, products_bought
> FROM prod_bought
> WHERE product_id = @productId
> ORDER BY products_bought DESC
>
> FETCH product_cur INTO @productId
>
> END
> CLOSE product_cur
> DEALLOCATE product_cur
>
>
> --THE CROSS APPLY METHOD:
>
> --First create a user defined function
> CREATE FUNCTION dbo.fn_GetTopXCustomers(@ProductId INT)
> RETURNS TABLE
> AS
> RETURN
> SELECT TOP 5 product_id, customer_id, products_bought
> FROM prod_bought
> WHERE product_id = @productId
> ORDER BY products_bought desc
> GO
>
> --Build a table of distinct product Ids
> SELECT DISTINCT product_id INTO temp_distinct_product_ids FROM transactions
>
> --Run the CROSS APPLY
> SELECT A.product_id
> , A.customer_id
> , A.products_bought
> INTO top_customers_by_product
> FROM temp_distinct_product_ids T
> CROSS APPLY dbo.fn_GetTopXCustomers(T.product_id) A
>
>
> Okay, so there are two ways I could solve the problem in SQL (CROSS APPLY is
> dramatically faster for anyone that cares). How can I do the same thing in
> Pig? Here's the question restated: How can I implement a cursor in Pig? How
> can I do a for or while loop in Pig? Can I implement a CROSS APPLY in Pig?
>
>
> I realize that I can implement a cursor outside of Pig and just execute the
> same Pig script over and over and over again. And, that's not a horrible
> solution as long as it leverages the full power of Hadoop. My concern is
> that each individual query run inside the loop is fairly inexpensive, but
> the total number of products makes the total job *very* expensive.
>
> Also, the solution should be reusable -- I'd really prefer not to write a
> custom jar every time I run into this problem.
>
> Actually, I'm also not particularly religious about using Pig. If there's
> some other tech that does what I need, that's cool too.
>
> Thanks in advance.
>
> Mike Roberts
>
>