Hi,

I have a script roughly analogous to this:

users = LOAD '/users.tsv' AS (id);

sessions = LOAD '/sessions.tsv' AS (id, userid, duration, day);

user_sessions = JOIN users BY id INNER, sessions BY userid INNER;

intermediate_aggregate = FOREACH (GROUP user_sessions BY (userid, day))
{
        -- Code omitted
}

daily_use = FOREACH (GROUP intermediate_aggregate BY day)
{
        -- Code omitted
}

STORE daily_use INTO '/daily.tsv';

monthly_use = FOREACH (GROUP intermediate_aggregate BY user)
{
        -- Code omitted
}

STORE monthly_use INTO '/monthly.tsv';

I realize this script is poorly written; it is only meant to illustrate the issue. I would like Pig to calculate intermediate_aggregate once and then reuse it for both the daily and monthly calculations. (My actual script computes the intermediate result in about 45 minutes, then runs ~20 subsidiary tasks that take ~5 minutes each.) Pig (0.1) currently recalculates intermediate_aggregate for the monthly use calculation. Is there any way to reuse the initial calculation? If not, is anyone working on this, does anyone know it to be impossible, or can someone tell me where to start on a patch? Obviously, I can do this manually, but it seems like a reasonable thing for Pig to do.
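
By "manually" I mean roughly the sketch below: store the intermediate result once, then load it back for each downstream calculation. The path '/intermediate.tsv', the alias saved_aggregate, the reloaded field names (userid, day, duration), and the SUM aggregates are all placeholders standing in for the omitted code, not what my real script does. It also assumes the STORE has finished before the later LOAD reads the same path; if that does not hold, the two halves would need to live in separate scripts.

-- Materialize the expensive intermediate result once.
STORE intermediate_aggregate INTO '/intermediate.tsv';

-- Reload it; the field names here are assumed, since the real schema
-- depends on the omitted FOREACH body.
saved_aggregate = LOAD '/intermediate.tsv' AS (userid, day, duration);

-- Both downstream jobs read the stored copy instead of recomputing it.
daily_use = FOREACH (GROUP saved_aggregate BY day)
        GENERATE group AS day, SUM(saved_aggregate.duration);

STORE daily_use INTO '/daily.tsv';

monthly_use = FOREACH (GROUP saved_aggregate BY userid)
        GENERATE group AS userid, SUM(saved_aggregate.duration);

STORE monthly_use INTO '/monthly.tsv';

This works, but with ~20 subsidiary tasks it means managing the temporary file and its schema by hand, which is the bookkeeping I was hoping Pig could do for me.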

Thank you,
Marshall Weir
