Hi,
I have a script roughly analogous to this:
users = LOAD '/users.tsv' AS (id);
sessions = LOAD '/sessions.tsv' AS (id, userid, duration, day);
user_sessions = JOIN users BY id INNER, sessions BY userid INNER;
intermediate_aggregate = FOREACH (GROUP user_sessions BY (userid, day))
{
-- Code omitted
};
daily_use = FOREACH (GROUP intermediate_aggregate BY day)
{
-- Code omitted
};
STORE daily_use INTO '/daily.tsv';
monthly_use = FOREACH (GROUP intermediate_aggregate BY user)
{
-- Code omitted
};
STORE monthly_use INTO '/monthly.tsv';
I realize this script is poorly written; it only illustrates the
issue. I would like Pig to calculate intermediate_aggregate once and
then reuse it for both the daily and monthly use calculations (in my
actual script the intermediate step takes about 45 minutes and is
followed by ~20 subsidiary tasks that take ~5 minutes each). Pig (0.1)
currently recalculates intermediate_aggregate for the monthly use
calculation. Is there any way to reuse the initial calculation? If
not, is anyone working on this, does anyone know it to be impossible,
or can someone tell me where to start on a patch? Obviously, I can do
this manually, but it seems like a reasonable thing for Pig to do.
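For reference, doing it manually would look roughly like the sketch
below: store the intermediate relation once, then load it back for
each subsidiary calculation instead of recomputing it. The path
'/tmp/intermediate_aggregate' and the alias saved_aggregate are
made-up placeholders, and the LOAD has no AS clause because the real
columns depend on the code I omitted.

```pig
-- Sketch only: materialize the shared relation once.
-- '/tmp/intermediate_aggregate' is a hypothetical path.
STORE intermediate_aggregate INTO '/tmp/intermediate_aggregate';

-- Later (possibly in a separate script), read the stored result back
-- instead of recomputing it. saved_aggregate is a hypothetical alias.
-- With no AS clause, fields are referenced by position ($0, $1, ...).
saved_aggregate = LOAD '/tmp/intermediate_aggregate';
```

The daily and monthly groupings would then run against
saved_aggregate rather than intermediate_aggregate.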
Thank you,
Marshall Weir