The optimizer runs when Pig is invoked from Java. However, until
recently join and multi-query optimization did not work together. See http://issues.apache.org/jira/browse/PIG-983
Alan.
On Oct 8, 2009, at 6:33 AM, Vincent BARAT wrote:
Ok, then I did some testing.
Actually, if I store my first JOIN into a file, I see a 50% increase
in the speed of all my subsequent computations.
I guess it may be related to the fact that I use Pig from Java
(maybe the optimizer doesn't work in that mode?).
Here is my code (including just the JOIN and the first computation):
Data loading:
-------------
Analytics.pigServer.registerQuery(
    "start_sessions = LOAD 'startSession_sample' USING PigStorage(',') "
    + "AS (sid:chararray, infoid:chararray, imei:chararray, start:long);");
Analytics.pigServer.registerQuery(
    "end_sessions = LOAD 'endSession_sample' USING PigStorage(',') "
    + "AS (sid:chararray, infoid:chararray, imei:chararray, end:long);");
First Join (with storage):
---------------------------
Analytics.pigServer.registerQuery(
    "sessions = JOIN start_sessions BY sid, end_sessions BY sid;");
Analytics.pigServer.store("sessions", "sessions");
Analytics.pigServer.registerQuery(
    "sessions = LOAD 'sessions' "
    + "AS (start_sessions::sid:chararray, start_sessions::infoid:chararray, "
    + "start_sessions::imei:chararray, start_sessions::start:long, "
    + "end_sessions::sid:chararray, end_sessions::infoid:chararray, "
    + "end_sessions::imei:chararray, end_sessions::end:long);");
First join (without storage):
-----------------------------
Analytics.pigServer.registerQuery(
    "sessions = JOIN start_sessions BY sid, end_sessions BY sid;");
First computation:
------------------
Analytics.pigServer.registerQuery(
    "session_periods = FOREACH sessions "
    + "GENERATE FLATTEN(SessionPeriods('" + timeBucket.toString() + "', start, end)) "
    + "AS (periodid:int, inner_length:long, outer_length:long);");
Analytics.pigServer.registerQuery(
    "period_sessions = GROUP session_periods BY periodid;");
Analytics.pigServer.registerQuery(
    "session_count_and_length = FOREACH period_sessions "
    + "GENERATE group, COUNT(session_periods), "
    + "SUM(session_periods.inner_length), "
    + "SUM(session_periods.outer_length);");
Analytics.pigServer.store("session_count_and_length",
    Analytics.getHadoopOutputFile("session_count_and_length", timeBucket));
Thejas Nair wrote:
Hi Zaki,
Please file a jira if you are able to identify the problem you were
facing and the steps to reproduce it.
Thanks,
Thejas
On 10/7/09 1:08 PM, "zaki rahaman" <zaki.raha...@gmail.com> wrote:
Vincent,
I've run into this problem before. If you know beforehand that you're
going to recycle this joined dataset for several different operations or
pipelines, it is worth your time to simply store it as an intermediate
result. While Pig can definitely handle this and the Multiquery Optimizer
is great, I've run into problems with it before (I can't remember exactly
what now), and pre-joining has worked well for me.
Hopefully you found some part of that useful.
On Wed, Oct 7, 2009 at 12:33 PM, Ashutosh Chauhan <ashutosh.chau...@gmail.com> wrote:
Hi Vincent,
Pig has a multi-query optimization which, when it fires, automatically
figures out that the join needs to be done only once, so no work is
repeated. If Pig determines that it is not safe to apply that
optimization, your join may be getting computed more than once. If
that's the case, it will be better to do the join and store it.
You can do that within the same script using "exec":
http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html#exec
You can read more about multi-query optimization here:
http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html#Multi-Query+Execution
Hope it helps,
Ashutosh
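
When a stored join is loaded back, the field names are lost unless the LOAD statement repeats the joined schema with relation-qualified names, as in the reload statement quoted earlier in this thread. The sketch below shows one way to generate that statement instead of typing it by hand. JoinedSchema and its methods are hypothetical helpers, not part of Pig, and the sketch assumes the stored file is read back with the default loader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Pig: builds the LOAD statement used to
// re-read a stored JOIN result with relation-qualified field names.
class JoinedSchema {

    // Prefix each "name:type" field with "alias::", the way the fields of
    // a JOIN result are named in this thread.
    static List<String> qualify(String alias, List<String> fields) {
        List<String> qualified = new ArrayList<>();
        for (String field : fields) {
            qualified.add(alias + "::" + field);
        }
        return qualified;
    }

    // Assemble "<relation> = LOAD '<path>' AS (<fields>);".
    static String loadStatement(String relation, String path, List<String> fields) {
        return relation + " = LOAD '" + path + "' AS ("
            + String.join(", ", fields) + ");";
    }

    public static void main(String[] args) {
        List<String> start = Arrays.asList(
            "sid:chararray", "infoid:chararray", "imei:chararray", "start:long");
        List<String> end = Arrays.asList(
            "sid:chararray", "infoid:chararray", "imei:chararray", "end:long");

        List<String> joined = new ArrayList<>(qualify("start_sessions", start));
        joined.addAll(qualify("end_sessions", end));

        // The resulting string is what would be handed to
        // PigServer.registerQuery(...) after the join has been stored.
        System.out.println(loadStatement("sessions", "sessions", joined));
    }
}
```

Keeping the two input schemas in one place means the reload statement cannot drift out of sync with the original LOAD statements.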
On Wed, Oct 7, 2009 at 10:54, Vincent BARAT <vincent.ba...@ubikod.com> wrote:
Hello,
I'm new to Pig, and I have a bunch of statements that process the same
input, which is actually the result of a JOIN between two very big data
sets (millions of entries).
I wonder if it is better (faster) to save the result of this JOIN into a
Hadoop file and then LOAD it, instead of just relying on Pig
optimizations?
Thanks a lot for your help.