pig-user  

Re: Optimization question

Jianyong Dai
Thu, 18 Mar 2010 15:30:04 -0700

For a bag, you need to project the fields manually. The current optimizer does not prune fields inside a bag: once you group records into a bag, all the fields inside the bag are marked as required. So #1 (projecting before the GROUP) is faster than #2.
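
The same idea applies when the later steps need more than the grouping key: keep only the fields you will actually use before the GROUP. A rough sketch, assuming a hypothetical schema for sessions (the load path and every field name other than imei are made up for illustration):

```pig
-- Hypothetical input: the path and all fields except imei are assumptions.
sessions = LOAD 'sessions' AS (imei:chararray, start_time:long, duration:long, user_agent:chararray);

-- Project manually so that only the needed fields end up inside the bag.
slim = FOREACH sessions GENERATE imei, duration;

-- Each bag now holds (imei, duration) tuples instead of full session tuples.
by_imei = GROUP slim BY imei;
stats = FOREACH by_imei GENERATE group AS imei, COUNT(slim) AS n_sessions, SUM(slim.duration) AS total_duration;
```

If you want to check what a script carries through the GROUP, running EXPLAIN on the final alias (here, EXPLAIN stats;) prints the plans so the two variants can be compared.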

Daniel

Vincent Barat wrote:
Hi,

I wonder whether it is faster to first extract only the interesting fields from a bag of tuples before performing other operations on it, or whether the optimizer handles this automatically.

For example, is:

ssessions = FOREACH sessions GENERATE imei;
imei_sessions = GROUP ssessions BY imei;
imei_session_count = FOREACH imei_sessions GENERATE group, COUNT(ssessions);

faster than:

imei_sessions = GROUP sessions BY imei;
imei_session_count = FOREACH imei_sessions GENERATE group, COUNT(sessions);

Thanks for your help