Hi Zaki,
Please file a jira if you are able to identify the problem you were facing
and the steps to reproduce it.
Thanks,
Thejas




On 10/7/09 1:08 PM, "zaki rahaman" <zaki.raha...@gmail.com> wrote:

> Vincent,
> 
> I've run into this problem before. If you know beforehand that you're going
> to recycle this joined dataset for several different operations or
> pipelines, it is worth your time to simply store it as an intermediate
> result. While Pig can definitely handle this and the multi-query optimizer
> is great, I've run into problems with it before (I can't remember exactly
> what at the moment), and pre-joining has worked well for me.
> 
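> A rough sketch of what I mean; the relation names, paths, and the join key
> here are all made up for illustration:
> 
> -- first script: do the expensive join once and persist the result
> a = LOAD 'input_a' AS (id:int, x:int);
> b = LOAD 'input_b' AS (id:int, y:chararray);
> j = JOIN a BY id, b BY id;
> STORE j INTO 'joined';
> 
> -- later scripts just reload the pre-joined data instead of redoing the join
> j = LOAD 'joined' AS (a_id:int, x:int, b_id:int, y:chararray);
> 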
> Hopefully you found some part of that useful.
> 
> On Wed, Oct 7, 2009 at 12:33 PM, Ashutosh Chauhan
> <ashutosh.chau...@gmail.com> wrote:
> 
>> Hi Vincent,
>> 
>> Pig has a multi-query optimization which, if it fires, will automatically
>> figure out that the join needs to be done only once, so there will not be
>> any repetition of work. If Pig determines that it's not safe to do that
>> optimization, then it's possible that your join is getting computed more
>> than once. If that's the case, it is better to do the join once and store
>> the result. You can do that within the same script using "exec":
>> http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html#exec
>> 
>> You can read more about multi-query optimization here:
>> 
>> http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html#Multi-Query+Execution
>> 
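>> For illustration only (all relation names and paths below are made up), a
>> single script along these lines would let the multi-query optimization
>> compute the join once while feeding two different outputs:
>> 
>> a = LOAD 'input_a' AS (id:int, x:int);
>> b = LOAD 'input_b' AS (id:int, y:chararray);
>> j = JOIN a BY id, b BY id;
>> out1 = FILTER j BY x > 0;   -- one pipeline over the joined data
>> out2 = GROUP j BY y;        -- another pipeline over the same join
>> STORE out1 INTO 'output1';
>> STORE out2 INTO 'output2';
>> 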
>> Hope it helps,
>> Ashutosh
>> 
On Wed, Oct 7, 2009 at 10:54, Vincent BARAT <vincent.ba...@ubikod.com>
>> wrote:
>> 
>>> Hello,
>>> 
>>> I'm new to Pig, and I have a bunch of statements that process the same
>>> input, which is actually the result of a JOIN between two very big data
>>> sets (millions of entries).
>>> 
>>> I wonder if it is better (faster) to save the result of this JOIN into a
>>> Hadoop file and then LOAD it, instead of just relying on Pig's
>>> optimizations?
>>> 
>>> Thanks a lot for your help.
>>> 
>> 
> 
> 
