It sounds like it would be better to accept multiple STORE commands in a
single program and only trigger execution of the map-reduce steps when the
equivalent of a "commit" or "run" is given (EOF being an implied commit).
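
To make the idea concrete, here is a toy sketch in Python of what I mean by
deferring execution. It is not Pig code and says nothing about how Pig is
actually implemented; the class and method names (DeferredScript, store,
commit) and the output paths are made up for illustration. STORE only
records a request, and commit launches one job covering everything recorded
so far.

```python
from dataclasses import dataclass, field


@dataclass
class DeferredScript:
    """Hypothetical model: queue STORE requests, run them together on commit."""
    pending: list = field(default_factory=list)   # (alias, path) pairs not yet executed
    jobs_run: list = field(default_factory=list)  # one entry per launched job

    def store(self, alias, path):
        # Record the request only; nothing executes yet.
        self.pending.append((alias, path))

    def commit(self):
        # Launch a single job covering every pending STORE, then clear the queue.
        # Reaching EOF would simply call this once more.
        if not self.pending:
            return None
        job = tuple(self.pending)
        self.jobs_run.append(job)
        self.pending.clear()
        return job


script = DeferredScript()
script.store("B", "out/by_ip")     # hypothetical output paths
script.store("C", "out/queries")
job = script.commit()              # both stores end up in the same job
```

With per-STORE execution the two stores above would be two jobs; here they
are one, and a second commit with nothing pending is a no-op.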



On 11/20/07 11:27 AM, "Utkarsh Srivastava" <[EMAIL PROTECTED]> wrote:

> The current implementation of SPLIT will be no more efficient than
> explicitly calling STORE.
> 
> Utkarsh
> 
> 
> On Nov 20, 2007, at 10:56 AM, Chris Olston wrote:
> 
>> Exactly. You can write "STORE X" for each handle X that you want a
>> result for.
>> 
>> The only issue is that it will create a separate execution job for
>> each STORE command.
>> 
>> If you don't want to pay for doing it in multiple jobs, you could
>> imagine adding a "side store" function to Pig, so that it can store
>> side files but keep processing the "main" program.
>> 
>> It's possible that this can be accomplished today via the SPLIT
>> command -- anyone care to comment?
>> 
>> -Chris
>> 
>> On Nov 20, 2007, at 10:40 AM, Ted Dunning wrote:
>> 
>>> 
>>> Can you just explicitly save those intermediate results?
>>> 
>>> 
>>> On 11/20/07 10:31 AM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:
>>> 
>>>> Chris Olston wrote:
>>>>> Sounds interesting. Pig is geared toward large-scale aggregation
>>>>> operations, in the style of OLAP.
>>>>> 
>>>>> Regarding your 3rd paragraph question, do you mean:
>>>>> 
>>>>> a) there are several interrelated aggregation expressions that you
>>>>> want evaluated in just one pass over the data, or
>>>>> b) you do some initial aggregation, display it to the user, who can
>>>>> do "drill-down" operations in the GUI which require you to look up
>>>>> more data in the backend
>>>>> 
>>>>> ?
>>>>> 
>>>>> For (a), yes Pig can do that, although currently you have to encode
>>>>> it explicitly as a single Pig program (in future versions, we might
>>>>> be able to take multiple related Pig programs and execute them in a
>>>>> joint fashion). For (b), we don't currently have a mechanism to do
>>>>> that without reloading the data, although perhaps the operating
>>>>> system's file cache would help with that, under the covers, if the
>>>>> file partitions fit in memory and don't get evicted.
>>>> 
>>>> Would it be possible to modify Pig (and underlying local/mapreduce
>>>> impl) so that if a specific syntax is used then an intermediate
>>>> result is also stored into a temporary file? This way, on the first
>>>> dump/store Pig would produce all intermediate results, then keep
>>>> some of them, and re-use them for subsequent operators?
>>>> 
>>>> Example - let's say that ':=' means that the result should be kept
>>>> around until exit (or until any of the previous intermediate
>>>> results changes):
>>>> 
>>>> -- A is not persisted
>>>> A = load 'sample.txt' as (date, time, ip, query);
>>>> -- B is to be persisted in a temp file
>>>> B := group A by ip;
>>>> -- compile & execute - creates B in a temp file
>>>> dump B;
>>>> C = foreach B generate group, A.query;
>>>> -- this uses already existing B data from a temp file
>>>> dump C;
>>>> 
>>> 
>> 
>> --
>> Christopher Olston, Ph.D.
>> Sr. Research Scientist
>> Yahoo! Research
>> 
>> 
> 
