"Store later", "store delayed", or "checkpoint" all sound like good ways of
expressing this.

I agree that the normal user shouldn't bear the cost of a feature like this.
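
To make the semantics concrete, here is a rough Python sketch of the
behaviour being discussed. It is not Pig code: the Session class and the
store / store_later / execute names are made up for illustration, and a
"job" is just a counter. It shows that a plain STORE still runs right away
(so the normal user pays nothing), while deferred stores are batched into a
single job, either for an explicit list of aliases or, if none are given,
for everything still pending.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Toy model of a session that can defer STORE commands (hypothetical)."""
    jobs_run: int = 0
    pending: list = field(default_factory=list)  # (alias, path) pairs not yet run
    log: list = field(default_factory=list)      # one entry per job: aliases written

    def store(self, alias, path):
        # The "old way": each STORE triggers its own job immediately.
        self._run_job([(alias, path)])

    def store_later(self, alias, path):
        # Register the store but do not run anything yet.
        self.pending.append((alias, path))

    def execute(self, *aliases):
        # Run deferred stores as ONE job. With no aliases given, flush
        # everything still pending (the "implicit scope" option).
        chosen = [p for p in self.pending if not aliases or p[0] in aliases]
        if not chosen:
            return
        self.pending = [p for p in self.pending if p not in chosen]
        self._run_job(chosen)

    def _run_job(self, stores):
        # Stand-in for compiling and launching a map-reduce job.
        self.jobs_run += 1
        self.log.append([alias for alias, _ in stores])
```

With this model, two store_later calls followed by execute("C", "D") count
as one job, whereas two plain store calls count as two.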


On 11/20/07 11:58 AM, "Chris Olston" <[EMAIL PROTECTED]> wrote:

> Yes, that's an option.
> 
> For the final "commit" you'd have to associate an explicit scope --
> which STORE statement(s) do I want the system to materialize for me?
> Or is it implicitly all as-yet-unmaterialized STORE commands in the
> current session?
> 
> If this change gets made, it'd be good to ensure that the "old way"
> still works -- most users won't need this functionality and we don't
> want to complicate their lives by making them type STORE followed by
> COMMIT each time.  Maybe we add a new command "STORE LATER" or
> something, for the case where you want to register a STORE but have
> it happen later as part of a batch of stores:
> 
> A = LOAD ...
> B = LOAD ...
> C = FILTER A BY ...
> STORE LATER C INTO ...
> D = JOIN A, B ...
> STORE LATER D INTO ...
> EXECUTE STORE C, D;
> 
> or something along these lines.
> 
> -Chris
> 
> 
> On Nov 20, 2007, at 11:41 AM, Ted Dunning wrote:
> 
>> 
>> 
>> It sounds like it would be better to accept multiple STORE commands in a
>> single program and only trigger execution of the map-reduce steps when the
>> equivalent of a "commit" or "run" is given (EOF being an implied commit).
>> 
>> 
>> 
>> On 11/20/07 11:27 AM, "Utkarsh Srivastava" <[EMAIL PROTECTED]> wrote:
>> 
>>> The current implementation of SPLIT will be no more efficient than
>>> explicitly calling STORE.
>>> 
>>> Utkarsh
>>> 
>>> 
>>> On Nov 20, 2007, at 10:56 AM, Chris Olston wrote:
>>> 
>>>> Exactly. You can write "STORE X" for each handle X that you want a
>>>> result for.
>>>> 
>>>> The only issue is that it will create a separate execution job for
>>>> each STORE command.
>>>> 
>>>> If you don't want to pay for doing it in multiple jobs, you could
>>>> imagine adding a "side store" function to Pig, so that it can store
>>>> side files but keep processing the "main" program.
>>>> 
>>>> It's possible that this can be accomplished today via the SPLIT
>>>> command -- anyone care to comment?
>>>> 
>>>> -Chris
>>>> 
>>>> On Nov 20, 2007, at 10:40 AM, Ted Dunning wrote:
>>>> 
>>>>> 
>>>>> Can you just explicitly save those intermediate results?
>>>>> 
>>>>> 
>>>>> On 11/20/07 10:31 AM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:
>>>>> 
>>>>>> Chris Olston wrote:
>>>>>>> Sounds interesting. Pig is geared toward large-scale aggregation
>>>>>>> operations, in the style of OLAP.
>>>>>>> 
>>>>>>> Regarding your 3rd paragraph question, do you mean:
>>>>>>> 
>>>>>>> a) there are several interrelated aggregation expressions that you want
>>>>>>> evaluated in just one pass over the data, or
>>>>>>> b) you do some initial aggregation, display it to the user, who can do
>>>>>>> "drill-down" operations in the GUI which require you to look up more
>>>>>>> data in the backend
>>>>>>> 
>>>>>>> ?
>>>>>>> 
>>>>>>> For (a), yes Pig can do that, although currently you have to encode it
>>>>>>> explicitly as a single Pig program (in future versions, we might be able
>>>>>>> to take multiple related Pig programs and execute them in a joint
>>>>>>> fashion). For (b), we don't currently have a mechanism to do that
>>>>>>> without reloading the data, although perhaps the operating system's file
>>>>>>> cache would help with that, under the covers, if the file partitions fit
>>>>>>> in memory and don't get evicted.
>>>>>> 
>>>>>> Would it be possible to modify Pig (and underlying local/mapreduce impl)
>>>>>> so that if a specific syntax is used then an intermediate result is also
>>>>>> stored into a temporary file? This way, on the first dump/store Pig
>>>>>> would produce all intermediate results, then keep some of them, and
>>>>>> re-use them for subsequent operators?
>>>>>> 
>>>>>> Example - let's say that ':=' means that the result should be kept
>>>>>> around until exit (or until any of previous intermediate results
>>>>>> changes):
>>>>>> 
>>>>>> -- A is not persisted
>>>>>> A = load 'sample.txt' as (date, time, ip, query);
>>>>>> -- B is to be persisted in a temp file
>>>>>> B := group A by ip;
>>>>>> -- compile & execute - creates B in a temp file
>>>>>> dump B;
>>>>>> C = foreach B generate group, query;
>>>>>> -- this uses already existing B data from a temp file
>>>>>> dump C;
>>>>>> 
>>>>> 
>>>> 
>>>> --
>>>> Christopher Olston, Ph.D.
>>>> Sr. Research Scientist
>>>> Yahoo! Research
>>>> 
>>>> 
>>> 
>> 
> 
> --
> Christopher Olston, Ph.D.
> Sr. Research Scientist
> Yahoo! Research
> 
> 
