Store later or store delayed or checkpoint all sound good as a way of expressing this.
I agree that the normal user shouldn't bear the cost of a feature like this.

On 11/20/07 11:58 AM, "Chris Olston" <[EMAIL PROTECTED]> wrote:

> Yes, that's an option.
>
> For the final "commit" you'd have to associate an explicit scope --
> which STORE statement(s) do I want the system to materialize for me?
> Or is it implicitly all as-yet-unmaterialized STORE commands in the
> current session?
>
> If this change gets made, it'd be good to ensure that the "old way"
> still works -- most users won't need this functionality and we don't
> want to complicate their lives by making them type STORE followed by
> COMMIT each time. Maybe we add a new command "STORE LATER" or
> something, for the case where you want to register a STORE but have
> it happen later as part of a batch of stores:
>
> A = LOAD ...
> B = LOAD ...
> C = FILTER A BY ...
> STORE LATER C INTO ...
> D = JOIN A, B ...
> STORE LATER D INTO ...
> EXECUTE STORE C, D;
>
> or something along these lines.
>
> -Chris
>
>
> On Nov 20, 2007, at 11:41 AM, Ted Dunning wrote:
>
>>
>>
>> It sounds like it would be better to accept multiple STORE commands
>> in a single program and only trigger execution of the map-reduce
>> steps when the equivalent of a "commit" or "run" is given (EOF being
>> an implied commit).
>>
>>
>>
>> On 11/20/07 11:27 AM, "Utkarsh Srivastava" <[EMAIL PROTECTED]>
>> wrote:
>>
>>> The current implementation of SPLIT will be no more efficient than
>>> explicitly calling STORE.
>>>
>>> Utkarsh
>>>
>>>
>>> On Nov 20, 2007, at 10:56 AM, Chris Olston wrote:
>>>
>>>> Exactly. You can write "STORE X" for each handle X that you want a
>>>> result for.
>>>>
>>>> The only issue is that it will create a separate execution job for
>>>> each STORE command.
>>>>
>>>> If you don't want to pay for doing it in multiple jobs, you could
>>>> imagine adding a "side store" function to Pig, so that it can store
>>>> side files but keep processing the "main" program.
>>>>
>>>> It's possible that this can be accomplished today via the SPLIT
>>>> command -- anyone care to comment?
>>>>
>>>> -Chris
>>>>
>>>> On Nov 20, 2007, at 10:40 AM, Ted Dunning wrote:
>>>>
>>>>>
>>>>> Can you just explicitly save those intermediate results?
>>>>>
>>>>>
>>>>> On 11/20/07 10:31 AM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Chris Olston wrote:
>>>>>>> Sounds interesting. Pig is geared toward large-scale aggregation
>>>>>>> operations, in the style of OLAP.
>>>>>>>
>>>>>>> Regarding your 3rd paragraph question, do you mean:
>>>>>>>
>>>>>>> a) there are several interrelated aggregation expressions that
>>>>>>> you want evaluated in just one pass over the data, or
>>>>>>> b) you do some initial aggregation, display it to the user, who
>>>>>>> can do "drill-down" operations in the GUI which require you to
>>>>>>> look up more data in the backend
>>>>>>>
>>>>>>> ?
>>>>>>>
>>>>>>> For (a), yes Pig can do that, although currently you have to
>>>>>>> encode it explicitly as a single Pig program (in future versions,
>>>>>>> we might be able to take multiple related Pig programs and
>>>>>>> execute them in a joint fashion). For (b), we don't currently
>>>>>>> have a mechanism to do that without reloading the data, although
>>>>>>> perhaps the operating system's file cache would help with that,
>>>>>>> under the covers, if the file partitions fit in memory and don't
>>>>>>> get evicted.
>>>>>>
>>>>>> Would it be possible to modify Pig (and underlying local/mapreduce
>>>>>> impl) so that if a specific syntax is used then an intermediate
>>>>>> result is also stored into a temporary file? This way, on the
>>>>>> first dump/store Pig would produce all intermediate results, then
>>>>>> keep some of them, and re-use them for subsequent operators?
>>>>>>
>>>>>> Example - let's say that ':=' means that the result should be kept
>>>>>> around until exit (or until any of previous intermediate results
>>>>>> changes):
>>>>>>
>>>>>> -- A is not persisted
>>>>>> A = load 'sample.txt' as (date, time, ip, query);
>>>>>> -- B is to be persisted in a temp file
>>>>>> B := group A by ip;
>>>>>> -- compile & execute - creates B in a temp file
>>>>>> dump B;
>>>>>> C = foreach B generate group, query;
>>>>>> -- this uses already existing B data from a temp file
>>>>>> dump C;
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Christopher Olston, Ph.D.
>>>> Sr. Research Scientist
>>>> Yahoo! Research
>>>>
>>>>
>>>
>>
>
> --
> Christopher Olston, Ph.D.
> Sr. Research Scientist
> Yahoo! Research
>
>
