It sounds like it would be better to accept multiple STORE commands in a single program and only trigger execution of the map-reduce steps when the equivalent of a "commit" or "run" is given (EOF being an implied commit).
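
To make the idea concrete, here is a toy Python sketch of the semantics only, not of Pig itself. The names DeferredProgram, store, and commit are made up for illustration and do not correspond to any existing Pig API. It assumes that a STORE merely records a request, and that a single "job" runs when commit is called, reading the input once and producing every requested output.

```python
# Toy model of deferred execution. Hypothetical names throughout;
# this is not how Pig is implemented.

class DeferredProgram:
    def __init__(self):
        self.pending = []   # (relation_fn, path) pairs queued by store()
        self.jobs_run = 0   # how many "jobs" were actually launched
        self.outputs = {}   # path -> stored result

    def store(self, relation_fn, path):
        # Record the request only; nothing executes yet.
        self.pending.append((relation_fn, path))

    def commit(self, rows):
        # Launch one job that reads the input once and evaluates
        # every queued relation against it.
        if not self.pending:
            return
        self.jobs_run += 1
        data = list(rows)
        for relation_fn, path in self.pending:
            self.outputs[path] = relation_fn(data)
        self.pending.clear()


def group_by_ip(rows):
    # Collect the queries seen for each ip.
    groups = {}
    for ip, query in rows:
        groups.setdefault(ip, []).append(query)
    return groups


def count_by_ip(rows):
    # Count the queries per ip, built on the grouping above.
    return {ip: len(queries) for ip, queries in group_by_ip(rows).items()}


program = DeferredProgram()
program.store(group_by_ip, "by_ip")    # queued, not run
program.store(count_by_ip, "counts")   # queued, not run
jobs_before_commit = program.jobs_run
program.commit([("1.1.1.1", "a"), ("1.1.1.1", "b"), ("2.2.2.2", "c")])
```

Two stores are queued but only one job runs, which is the behavior I am suggesting. The sketch says nothing about how the underlying map-reduce plans would actually be merged.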

On 11/20/07 11:27 AM, "Utkarsh Srivastava" <[EMAIL PROTECTED]> wrote:

> The current implementation of SPLIT will be no more efficient that
> explicitly calling STORE.
>
> Utkarsh
>
>
> On Nov 20, 2007, at 10:56 AM, Chris Olston wrote:
>
>> Exactly. You can write "STORE X" for each handle X that you want a
>> result for.
>>
>> The only issue is that it will create a separate execution job for
>> each STORE command.
>>
>> If you don't want to pay for doing it in multiple jobs, you could
>> imagine adding a "side store" function to Pig, so that it can store
>> side files but keep processing the "main" program.
>>
>> It's possible that this can be accomplished today via the SPLIT
>> command -- anyone care to comment?
>>
>> -Chris
>>
>> On Nov 20, 2007, at 10:40 AM, Ted Dunning wrote:
>>
>>>
>>> Can you just explicitly save those intermediate results?
>>>
>>>
>>> On 11/20/07 10:31 AM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:
>>>
>>>> Chris Olston wrote:
>>>>> Sounds interesting. Pig is geared toward large-scale aggregation
>>>>> operations, in the style of OLAP.
>>>>>
>>>>> Regarding your 3rd paragraph question, do you mean:
>>>>>
>>>>> a) there are several interrelated aggregation expressions that you
>>>>> want evaluated in just one pass over the data, or
>>>>> b) you do some initial aggregation, display it to the user, who can
>>>>> do "drill-down" operations in the GUI which require you to look up
>>>>> more data in the backend
>>>>>
>>>>> ?
>>>>>
>>>>> For (a), yes Pig can do that, although currently you have to encode
>>>>> it explicitly as a single Pig program (in future versions, we might
>>>>> be able to take multiple related Pig programs and execute them in a
>>>>> joint fashion). For (b), we don't currently have a mechanism to do
>>>>> that without reloading the data, although perhaps the operating
>>>>> system's file cache would help with that, under the covers, if the
>>>>> file partitions fit in memory and don't get evicted.
>>>>
>>>> Would it be possible to modify Pig (and underlying local/mapreduce
>>>> impl) so that if a specific syntax is used then an intermediate
>>>> result is also stored into a temporary file? This way, on the first
>>>> dump/store Pig would produce all intermediate results, then keep
>>>> some of them, and re-use them for subsequent operators?
>>>>
>>>> Example - let's say that ':=' means that the result should be kept
>>>> around until exit (or until any of previous intermediate results
>>>> changes):
>>>>
>>>> -- A is not persisted
>>>> A = load 'sample.txt' as (date, time, ip, query);
>>>> -- B is to be persisted in a temp file
>>>> B := group A by ip;
>>>> -- compile & execute - creates B in a temp file
>>>> dump B;
>>>> C = foreach B generate group, query;
>>>> -- this uses already existing B data from a temp file
>>>> dump C;
>>>>
>>>
>>
>> --
>> Christopher Olston, Ph.D.
>> Sr. Research Scientist
>> Yahoo! Research
>>
