Can you just explicitly save those intermediate results?
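
As a rough sketch of what I mean, using your example: store the grouped result once, then load the stored copy for later steps. The path 'tmp/B_by_ip' and the aliases B_saved, ip_key and records are made up for illustration, and the nested schema on the load is an assumption about how your Pig version reads bags back from a text file. It also assumes the field values contain no commas, braces or parentheses, since the default text storage would misparse those.

```pig
-- Sketch only: path and alias names are hypothetical.
A = load 'sample.txt' as (date, time, ip, query);
B = group A by ip;

-- Write the grouped result out once, explicitly.
store B into 'tmp/B_by_ip';

-- Later: read the saved copy instead of regrouping A.
-- The bag schema below is an assumption; adjust it to your Pig version.
B_saved = load 'tmp/B_by_ip'
          as (ip_key, records: bag{t: (date, time, ip, query)});
C = foreach B_saved generate ip_key, records.query;
dump C;
```

This does by hand what your ':=' would do automatically, except that you have to clean up the saved file yourself and redo the store whenever A changes.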
On 11/20/07 10:31 AM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:
> Chris Olston wrote:
>> Sounds interesting. Pig is geared toward large-scale aggregation
>> operations, in the style of OLAP.
>>
>> Regarding your 3rd paragraph question, do you mean:
>>
>> a) there are several interrelated aggregation expressions that you want
>> evaluated in just one pass over the data, or
>> b) you do some initial aggregation, display it to the user, who can do
>> "drill-down" operations in the GUI which require you to look up more
>> data in the backend
>>
>> ?
>>
>> For (a), yes Pig can do that, although currently you have to encode it
>> explicitly as a single Pig program (in future versions, we might be able
>> to take multiple related Pig programs and execute them in a joint
>> fashion). For (b), we don't currently have a mechanism to do that
>> without reloading the data, although perhaps the operating system's file
>> cache would help with that, under the covers, if the file partitions fit
>> in memory and don't get evicted.
>
> Would it be possible to modify Pig (and underlying local/mapreduce impl)
> so that if a specific syntax is used then an intermediate result is also
> stored into a temporary file? This way, on the first dump/store Pig
> would produce all intermediate results, then keep some of them, and
> re-use them for subsequent operators?
>
> Example - let's say that ':=' means that the result should be kept
> around until exit (or until any of previous intermediate results changes):
>
> -- A is not persisted
> A = load 'sample.txt' as (date, time, ip, query);
> -- B is to be persisted in a temp file
> B := group A by ip;
> -- compile & execute - creates B in a temp file
> dump B;
> C = foreach B generate group, query;
> -- this uses already existing B data from a temp file
> dump C;
>
