Can you just explicitly save those intermediate results?
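
A minimal sketch of what I mean, reusing the names from the example quoted below. The output path 'tmp/B_by_ip' is made up for illustration, and the choice of BinStorage() (so that the nested bags survive the round trip) is my assumption, not something I have tested against this data:

```pig
-- First run: compute the grouping once and save it explicitly.
A = load 'sample.txt' as (date, time, ip, query);
B = group A by ip;
-- 'tmp/B_by_ip' is a hypothetical path; BinStorage is assumed here
-- because B holds nested bags rather than flat text fields.
store B into 'tmp/B_by_ip' using BinStorage();

-- Later run: reload the saved result instead of regrouping A.
B = load 'tmp/B_by_ip' using BinStorage();
-- No schema is declared on reload, so fields are addressed by position:
-- $0 is the group key (ip), $1 is the bag of A tuples, $1.$3 is query.
C = foreach B generate $0, $1.$3;
dump C;
```

The trade-off is that you have to manage (and clean up) the saved file yourself, but it needs no change to Pig.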

On 11/20/07 10:31 AM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:

> Chris Olston wrote:
>> Sounds interesting. Pig is geared toward large-scale aggregation
>> operations, in the style of OLAP.
>> 
>> Regarding your 3rd paragraph question, do you mean:
>> 
>> a) there are several interrelated aggregation expressions that you want
>> evaluated in just one pass over the data, or
>> b) you do some initial aggregation, display it to the user, who can do
>> "drill-down" operations in the GUI which require you to look up more
>> data in the backend
>> 
>> ?
>> 
>> For (a), yes Pig can do that, although currently you have to encode it
>> explicitly as a single Pig program (in future versions, we might be able
>> to take multiple related Pig programs and execute them in a joint
>> fashion). For (b), we don't currently have a mechanism to do that
>> without reloading the data, although perhaps the operating system's file
>> cache would help with that, under the covers, if the file partitions fit
>> in memory and don't get evicted.
> 
> Would it be possible to modify Pig (and the underlying local/mapreduce
> implementation) so that, when a specific syntax is used, an intermediate
> result is also stored in a temporary file? This way, on the first dump/store Pig
> would produce all intermediate results, then keep some of them, and
> re-use them for subsequent operators?
> 
> Example - let's say that ':=' means that the result should be kept
> around until exit (or until any of the previous intermediate results changes):
> 
> -- A is not persisted
> A = load 'sample.txt' as (date, time, ip, query);
> -- B is to be persisted in a temp file
> B := group A by ip;
> -- compile & execute - creates B in a temp file
> dump B;
> C = foreach B generate group, query;
> -- this uses already existing B data from a temp file
> dump C;
> 
