Re: possible use of Pig for OLAP

Chris Olston Tue, 20 Nov 2007 10:57:20 -0800

Exactly. You can write "STORE X" for each handle X that you want a result for.
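
A minimal sketch of that pattern, reusing the sample.txt schema from the example quoted below. The COUNT aggregate and the output paths 'by_ip' and 'counts_by_ip' are assumptions made up for illustration:

A = load 'sample.txt' as (date, time, ip, query);
B = group A by ip;
-- one row per ip with the number of requests seen
C = foreach B generate group, COUNT(A);
-- each handle you want a result for gets its own store
store B into 'by_ip';
store C into 'counts_by_ip';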

The only issue is that it will create a separate execution job for each STORE command.

If you don't want to pay for doing it in multiple jobs, you could imagine adding a "side store" function to Pig, so that it can store side files but keep processing the "main" program.

It's possible that this can be accomplished today via the SPLIT command -- anyone care to comment?
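
For reference, a rough sketch of what a SPLIT-based version might look like. The split condition and the output paths are invented for illustration, the exact comparison syntax may differ between Pig versions, and whether both stores would run in a single job is exactly the open question:

A = load 'sample.txt' as (date, time, ip, query);
-- route each tuple to one of two handles based on a condition
split A into with_query if query != '', no_query if query == '';
store with_query into 'with_query';
store no_query into 'no_query';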


-Chris

On Nov 20, 2007, at 10:40 AM, Ted Dunning wrote:

Can you just explicitly save those intermediate results?


On 11/20/07 10:31 AM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:
Chris Olston wrote:
Sounds interesting. Pig is geared toward large-scale aggregation
operations, in the style of OLAP.

Regarding your 3rd paragraph question, do you mean:
a) there are several interrelated aggregation expressions that you want
evaluated in just one pass over the data, or
b) you do some initial aggregation, display it to the user, who can do
"drill-down" operations in the GUI which require you to look up more
data in the backend

?
For (a), yes Pig can do that, although currently you have to encode it
explicitly as a single Pig program (in future versions, we might be able
to take multiple related Pig programs and execute them in a joint
fashion). For (b), we don't currently have a mechanism to do that
without reloading the data, although perhaps the operating system's file
cache would help with that, under the covers, if the file partitions fit
in memory and don't get evicted.

Would it be possible to modify Pig (and underlying local/mapreduce impl)
so that if a specific syntax is used then an intermediate result is also
stored into a temporary file? This way, on the first dump/store Pig
would produce all intermediate results, then keep some of them, and
re-use them for subsequent operators?

Example - let's say that ':=' means that the result should be kept
around until exit (or until any of the previous intermediate results changes):
-- A is not persisted
A = load 'sample.txt' as (date, time, ip, query);
-- B is to be persisted in a temp file
B := group A by ip;
-- compile & execute - creates B in a temp file
dump B;
C = foreach B generate group, A.query;
-- this uses already existing B data from a temp file
dump C;


--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research
