Yes, that's an option.

For the final "commit" you'd have to associate an explicit scope -- which STORE statement(s) do I want the system to materialize for me? Or is it implicitly all as-yet-unmaterialized STORE commands in the current session?

If this change gets made, it'd be good to ensure that the "old way" still works -- most users won't need this functionality and we don't want to complicate their lives by making them type STORE followed by COMMIT each time. Maybe we add a new command "STORE LATER" or something, for the case where you want to register a STORE but have it happen later as part of a batch of stores:

A = LOAD ...
B = LOAD ...
C = FILTER A BY ...
STORE LATER C INTO ...
D = JOIN A, B ...
STORE LATER D INTO ...
EXECUTE STORE C, D;

or something along these lines.
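
To make the scoping question concrete, here is a rough Python sketch of the bookkeeping a session would need. None of this is Pig code: Session, store_later and execute_store are made-up names, and a "job" is modeled as nothing more than the list of aliases it materializes. It assumes that an EXECUTE STORE with no arguments means "all pending stores", which is only one of the two options raised above.

```python
from typing import Dict, List


class Session:
    """Toy model of a session that can defer STORE commands (hypothetical)."""

    def __init__(self) -> None:
        # alias -> output path, registered but not yet materialized
        self.pending: Dict[str, str] = {}
        # each entry lists the aliases materialized by one execution job
        self.jobs: List[List[str]] = []

    def store(self, alias: str, path: str) -> None:
        """The "old way": a plain STORE runs immediately as its own job."""
        self.jobs.append([alias])

    def store_later(self, alias: str, path: str) -> None:
        """Register a store without triggering execution."""
        self.pending[alias] = path

    def execute_store(self, *aliases: str) -> None:
        """Materialize the named pending stores in one job.

        With no arguments, the scope is every pending store (an assumption).
        """
        scope = list(aliases) if aliases else list(self.pending)
        missing = [a for a in scope if a not in self.pending]
        if missing:
            raise KeyError(f"no pending store for: {missing}")
        for alias in scope:
            del self.pending[alias]
        if scope:
            self.jobs.append(scope)


if __name__ == "__main__":
    s = Session()
    s.store_later("C", "out/c")
    s.store_later("D", "out/d")
    s.execute_store("C", "D")
    print(s.jobs)  # one job covering both stores
```

The point of the sketch is only that plain store() keeps its one-job-per-STORE behavior, while the deferred path lets several stores share a job.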

-Chris


On Nov 20, 2007, at 11:41 AM, Ted Dunning wrote:



It sounds like it would be better to accept multiple STORE commands in a single program and only trigger execution of the map-reduce steps when the equivalent of a "commit" or "run" is given (EOF being an implied commit).



On 11/20/07 11:27 AM, "Utkarsh Srivastava" <[EMAIL PROTECTED]> wrote:

The current implementation of SPLIT will be no more efficient than
explicitly calling STORE.

Utkarsh


On Nov 20, 2007, at 10:56 AM, Chris Olston wrote:

Exactly. You can write "STORE X" for each handle X that you want a
result for.

The only issue is that it will create a separate execution job for
each STORE command.

If you don't want to pay for doing it in multiple jobs, you could
imagine adding a "side store" function to Pig, so that it can store
side files but keep processing the "main" program.

It's possible that this can be accomplished today via the SPLIT
command -- anyone care to comment?

-Chris

On Nov 20, 2007, at 10:40 AM, Ted Dunning wrote:


Can you just explicitly save those intermediate results?


On 11/20/07 10:31 AM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:

Chris Olston wrote:
Sounds interesting. Pig is geared toward large-scale aggregation
operations, in the style of OLAP.

Regarding your 3rd paragraph question, do you mean:

a) there are several interrelated aggregation expressions that you want evaluated in just one pass over the data, or
b) you do some initial aggregation, display it to the user, who can do "drill-down" operations in the GUI which require you to look up more data in the backend?

For (a), yes Pig can do that, although currently you have to encode it explicitly as a single Pig program (in future versions, we might be able to take multiple related Pig programs and execute them in a joint fashion). For (b), we don't currently have a mechanism to do that without reloading the data, although perhaps the operating system's file cache would help with that, under the covers, if the file partitions fit in memory and don't get evicted.

Would it be possible to modify Pig (and the underlying local/mapreduce impl) so that if a specific syntax is used then an intermediate result is also stored into a temporary file? This way, on the first dump/store Pig would produce all intermediate results, then keep some of them, and re-use them for subsequent operators.

Example - let's say that ':=' means that the result should be kept around until exit (or until any of the preceding intermediate results changes):

-- A is not persisted
A = load 'sample.txt' as (date, time, ip, query);
-- B is to be persisted in a temp file
B := group A by ip;
-- compile & execute - creates B in a temp file
dump B;
C = foreach B generate group, A.query;
-- this uses already existing B data from a temp file
dump C;
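
The same idea can be modeled outside Pig in a few lines of Python. This is only a sketch of the caching rule described above, not a proposal for the implementation: Plan, define and dump are hypothetical names, an in-memory dict stands in for the temp file, and redefining an alias is treated as the "previous intermediate result changes" case.

```python
class Plan:
    """Toy model of ':=' persistence with invalidation (hypothetical)."""

    def __init__(self):
        self.defs = {}         # alias -> (function, input aliases, persist flag)
        self.cache = {}        # stand-in for temp files holding persisted results
        self.evaluations = []  # log of aliases that were actually computed

    def define(self, alias, fn, inputs=(), persist=False):
        """Register alias = fn(inputs); persist=True plays the role of ':='."""
        self.defs[alias] = (fn, tuple(inputs), persist)
        self._invalidate(alias)

    def _invalidate(self, alias):
        """Drop cached results for alias and everything downstream of it."""
        stale, todo = set(), [alias]
        while todo:
            cur = todo.pop()
            if cur in stale:
                continue
            stale.add(cur)
            self.cache.pop(cur, None)
            todo.extend(o for o, (_, ins, _) in self.defs.items() if cur in ins)

    def dump(self, alias):
        """Compute alias, reusing persisted upstream results where possible."""
        if alias in self.cache:
            return self.cache[alias]
        fn, inputs, persist = self.defs[alias]
        self.evaluations.append(alias)
        result = fn(*(self.dump(i) for i in inputs))
        if persist:
            self.cache[alias] = result
        return result


def group_by_ip(rows):
    """Group (ip, query) rows into {ip: [queries]}."""
    groups = {}
    for ip, query in rows:
        groups.setdefault(ip, []).append(query)
    return groups


if __name__ == "__main__":
    plan = Plan()
    plan.define("A", lambda: [("1.2.3.4", "q1"), ("1.2.3.4", "q2"), ("5.6.7.8", "q3")])
    plan.define("B", group_by_ip, ["A"], persist=True)   # B := group A by ip
    plan.dump("B")                                       # computes A and B
    plan.define("C", lambda b: {ip: len(q) for ip, q in b.items()}, ["B"])
    plan.dump("C")                                       # reuses the persisted B
    print(plan.evaluations)
```

In this model the second dump does not reload A, because B is served from the cache; redefining A would discard B and force a recompute.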



--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research




