Sounds interesting. Pig is geared toward large-scale aggregation
operations, in the style of OLAP.
Regarding the question in your third paragraph, do you mean:
a) there are several interrelated aggregation expressions that you
want evaluated in just one pass over the data, or
b) you do some initial aggregation, display it to the user, who can
do "drill-down" operations in the GUI which require you to look up
more data in the backend
?
For (a), yes, Pig can do that, although currently you have to encode
it explicitly as a single Pig program (in future versions, we might
be able to take multiple related Pig programs and execute them in a
joint fashion). For (b), we don't currently have a mechanism to do
that without reloading the data, although perhaps the operating
system's file cache would help with that, under the covers, if the
file partitions fit in memory and don't get evicted.
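
To make case (a) concrete, here is a minimal sketch in plain Python (not
Pig Latin, and not how Pig implements it) of what "several related
aggregations in one pass" means. The record layout (region, product,
amount) and the function name aggregate_one_pass are assumptions made
up for illustration only.

```python
from collections import defaultdict


def aggregate_one_pass(rows):
    """Compute several related aggregates while reading the data once.

    `rows` is any iterable of (region, product, amount) tuples; this
    layout is a hypothetical example, not something Pig prescribes.
    """
    sum_by_region = defaultdict(float)   # total amount per region
    sum_by_product = defaultdict(float)  # total amount per product
    count_by_region = defaultdict(int)   # number of records per region
    grand_total = 0.0

    # Single scan: every aggregate is updated from the same record,
    # so the input never has to be re-read.
    for region, product, amount in rows:
        sum_by_region[region] += amount
        sum_by_product[product] += amount
        count_by_region[region] += 1
        grand_total += amount

    return (dict(sum_by_region), dict(sum_by_product),
            dict(count_by_region), grand_total)
```

Because `rows` can be a generator that streams records from a file, the
data is scanned once regardless of how many aggregates are added to the
loop body. Encoding everything as a single Pig program achieves the
analogous effect at scale.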
-Chris
On Nov 20, 2007, at 1:47 AM, Alexandru Toth wrote:
Hi,
I am developing an Open Source OLAP application called "Cubulus". The
code is at http://sourceforge.net/projects/cubulus/ , a brief
presentation at http://cubulus.sourceforge.net/ , and an
online demo at: http://alxtoth.webfactional.com
It would be interesting to use Pig instead of relational databases
as the backend.
The question is: can Pig scripts work in such a manner that the file is
loaded only once, and then subsequent web requests process over and
over the same file? This becomes relevant if the data file is large,
and there is one datafile to process (or a few datafiles). In fact, is
repeated loading a problem at all :-) ?
-Alex
--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research