pig-user  

Re: How do I use DB instead of file, and is there a stand-alone (no hadoop) mode?

pi song
Mon, 07 Apr 2008 08:03:13 -0700

1. I think we can refactor the load function a bit to make loading from a DB
less hacky. However, how to make it really usable in the world of large data
sets is the truly challenging question.
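
To make the scaling concern concrete, here is a minimal sketch of one way a
DB-backed loader could divide its input: cut the table's key range into
independent chunks so that each map task queries only its own slice. This is
not Pig or Hadoop API. The `logs` table, its integer `id` key, and both
function names are assumptions made up for illustration, and sqlite3 merely
stands in for a real DB.

```python
import sqlite3

def key_range_splits(min_id, max_id, num_splits):
    """Cut the inclusive key range [min_id, max_id] into at most
    num_splits half-open (lo, hi) ranges of roughly equal size."""
    total = max_id - min_id + 1
    if total <= 0 or num_splits <= 0:
        return []
    size = -(-total // num_splits)  # ceiling division
    return [(lo, min(lo + size, max_id + 1))
            for lo in range(min_id, max_id + 1, size)]

def read_split(conn, lo, hi):
    """Read the rows of one split; one map task would run one such query.
    The table and column names are hypothetical."""
    cur = conn.execute(
        "SELECT id, line FROM logs WHERE id >= ? AND id < ? ORDER BY id",
        (lo, hi),
    )
    return cur.fetchall()
```

Even with splits like these, every task still queries the same DB, so the
bottleneck is spread out rather than removed.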

2. Casper's comment, "Logfiles from S3 is already delayed apx. 2 hours. so I
really have no pressure.", reminds me of stream processing again. I used to
say that stream processing is real-time while MapReduce is batch. I've now
realized that we don't have to be strictly real-time. If, say, we process
sliding windows every 2 hours, we can still apply some stream concepts to
real-world applications.
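
A minimal sketch of the windowing idea, assuming events carry an integer
timestamp and windows start at multiples of the slide interval. The function
names and the (timestamp, payload) event shape are hypothetical and not part
of Pig. Each batch run would then process whichever windows have closed.

```python
from collections import defaultdict

def window_starts(ts, size, slide):
    """Return the start times of all windows [s, s + size) that contain ts,
    where window starts are multiples of slide."""
    starts = []
    s = ts - ts % slide          # latest window start at or before ts
    while s > ts - size:         # window still covers ts
        starts.append(s)
        s -= slide
    return sorted(starts)

def group_by_window(events, size, slide):
    """Assign each (ts, payload) event to every window that contains it."""
    groups = defaultdict(list)
    for ts, payload in events:
        for start in window_starts(ts, size, slide):
            groups[start].append(payload)
    return dict(groups)
```

With timestamps in hours, size=4 and slide=2, an event at hour 5 falls into
the windows starting at hours 2 and 4, so a job run every 2 hours sees each
event twice, once per overlapping window.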

Pi

On Tue, Apr 8, 2008 at 12:33 AM, Chris Olston <[EMAIL PROTECTED]> wrote:

> I think it should be possible to write a Pig "load function" that reads
> from a DB, although it may be somewhat hacky (i.e., it will be passed a fake
> HDFS file, but ignore HDFS and connect to the DB instead). My colleagues who
> are closer to the code these days should be able to provide more details.
>
> -Chris
>
>
>
> On Apr 6, 2008, at 8:27 PM, pi song wrote:
>
> > Spencer,
> >
> > That's right. At the moment, Pig cannot load data from a DB. However, this
> > is a feature that I'm after too. You can have a look at LOLoad and POLoad;
> > you may have to refactor them a bit. In the current implementation Pig only
> > reads input from HDFS (if you specify a local file as input, the file will
> > be copied to HDFS before the process kicks off). I'm thinking about two
> > solutions:
> >
> > 1) Stage your data into a file in HDFS. This sounds inefficient (but
> > practical).
> > 2) Have a special input splitter and record reader that read data from
> > the DB. This seems efficient, but if your dataset is too large your DB
> > will become the bottleneck and drag the whole system down (this breaks
> > the "Hadoop MapReduce runs on top of a reliable file system" semantics).
> >
> > Regarding test and dev, you can now run Pig in local execution mode and
> > local Hadoop mode (which creates Hadoop processes on the local machine).
> >
> > Cheers,
> > Pi
> >
> > On 4/7/08, Spencer Proffit <[EMAIL PROTECTED]> wrote:
> >
> > >
> > > Pig comes very close to meeting my needs, however I need to be able to
> > > load from DB.  I noticed in Jira there was a patch to remove
> > > dependency on file names from the loader.  I'd also like it to scale
> > > down so I can test and dev without hadoop.  Is that possible?  I'm
> > > also thinking about using it with Amazon s3 (disclosure: I work for
> > > Amazon).  If these changes are within the scope of where pig wants to
> > > go (it eats everything, right), I'd be happy to work on the changes
> > > myself.
> > >
> > > Spencer
> > >
> > >
> --
> Christopher Olston, Ph.D.
> Sr. Research Scientist
> Yahoo! Research
>
>
>