pig-user  

Re: How do I use DB instead of file, and is there a stand-alone (no hadoop) mode?

Chris Olston
Mon, 07 Apr 2008 07:33:42 -0700

I think it should be possible to write a Pig "load function" that reads from a DB, although it may be somewhat hacky (i.e., it will be passed a fake HDFS file, but ignore HDFS and connect to the DB instead). My colleagues who are closer to the code these days should be able to provide more details.

-Chris


On Apr 6, 2008, at 8:27 PM, pi song wrote:

Spencer,

That's right. At the moment, Pig cannot load data from a DB. However, this is a feature that I'm after as well. You can have a look at LOLoad and POLoad; you may have to refactor them a bit. In the current implementation, Pig only reads input from HDFS (if you specify a local file as input, the file will be copied to HDFS before the process kicks off). I'm thinking about two solutions:

1) Stage your data into a file in HDFS. This sounds inefficient (but practical).
2) Have a special input splitter and record reader that read data from the DB. This seems efficient, but if your dataset is too large, your DB will become the bottleneck and drag the whole system down (this breaks the assumption that "Hadoop MapReduce runs on top of a reliable file system").
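To make option 2 a bit more concrete, here is a minimal sketch of the "input splitter" half, under stated assumptions. It is not Pig or Hadoop code: the class DbSplitPlanner, its methods, and the table and column names are all hypothetical, and the LIMIT/OFFSET syntax assumes a MySQL- or PostgreSQL-style dialect. It only shows how a table of known row count could be cut into disjoint row ranges, one per map task, each with its own query for a record reader to run.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: plans row-range splits over a DB table, one slice per map task. */
class DbSplitPlanner {

    /** One slice of the table: rows [offset, offset + length). */
    static final class Split {
        final long offset;
        final long length;

        Split(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }

        /** Builds the query a record reader could run for this slice (assumed SQL dialect). */
        String toQuery(String table, String orderColumn) {
            // ORDER BY keeps slices disjoint and stable across separate queries.
            return "SELECT * FROM " + table + " ORDER BY " + orderColumn
                    + " LIMIT " + length + " OFFSET " + offset;
        }
    }

    /** Divides rowCount rows into at most numSplits contiguous, non-empty slices. */
    static List<Split> plan(long rowCount, int numSplits) {
        if (rowCount < 0 || numSplits <= 0) {
            throw new IllegalArgumentException("rowCount must be >= 0 and numSplits > 0");
        }
        List<Split> splits = new ArrayList<>();
        long base = rowCount / numSplits;   // minimum rows per slice
        long extra = rowCount % numSplits;  // the first 'extra' slices get one more row
        long offset = 0;
        for (int i = 0; i < numSplits && offset < rowCount; i++) {
            long length = base + (i < extra ? 1 : 0);
            splits.add(new Split(offset, length));
            offset += length;
        }
        return splits;
    }

    public static void main(String[] args) {
        // Example: 10 rows over 3 map tasks; "events" and "id" are made-up names.
        for (Split s : plan(10, 3)) {
            System.out.println(s.toQuery("events", "id"));
        }
    }
}
```

Note that every slice is still a query against the same database, so this does nothing about the bottleneck concern raised in option 2; it only spreads the reads across tasks.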

Regarding test and dev, you can now run Pig in local execution mode or in
local Hadoop mode (which creates Hadoop processes on the local machine).

Cheers,
Pi

On 4/7/08, Spencer Proffit <[EMAIL PROTECTED]> wrote:

Pig comes very close to meeting my needs; however, I need to be able to
load from a DB.  I noticed in Jira there was a patch to remove the
dependency on file names from the loader.  I'd also like it to scale
down so I can test and dev without Hadoop.  Is that possible?  I'm
also thinking about using it with Amazon S3 (disclosure: I work for
Amazon).  If these changes are within the scope of where Pig wants to
go (it eats everything, right?), I'd be happy to work on the changes
myself.

Spencer


--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research