pi song
Sun, 06 Apr 2008 20:28:15 -0700
Spencer,

That's right. At the moment, Pig cannot load data from a DB. However, this is a feature that I'm after too. You can have a look at LOLoad and POLoad; you may have to refactor them a bit.

In the current implementation, Pig only reads input from HDFS (if you specify a local file as input, the file will be copied to HDFS before the process kicks off).

I'm thinking about two solutions:

1) Stage your data into a file in HDFS. This sounds inefficient (but practical).
2) Have a special input splitter and record reader that read data from the DB. This seems efficient, but if your dataset is too large your DB will become the bottleneck and drag the whole system down (this breaks the assumption that "Hadoop MapReduce runs on top of a reliable file system").

Regarding test and dev, you can now run Pig in local execution mode and in local Hadoop mode (it creates Hadoop processes on the local machine).

Cheers,
Pi

On 4/7/08, Spencer Proffit <[EMAIL PROTECTED]> wrote:
>
> Pig comes very close to meeting my needs, however I need to be able to
> load from DB. I noticed in Jira there was a patch to remove
> dependency on file names from the loader. I'd also like it to scale
> down so I can test and dev without hadoop. Is that possible? I'm
> also thinking about using it with Amazon s3 (disclosure: I work for
> Amazon). If these changes are within the scope of where pig wants to
> go (it eats everything, right), I'd be happy to work on the changes
> myself.
>
> Spencer
>
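
For what it's worth, option 1 in the reply above (staging DB data into a file) can be sketched roughly as follows. This is only an illustration under assumptions, not anything that ships with Pig: the function names `to_field` and `stage_query_to_tsv`, the `users` table, and the use of SQLite as a stand-in database are all hypothetical. It dumps a query result to a tab-delimited text file, one row per line, which is the kind of flat file Pig can then read as ordinary input.

```python
import os
import sqlite3
import tempfile


def to_field(value):
    """Render one DB value as a text field (hypothetical helper)."""
    # NULL becomes an empty field. Tabs and newlines inside a value are
    # replaced with spaces so they cannot break the row/column layout.
    if value is None:
        return ""
    return str(value).replace("\t", " ").replace("\n", " ")


def stage_query_to_tsv(conn, query, out_path):
    """Write the rows returned by `query` to `out_path`, tab-delimited.

    `conn` is any DB-API connection; SQLite is used here only as a
    stand-in for a real database. Returns the number of rows written.
    """
    cursor = conn.cursor()
    cursor.execute(query)
    rows_written = 0
    with open(out_path, "w", encoding="utf-8", newline="\n") as out:
        for row in cursor:
            out.write("\t".join(to_field(v) for v in row) + "\n")
            rows_written += 1
    return rows_written


if __name__ == "__main__":
    # Demo with an in-memory database and a hypothetical `users` table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)",
                     [(1, "alice"), (2, "bob")])
    path = os.path.join(tempfile.mkdtemp(), "users.tsv")
    n = stage_query_to_tsv(conn, "SELECT id, name FROM users ORDER BY id", path)
    print("staged", n, "rows to", path)
```

A file staged this way could then be loaded in Pig with something like `LOAD 'users.tsv' USING PigStorage()`, since tab is PigStorage's default delimiter. For a large table the dump itself is the slow part, which is the inefficiency mentioned under option 1.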