Casper Rasmussen
Mon, 07 Apr 2008 04:44:28 -0700
Sounds really interesting. I'm actually doing some S3 and SimpleDB operations at the moment using Pig, so I'm really curious to hear from others facing similar challenges.

Basically, what I'm doing is processing the logfiles generated by S3, and while processing I hit SimpleDB for additional data. I seriously considered loading the logfiles with a special Pig loader, but in the end I settled on approach (1), a simple move (S3 to HDFS) before processing.

The reasons:
1) Listing buckets has limitations.
2) Logfiles from S3 are already delayed by approx. 2 hours, so I'm really under no time pressure.
3) The result of the processing goes back to S3, and with a price tag on each GET and POST, I'd rather process bigger chunks and update less frequently.
4) Why impose complexity when a simple move does the job controllably and perfectly well?

Please keep posting if you have second thoughts about this, and of course if you succeed in extending Pig with some handy Amazon tools :-)

Br
Casper

On Mon, Apr 7, 2008 at 5:27 AM, pi song <[EMAIL PROTECTED]> wrote:

> Spencer,
>
> That's right. At the moment, Pig cannot load data from DB. However, this is
> a feature that I'm after as well. You can have a look at LOLoad and POLoad
> and may have to refactor them a bit. In the current implementation Pig only
> reads input from HDFS (if you specify a local file as input, the file will
> be copied to HDFS before the process kicks off). I'm thinking about two
> solutions:
>
> 1) Stage your data into a file in HDFS. This sounds inefficient (but
> practical).
> 2) Have a special input splitter and record reader that read data from DB.
> This seems to be efficient, but if your dataset is too large your DB will
> become the bottleneck and drag the whole system down (this breaks the
> semantic "Hadoop MapReduce runs on top of a reliable file system").
>
> Regarding test and dev, now you can run Pig in local execution mode and
> local Hadoop mode (it creates Hadoop processes on the local machine).
>
> Cheers,
> Pi
>
> On 4/7/08, Spencer Proffit <[EMAIL PROTECTED]> wrote:
> >
> > Pig comes very close to meeting my needs, however I need to be able to
> > load from DB. I noticed in Jira there was a patch to remove
> > dependency on file names from the loader. I'd also like it to scale
> > down so I can test and dev without hadoop. Is that possible? I'm
> > also thinking about using it with Amazon s3 (disclosure: I work for
> > Amazon). If these changes are within the scope of where pig wants to
> > go (it eats everything, right), I'd be happy to work on the changes
> > myself.
> >
> > Spencer
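
For anyone weighing the "bigger chunks, fewer requests" argument made at the top of this thread, here is a minimal Python sketch of the batching idea. It is not code from anyone in the thread: the function name `batch_by_size`, the 1000-byte threshold, and the log keys and sizes are all hypothetical, and it assumes each batch would result in a single write back to S3.

```python
from typing import List, Tuple


def batch_by_size(files: List[Tuple[str, int]], min_batch_bytes: int) -> List[List[str]]:
    """Group (key, size_in_bytes) pairs into batches of at least min_batch_bytes.

    The final batch may be smaller if the input runs out first.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0
    for key, size in files:
        current.append(key)
        current_bytes += size
        # Close the batch once it has reached the minimum size.
        if current_bytes >= min_batch_bytes:
            batches.append(current)
            current = []
            current_bytes = 0
    if current:
        # Leftover files that never reached the threshold.
        batches.append(current)
    return batches


# Hypothetical staged logfiles: (key, size in bytes).
logs = [("log-0", 400), ("log-1", 700), ("log-2", 300), ("log-3", 900), ("log-4", 100)]
batches = batch_by_size(logs, min_batch_bytes=1000)

# Assumption: one result write per batch instead of one per logfile.
writes_per_file = len(logs)
writes_batched = len(batches)
```

With these made-up sizes, five logfiles collapse into three batches, so three result writes instead of five. The real saving depends on how large the chunks are allowed to grow before processing.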