Thanks Ankur, I just opened a JIRA: https://issues.apache.org/jira/browse/PIG-1174

On Thu, Dec 24, 2009 at 2:18 AM, Ankur C. Goel <gan...@yahoo-inc.com> wrote:

> (Coming in late on this)
>
> Bill,
> Please feel free to open a JIRA to report the issue. The problem is
> that by the time MultiStorage gets into action, pig already creates the
> output file based on the path (after INTO) and assumes that UDF will start
> writing to it. Ideally the decision to create files/dir should be left
> completely to the custom STORE UDF. Also MultiStorage should take care of
> getting output path from UDF context so that user does not need to pass it
> again. Lastly passing field name instead of field number to use as dynamic
> key would be clearer in the script.
>
> I am hoping the changes would be a lot easier to do after the current
> Load/Store redesign is implemented.
>
> -...@nkur
>
>
> 12/16/09 10:24 PM, "Bill Graham" <billgra...@gmail.com> wrote:
>
> Thanks Dmitriy, this is exactly what I need.
>
> There was one bug I ran into though FYI, which is when making a request
> like this, as documented in the JavaDocs:
>
> STORE A INTO '/my/home/output' USING
>     MultiStorage('/my/home/output','0', 'none', '\t');
>
> Pig would create a file '/my/home/output' and then an exception would be
> thrown when MultiStorage tried to make a directory under '/my/home/output'.
> The workaround that worked for me was to instead specify a dummy location
> as the first path like so:
>
> STORE A INTO '/my/home/output/temp' USING
>     MultiStorage('/my/home/output','0', 'none', '\t');
>
>
> On Tue, Dec 15, 2009 at 1:06 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
>
> > Bill,
> > A custom storefunc should do the trick. See
> > https://issues.apache.org/jira/browse/PIG-958 (aka
> > piggybank.storage.MultiStorage) for a jumping-off point.
> >
> > -D
> >
> > On Tue, Dec 15, 2009 at 1:59 PM, Bill Graham <billgra...@gmail.com> wrote:
> > > Hi,
> > >
> > > I'm pretty sure the answer to my question is no, but I have to ask. Is it
> > > possible within Pig to store different groups of data into different output
> > > files where the grouping is dynamic (i.e. not known ahead of time)? Here's
> > > what I'm trying to do...
> > >
> > > I've got a script that reads log files of URLs and generates counts for a
> > > given time period. The urls might have a 'tag' querystring param though, and
> > > in that case I want to get the most popular urls for each tag output to it's
> > > own file.
> > >
> > > My data looks like this and is ordered by tag asc, count desc:
> > >
> > > [tag] [timeinterval] [url] [count]
> > >
> > > I need to do something like so:
> > >
> > > for each tag group found
> > >   store all records in file foo_[tag].txt
> > >
> > > I ultimately need these files on local disk and I'm looking for a better way
> > > to do so than generating a file of N unique tags in HDFS, reading it from
> > > Java, submitting N jobs with the tag name substituted into a script file,
> > > followed by N copyToLocal calls.
> > >
> > > At least two possible solutions come to mind, but am curious if there's
> > > another that I'm overlooking:
> > > 1. In java submit pig dynamic commands to an instance of PigServer. I'd
> > >    still need a unique tag file for this case.
> > > 2. Maybe with a custom store function??
> > >
> > > thanks,
> > > Bill
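
For reference, the "for each tag group found, store all records in file
foo_[tag].txt" step from the original question can be sketched outside Pig in
a few lines of Python. This is only an illustration of the intended output
layout, not how MultiStorage is implemented. The function name split_by_key,
its parameters, and the sample rows are made up for the example, and it
assumes tab-delimited records whose key values are safe to use in file names.

```python
import os
import tempfile
from contextlib import ExitStack


def split_by_key(records, out_dir, key_index=0, delimiter="\t"):
    """Write each record to out_dir/foo_<key>.txt, keyed on one field.

    Hypothetical helper: returns a dict mapping each key to the number
    of records written for it.
    """
    os.makedirs(out_dir, exist_ok=True)
    counts = {}
    handles = {}
    # ExitStack closes every per-key file when the block exits.
    with ExitStack() as stack:
        for record in records:
            fields = record.rstrip("\n").split(delimiter)
            key = fields[key_index]
            if key not in handles:
                # One output file per distinct key, opened on first sight.
                path = os.path.join(out_dir, "foo_%s.txt" % key)
                handles[key] = stack.enter_context(open(path, "w"))
                counts[key] = 0
            handles[key].write(delimiter.join(fields) + "\n")
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # Made-up rows in the [tag] [timeinterval] [url] [count] layout.
    sample = [
        "news\t2009-12-15\thttp://example.com/a\t12",
        "news\t2009-12-15\thttp://example.com/b\t7",
        "sports\t2009-12-15\thttp://example.com/c\t9",
    ]
    with tempfile.TemporaryDirectory() as out_dir:
        print(split_by_key(sample, out_dir))
        print(sorted(os.listdir(out_dir)))
```

The key is chosen by field position (key_index), which mirrors the '0'
argument passed to MultiStorage in the STORE statements quoted above. Keeping
one open handle per key means the input does not have to be sorted by tag,
at the cost of one open file per distinct tag.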