(Coming in late on this)

Bill,
     Please feel free to open a JIRA to report the issue. The problem is that
by the time MultiStorage gets into action, Pig has already created the output
file based on the path (after INTO) and assumes that the UDF will start writing
to it. Ideally the decision to create files/dirs should be left completely to
the custom STORE UDF. Also, MultiStorage should take care of getting the output
path from the UDF context so that the user does not need to pass it again.
Lastly, passing a field name instead of a field number to use as the dynamic key
would be clearer in the script.

I am hoping the changes will be a lot easier to make once the current
Load/Store redesign is implemented.

-...@nkur

On 12/16/09 10:24 PM, "Bill Graham" <billgra...@gmail.com> wrote:

Thanks Dmitriy, this is exactly what I need.

FYI, I did run into one bug, which occurs when making a request like this,
as documented in the Javadocs:

STORE A INTO '/my/home/output' USING MultiStorage('/my/home/output','0',
'none', '\t');

Pig would create a file '/my/home/output' and then an exception would be
thrown when MultiStorage tried to make a directory under '/my/home/output'.
The workaround that worked for me was to instead specify a dummy location as
the first path like so:

STORE A INTO '/my/home/output/temp' USING
MultiStorage('/my/home/output','0', 'none', '\t');
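
The failure described above can be reproduced outside Pig. The following is a minimal plain-Java sketch on the local filesystem, not the piggybank MultiStorage code and not HDFS; the class name MultiOutputSketch, the method writeByKey, the part-0 file name, and the sample records are all hypothetical. It splits tab-separated records into one directory per key and shows that the same call fails once a regular file already exists at the base path.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is NOT piggybank's MultiStorage.
class MultiOutputSketch {

    // Appends each tab-separated record to <base>/<key>/part-0, where <key>
    // is the value of the field at position keyIndex.
    // Assumes key values are safe to use as directory names.
    public static void writeByKey(Path base, List<String> records, int keyIndex)
            throws IOException {
        for (String record : records) {
            String key = record.split("\t", -1)[keyIndex];
            Path dir = base.resolve(key);
            // Throws an IOException if 'base' already exists as a regular file.
            Files.createDirectories(dir);
            Files.write(dir.resolve("part-0"),
                    (record + "\n").getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample data in the [tag] [timeinterval] [url] [count] layout.
        List<String> records = Arrays.asList(
                "news\t2009-12-15\thttp://example.com/a\t10",
                "sports\t2009-12-15\thttp://example.com/b\t7",
                "news\t2009-12-15\thttp://example.com/c\t3");

        // Case 1: nothing exists at the base path, so one directory per key
        // is created underneath it.
        Path base = Files.createTempDirectory("multi").resolve("output");
        writeByKey(base, records, 0);
        System.out.println("news records: "
                + Files.readAllLines(base.resolve("news").resolve("part-0")).size());

        // Case 2: a regular file was created at the base path first, so
        // creating a directory under it fails.
        Path clash = Files.createTempDirectory("clash").resolve("output");
        Files.createFile(clash);
        try {
            writeByKey(clash, records, 0);
        } catch (IOException e) {
            System.out.println("failed: " + e.getClass().getSimpleName());
        }
    }
}
```

Pointing the INTO path at a dummy child location, as in the workaround, avoids the second case because nothing is pre-created at the directory the store function writes under.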


On Tue, Dec 15, 2009 at 1:06 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:

> Bill,
> A custom storefunc should do the trick. See
> https://issues.apache.org/jira/browse/PIG-958  (aka
> piggybank.storage.MultiStorage) for a jumping-off point.
>
> -D
>
> On Tue, Dec 15, 2009 at 1:59 PM, Bill Graham <billgra...@gmail.com> wrote:
> > Hi,
> >
> > I'm pretty sure the answer to my question is no, but I have to ask. Is it
> > possible within Pig to store different groups of data into different output
> > files where the grouping is dynamic (i.e. not known ahead of time)? Here's
> > what I'm trying to do...
> >
> > I've got a script that reads log files of URLs and generates counts for a
> > given time period. The URLs might have a 'tag' querystring param though, and
> > in that case I want to get the most popular URLs for each tag output to its
> > own file.
> >
> > My data looks like this and is ordered by tag asc, count desc:
> >
> > [tag] [timeinterval] [url] [count]
> >
> > I need to do something like so:
> >
> > for each tag group found
> >  store all records in file foo_[tag].txt
> >
> > I ultimately need these files on local disk and I'm looking for a better way
> > to do so than generating a file of N unique tags in HDFS, reading it from
> > Java, submitting N jobs with the tag name substituted into a script file,
> > followed by N copyToLocal calls.
> >
> > At least two possible solutions come to mind, but I'm curious if there's
> > another that I'm overlooking:
> > 1. In Java, submit dynamic Pig commands to an instance of PigServer. I'd
> > still need a unique tag file for this case.
> > 2. Maybe with a custom store function?
> >
> > thanks,
> > Bill
> >
>
