Re: MultipleOutputs in Crunch

Gabriel Reid Wed, 07 Aug 2013 11:20:14 -0700

If your data is going through a reducer, there's support for something like
this built into Crunch, although it's not (yet) very developer-friendly.

If you have a custom Partitioner that maps each key to a pre-determined
partition id, you can implement a custom FileNamingScheme[1] and have it
map each output partition to a fixed filename that represents the content
under that key. I believe most (or all) Target implementations can be
instantiated with a FileNamingScheme object.
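
To make the pairing concrete, here is a rough stdlib-only sketch of the two
lookups involved. It deliberately does not touch the Crunch or Hadoop APIs:
the class name KeyedOutputNames, its methods, and the "<key>-r-<id>" filename
pattern are all made up for illustration. In a real pipeline the first lookup
would sit inside your Partitioner and the second inside your FileNamingScheme
implementation, and I'm assuming the job runs with exactly one reducer per
key so that every partition holds a single key.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/**
 * Hypothetical helper: a fixed list of keys, where a key's index in the list
 * is its partition id. Not part of Crunch or Hadoop.
 */
final class KeyedOutputNames {
    private final List<String> keys;

    KeyedOutputNames(List<String> keys) {
        // Defensive copy so the key-to-partition assignment cannot change later.
        this.keys = Collections.unmodifiableList(new ArrayList<>(keys));
    }

    /** Number of partitions (assumed equal to the number of reducers). */
    int numPartitions() {
        return keys.size();
    }

    /** Partitioner side: the pre-determined partition id for a key. */
    int partitionFor(String key) {
        int id = keys.indexOf(key);
        if (id < 0) {
            throw new IllegalArgumentException("No partition assigned to key: " + key);
        }
        return id;
    }

    /** Naming side: the output filename for a partition id (pattern is an assumption). */
    String fileNameFor(int partitionId) {
        if (partitionId < 0 || partitionId >= keys.size()) {
            throw new IllegalArgumentException("Unknown partition id: " + partitionId);
        }
        return String.format(Locale.ROOT, "%s-r-%05d", keys.get(partitionId), partitionId);
    }
}
```

The point of keeping both lookups on one shared table is that the Partitioner
and the naming scheme cannot drift apart: whatever id a key is sent to is the
same id the filename is derived from.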

- Gabriel

[1]
http://crunch.apache.org/apidocs/0.7.0/org/apache/crunch/io/FileNamingScheme.html

On Wed, Aug 7, 2013 at 3:04 PM, Micah Whitacre <[email protected]> wrote:

> I believe you could accomplish this by creating PCollections for each of
> the key/values you want to persist and then writing[1] the PCollections out
> to whichever directories make the most sense.
>
> [1]
> http://crunch.apache.org/apidocs/0.7.0/org/apache/crunch/Pipeline.html#write(org.apache.crunch.PCollection, org.apache.crunch.Target)
>
>
> On Wed, Aug 7, 2013 at 3:31 AM, Mridul Das <[email protected]> wrote:
>
> > Hi,
> >    MultipleOutputs enables us to generate custom file names based on
> > keys/values.
> >    How do we achieve this in Crunch?
> >
> > Regards,
> > Mridul
> >
>
