[jira] [Commented] (CRUNCH-306) MultipleOutput Targets

Micah Whitacre (JIRA) Tue, 10 Dec 2013 19:07:30 -0800

    [ 
https://issues.apache.org/jira/browse/CRUNCH-306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845026#comment-13845026
 ]


Micah Whitacre commented on CRUNCH-306:
---------------------------------------

I think my/Bryan's use case is slightly different from Jeremy's in that we 
don't expect the files to be named "key.avro"; instead we were thinking of 
/<basePath>/<some key derived path>/part-*-*.avro.  This would eliminate the 
thread contention if a key existed in multiple partitions.
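
To make that layout concrete, here is a minimal sketch in plain Java with no 
Crunch API involved. The class name, the outputPath helper, and the exact 
part-<m|r>-<partition>.avro naming are assumptions for illustration only, not 
what the attached patches do. The point is just that two partitions holding the 
same key end up writing distinct files under the same key directory.

```java
import java.util.Locale;

final class KeyPathSketch {
    // Hypothetical helper: builds <basePath>/<keyPath>/part-<type>-<partition>.avro.
    // "m" marks a map-side writer and "r" a reduce-side writer (assumed naming).
    static String outputPath(String basePath, String keyPath, boolean mapOnly, int partition) {
        String type = mapOnly ? "m" : "r";
        return String.format(Locale.ROOT, "%s/%s/part-%s-%05d.avro",
                basePath, keyPath, type, partition);
    }

    public static void main(String[] args) {
        // The same key seen in two different partitions maps to two different files,
        // so neither writer has to share a file with the other.
        System.out.println(outputPath("/data/out", "2013/12/10", false, 0));
        System.out.println(outputPath("/data/out", "2013/12/10", false, 1));
    }
}
```

Because both files sit under the same directory, a reader that accepts a 
directory can still pick up everything written for that key.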

Jeremy, would that work for you?  Since the AvroFileSource would support reading 
from a directory, you could still consume it in a similar fashion without it 
being a single file.

Looking at the AvroFilePerKeyTarget/AvroFilePerKeyOutputFormat, should we also 
document the hint that sorting by key would improve performance (less opening 
and closing of files)?  I'd think most will be doing a GBK to ensure a single 
partition and would then get this naturally as part of the ungroup(), but this 
wouldn't be the case if they are doing it in a map-only job.
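
As a rough illustration of why the sorting hint matters, the sketch below 
counts how often a writer would be opened for a stream of keys. It assumes the 
output format keeps only one file open at a time and switches files whenever 
the key changes; that behaviour, the class name, and countOpens are assumptions 
for illustration, not taken from the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class WriterSwitchSketch {
    // Counts writer opens under the assumption that a single writer is held open
    // and replaced each time the incoming key differs from the previous one.
    static int countOpens(List<String> keys) {
        int opens = 0;
        String current = null;
        for (String key : keys) {
            if (!key.equals(current)) {
                opens++;          // previous writer closed, new one opened
                current = key;
            }
        }
        return opens;
    }

    public static void main(String[] args) {
        List<String> unsorted = Arrays.asList("a", "b", "a", "b", "a", "b");
        List<String> sorted = new ArrayList<>(unsorted);
        Collections.sort(sorted);

        System.out.println("unsorted opens: " + countOpens(unsorted));
        System.out.println("sorted opens:   " + countOpens(sorted));
    }
}
```

With interleaved keys, as might happen in a map-only job, every record forces 
a switch, while the sorted stream opens one writer per distinct key.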

> MultipleOutput Targets
> ----------------------
>
>                 Key: CRUNCH-306
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-306
>             Project: Crunch
>          Issue Type: New Feature
>          Components: IO
>            Reporter: Josh Wills
>         Attachments: CRUNCH-306.patch, CRUNCH-306b.patch
>
>
> A commonly desired feature for Crunch is the ability to write an output file 
> for each key in a PTable/PGroupedTable containing the values associated with 
> that key. We should find a way to support that one-output-per-key model.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
