[ https://issues.apache.org/jira/browse/CRUNCH-347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13904563#comment-13904563 ]

Jason Gauci commented on CRUNCH-347:
------------------------------------

I guess what the issue is asking for is more granularity on 
crunch.max.reducers.  If I set this configuration parameter to '1', it would 
force a single reducer and thus produce a single output file.  It would be 
nice if I could force one reducer on just the final MapReduce of the job that 
needs to output a single file, without affecting the other MapReduces in the 
pipeline.
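As background for why one reducer implies one file: each reducer writes exactly one part file, so the part-file count tracks the reducer count. A minimal, self-contained sketch of that partitioning behavior (toy code modeling the default hash partitioner, not the actual Crunch/Hadoop API):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of why "number of reducers == number of output files":
// each record's key is hashed into one of numReducers partitions, and
// each reducer emits one part file. Forcing numReducers = 1 therefore
// yields a single output file.
public class ReducerPartitioning {
    static List<List<String>> runReducePhase(List<String> records, int numReducers) {
        List<List<String>> partFiles = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            partFiles.add(new ArrayList<>()); // one output file per reducer
        }
        for (String record : records) {
            // mimic a hash partitioner: key hash modulo reducer count
            int partition = Math.floorMod(record.hashCode(), numReducers);
            partFiles.get(partition).add(record);
        }
        return partFiles;
    }

    public static void main(String[] args) {
        List<String> records = List.of("a", "b", "c", "d");
        System.out.println(runReducePhase(records, 4).size()); // prints 4
        System.out.println(runReducePhase(records, 1).size()); // prints 1
    }
}
```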

Another approach would be a utility function that takes a materialized 
PCollection (which may be spread across many files on HDFS) and merges it 
into one file by running an identity mapper & reducer with the number of 
reducers in that MapReduce set to 1.
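The identity-mapper-plus-one-reducer idea can be sketched in miniature. This toy code only models the data movement (the class and method names are illustrative, not an existing Crunch utility): every record is emitted under the same constant key, so a job with one reducer receives all records and writes them out unchanged as a single file.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the proposed merge utility. "mergeToSingleFile" is a
// hypothetical name, not part of the Crunch API.
public class MergeToSingleFile {
    static List<String> mergeToSingleFile(List<List<String>> inputPartFiles) {
        List<String> singleFile = new ArrayList<>();
        for (List<String> part : inputPartFiles) {
            for (String record : part) {
                // map: emit (0, record); with one reducer, every constant
                // key routes to partition 0
                // reduce: identity -- write the record through unchanged
                singleFile.add(record);
            }
        }
        return singleFile;
    }

    public static void main(String[] args) {
        List<List<String>> parts =
            List.of(List.of("a", "b"), List.of("c"), List.of("d"));
        System.out.println(mergeToSingleFile(parts)); // prints [a, b, c, d]
    }
}
```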


> Allow writing of single file outputs
> ------------------------------------
>
>                 Key: CRUNCH-347
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-347
>             Project: Crunch
>          Issue Type: New Feature
>          Components: IO
>    Affects Versions: 0.9.0
>            Reporter: Jason Gauci
>            Priority: Minor
>
> One of the outputs from our system needs to be a single file to support a 
> system that is ingesting the data downstream.  We currently run the job and 
> then cat the output files together to create the final output, but it would 
> be nice if we could pass a flag to the write(...) function to handle this 
> case.
> Note that setting the number of reducers globally for the entire job doesn't 
> work in this case because of the significant performance implications.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)