[ 
https://issues.apache.org/jira/browse/CRUNCH-543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14963419#comment-14963419
 ] 

Adric Eckstein commented on CRUNCH-543:
---------------------------------------

I see what you mean about having all those writers open, and grouping by key 
is certainly the safest way.  However, skipping the grouping can save a lot of 
time, especially if you have a large amount of data for a single key (which 
would kill all the parallelism).  This led me to try using it on an ungrouped 
PCollection.  Because my keys were not necessarily sorted, though, it was 
constantly opening and closing writers, which I think was leading to some bad 
syncs.

With these changes, writing out without grouping seems to work (and is 
substantially faster for the case mentioned above).  It seems to work well with 
a couple hundred files open simultaneously, but how many files are open at 
once obviously depends on the input data.
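
To illustrate the difference, here is a rough sketch. It is not the code from 
the patch: the class and method names are made up, and the "writer" is just an 
in-memory stand-in for a per-key output file. It compares reopening a writer on 
every key change with keeping one writer open per distinct key.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch only; not Crunch's AvroPathPerKeyTarget code. */
final class PerKeyWriters {

    /** Stand-in for a per-key output file: collects values in memory. */
    static final class KeyWriter implements AutoCloseable {
        final String key;
        final List<String> values = new ArrayList<>();
        private boolean closed = false;

        KeyWriter(String key) {
            this.key = key;
        }

        void write(String value) {
            if (closed) {
                throw new IllegalStateException("writer for " + key + " is closed");
            }
            values.add(value);
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /**
     * One writer at a time: close and reopen whenever the key changes.
     * Each record is a {key, value} pair. Returns how many writers were opened.
     */
    static int opensWhenSwitchingOnKeyChange(List<String[]> records) {
        int opens = 0;
        KeyWriter current = null;
        for (String[] kv : records) {
            if (current == null || !current.key.equals(kv[0])) {
                if (current != null) {
                    current.close();
                }
                current = new KeyWriter(kv[0]);
                opens++;
            }
            current.write(kv[1]);
        }
        if (current != null) {
            current.close();
        }
        return opens;
    }

    /**
     * One open writer per distinct key, all closed at the end.
     * Returns how many writers were opened (and held open simultaneously).
     */
    static int opensWithWriterCache(List<String[]> records) {
        Map<String, KeyWriter> open = new LinkedHashMap<>();
        for (String[] kv : records) {
            open.computeIfAbsent(kv[0], KeyWriter::new).write(kv[1]);
        }
        int opens = open.size();
        for (KeyWriter writer : open.values()) {
            writer.close();
        }
        return opens;
    }
}
```

For unsorted keys arriving as a, b, a, b, a, the first approach opens five 
writers while the second opens two, at the cost of holding both open until the 
end.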

> AvroPathPerKeyTarget copy nested subdirectories
> -----------------------------------------------
>
>                 Key: CRUNCH-543
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-543
>             Project: Crunch
>          Issue Type: Improvement
>          Components: IO
>            Reporter: Adric Eckstein
>            Assignee: Josh Wills
>             Fix For: 0.13.0
>
>         Attachments: CRUNCH-543.patch, CRUNCH-543b.patch, CRUNCH-543c.patch
>
>
> When using AvroPathPerKeyTarget to write out a subpath in the output 
> directory using a String key, the key might indicate multiple subfolders:
>
>   Pair<String, String> kv = new Pair<String, String>("foo/bar", "value");
>   PTable<String, String> kvs = pipeline.create(Arrays.asList(kv),
>       Avros.tableOf(Avros.strings(), Avros.strings()));
>   PTables.asPTable(kvs).write(new AvroPathPerKeyTarget("output"));
>
> This throws the error:
>
>   java.io.IOException: java.lang.IllegalArgumentException: Reducer output name 'bar' cannot be parsed
>       at org.apache.crunch.impl.mr.exec.CrunchJobHooks$CompletionHook.handleMultiPaths(CrunchJobHooks.java:92)
>   ...
>
> In AvroPathPerKeyTarget the handleOutputs method would need to recursively 
> copy subfolders (it currently checks only the first level of the output 
> directory) to support keys that define multiple subfolders.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
