[ https://issues.apache.org/jira/browse/CRUNCH-642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xavier updated CRUNCH-642:
--------------------------
    Attachment: CRUNCH-642-Enable-GroupingOptions-for-Distinct-operations.patch

Hey [~joshwills],

I noticed that my change introduces a major bug when running the distinct operation with a non-memory PCollection. My apologies for this mistake.

Attached is an additional patch that fixes this by passing along the GroupingOptions object instead of a numReducers integer. This is more flexible and should keep bugs like this one from popping up. I also added tests (both unit and integration tests) to verify that the fix works.

> Enable numReducers option for methods in Distinct
> -------------------------------------------------
>
>                 Key: CRUNCH-642
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-642
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.14.0
>            Reporter: Xavier
>            Assignee: Josh Wills
>            Priority: Trivial
>         Attachments: CRUNCH-642-Enable-GroupingOptions-for-Distinct-operations.patch, CRUNCH-642.patch
>
>
> The {{groupByKey}} invocation in the {{Distinct}} class currently uses the
> default (recommended) number of reducers without providing an option to
> override this:
> {code}
> public static <S> PCollection<S> distinct(PCollection<S> input, int flushEvery) {
>   Preconditions.checkArgument(flushEvery > 0);
>   PType<S> pt = input.getPType();
>   PTypeFamily ptf = pt.getFamily();
>   return input
>       .parallelDo("pre-distinct", new PreDistinctFn<S>(flushEvery, pt), ptf.tableOf(pt, ptf.nulls()))
>       .groupByKey()
>       .parallelDo("post-distinct", new PostDistinctFn<S>(), pt);
> }
> {code}
> Would it be possible to enhance this method so that the number of reducers
> can be customized, either explicitly or via a {{GroupingOptions}} object?

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)