[ https://issues.apache.org/jira/browse/CRUNCH-601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15425337#comment-15425337 ]
Mikael Goldmann commented on CRUNCH-601: ---------------------------------------- This (on top of the patch) passes the test even for length zero (but there may of course be other corner cases as long as one trusts getSize() == 0). {code} // In crunch-core/src/main/java/org/apache/crunch/lib/Aggregate.java public static <S> PObject<Long> length(PCollection<S> collect) { PTypeFamily tf = collect.getTypeFamily(); PTable<Integer, Long> countTable = collect .parallelDo("Aggregate.count", new MapFn<S, Pair<Integer, Long>>() { public Pair<Integer, Long> map(S input) { return Pair.of(1, 1L); } public void cleanup(Emitter<Pair<Integer, Long>> e) { e.emit(Pair.of(1, 0L)); } }, tf.tableOf(tf.ints(), tf.longs())) .groupByKey(GroupingOptions.builder().numReducers(1).build()) .combineValues(Aggregators.SUM_LONGS()); PCollection<Long> count = countTable.values(); final FirstElementPObject<Long> first = new FirstElementPObject<>(count); return new PObject<Long>() { @Override public Long getValue() { final Long value = first.getValue(); return value == null ? 0 : value; } }; } {code} > Short PCollections in SparkPipeline get length null. > ---------------------------------------------------- > > Key: CRUNCH-601 > URL: https://issues.apache.org/jira/browse/CRUNCH-601 > Project: Crunch > Issue Type: Bug > Components: Spark > Affects Versions: 0.13.0 > Environment: Running in local mode on Mac as well as in a ubuntu > 14.04 docker container > Reporter: Mikael Goldmann > Priority: Minor > Attachments: CRUNCH-601.patch, SmallCollectionLengthTest.java > > > I'll attach a file with a test that I would expect to pass but which fails. > It creates five PCollection<String> of lengths 0, 1, 2, 3, 4 gets the > lengths, runs the pipeline and prints the lengths. Finally it asserts that > all lengths are non-null. > I would expect it to print lengths 0, 1, 2, 3, 4 and pass. > What it does is print lengths null, null, null, 3, 4 and fail. > I think the underlying reason is the use of getSize() on an unmaterialized > object and assuming that when the estimate that getSize() returns is 0, then > the PCollection is guaranteed to be empty, which is false in some cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)