[ https://issues.apache.org/jira/browse/CRUNCH-601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15429533#comment-15429533 ]
Mikael Goldmann commented on CRUNCH-601:
----------------------------------------

If I understand correctly:
* It is important that p.getSize() > 0 if p is not empty, or processing might be skipped incorrectly.
* Unless p.getSize() == 0 at least sometimes, the branches that skip computation are never taken and could be removed.

So assume that p is empty and p.getSize() == 0. Form

    q = p.parallelDo(dofn);

where process(x, emitter) simply does emitter.emit(x) and there is a cleanup(emitter) that does emitter.emit(something).

Now, q is not empty since it consists of 'something'. It seems like it would be a bug if q.getSize() == 0. However, it seems like the current implementation, even when this patch is applied, would give q.getSize() == 0.

Am I missing something in my assumptions?

> Short PCollections in SparkPipeline get length null.
> ----------------------------------------------------
>
>                 Key: CRUNCH-601
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-601
>             Project: Crunch
>          Issue Type: Bug
>          Components: Spark
>    Affects Versions: 0.13.0
>         Environment: Running in local mode on Mac as well as in an Ubuntu
> 14.04 docker container
>            Reporter: Mikael Goldmann
>            Assignee: Micah Whitacre
>            Priority: Minor
>         Attachments: CRUNCH-601.patch, CRUNCH-601b.patch, CRUNCH-601c.patch,
> SmallCollectionLengthTest.java
>
>
> I'll attach a file with a test that I would expect to pass but which fails.
> It creates five PCollection<String> of lengths 0, 1, 2, 3, 4, gets the
> lengths, runs the pipeline and prints the lengths. Finally it asserts that
> all lengths are non-null.
> I would expect it to print lengths 0, 1, 2, 3, 4 and pass.
> What it does is print lengths null, null, null, 3, 4 and fail.
> I think the underlying reason is the use of getSize() on an unmaterialized
> object and assuming that when the estimate that getSize() returns is 0, then
> the PCollection is guaranteed to be empty, which is false in some cases.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
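
The scenario raised in the comment (an empty p, a dofn whose cleanup emits one value, and a size estimate of 0 for q) can be sketched in plain Java. Everything below is a hypothetical stand-in written for illustration; none of it is Crunch's DoFn, Emitter, or PCollection. In particular, estimateSize assumes that a derived collection's estimate is the parent's estimate multiplied by a scale factor, which is an assumption about the implementation and not something the comment states.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT Crunch classes.
class CleanupSizeSketch {

    /** Receives the values a function emits. */
    interface Emitter<T> {
        void emit(T value);
    }

    /** Minimal per-element function with an end-of-input hook. */
    abstract static class SketchDoFn<S, T> {
        abstract void process(S input, Emitter<T> emitter);

        // Called once after all input has been processed; no-op by default.
        void cleanup(Emitter<T> emitter) {
        }
    }

    /** Passes every element through and emits one extra value in cleanup. */
    static class EmitOnCleanup extends SketchDoFn<String, String> {
        @Override
        void process(String input, Emitter<String> emitter) {
            emitter.emit(input);
        }

        @Override
        void cleanup(Emitter<String> emitter) {
            emitter.emit("something");
        }
    }

    /** Eagerly applies fn to every element of input, then runs cleanup. */
    static <S, T> List<T> parallelDo(List<S> input, SketchDoFn<S, T> fn) {
        List<T> out = new ArrayList<>();
        Emitter<T> emitter = out::add;
        for (S element : input) {
            fn.process(element, emitter);
        }
        fn.cleanup(emitter);
        return out;
    }

    /** ASSUMED estimate rule: parent estimate times a scale factor. */
    static long estimateSize(long parentEstimate, float scaleFactor) {
        return (long) (scaleFactor * parentEstimate);
    }

    public static void main(String[] args) {
        List<String> p = new ArrayList<>();               // p is empty
        List<String> q = parallelDo(p, new EmitOnCleanup());
        long estimate = estimateSize(p.size(), 1.0f);     // derived from p only

        System.out.println("actual size of q: " + q.size());    // prints 1
        System.out.println("estimated size of q: " + estimate); // prints 0
    }
}
```

Under these assumptions, an estimate computed only from the parent cannot see output produced in cleanup, so code that treats an estimate of 0 as "guaranteed empty" would skip q even though it holds one element.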