Mikael Goldmann created CRUNCH-601:
--------------------------------------
Summary: Short PCollections in SparkPipeline get length null.
Key: CRUNCH-601
URL: https://issues.apache.org/jira/browse/CRUNCH-601
Project: Crunch
Issue Type: Bug
Components: Spark
Affects Versions: 0.13.0
Environment: Running in local mode on Mac as well as in an Ubuntu 14.04
Docker container
Reporter: Mikael Goldmann
Priority: Minor
I'll attach a file with a test that I would expect to pass but which fails.
It creates five PCollection<String> instances of lengths 0, 1, 2, 3, and 4,
gets the lengths, runs the pipeline, and prints the lengths. Finally, it
asserts that all lengths are non-null.
I would expect it to print lengths 0, 1, 2, 3, 4 and pass.
What it does is print lengths null, null, null, 3, 4 and fail.
I think the underlying reason is the use of getSize() on an unmaterialized
object, together with the assumption that when the estimate getSize() returns
is 0, the PCollection is guaranteed to be empty, which is false in some cases.
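
To make the suspected failure mode concrete, here is a minimal, self-contained
Java sketch. It does not use Crunch's classes and is not Crunch's actual code.
The class SizeEstimateSketch, its methods, and the 0.4 scaling factor are all
hypothetical, chosen only so that the estimate truncates to 0 for the three
shortest inputs, which mirrors the pattern reported above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the suspected bug; none of this is Crunch's API.
public class SizeEstimateSketch {

    // Assumed size estimate: scales the element count and truncates to a long,
    // so small non-empty inputs round down to 0. The 0.4 factor is made up.
    static long estimatedSize(List<String> items) {
        return (long) (items.size() * 0.4);
    }

    // Faulty shortcut: treats an estimate of 0 as proof that the collection is
    // empty and never produces a count (modelled here as null).
    static Long lengthTrustingEstimate(List<String> items) {
        if (estimatedSize(items) == 0) {
            return null;
        }
        return (long) items.size();
    }

    // Safe variant: always counts the elements, regardless of the estimate.
    static Long lengthByCounting(List<String> items) {
        return (long) items.size();
    }

    // Builds a list of n placeholder strings.
    static List<String> sample(int n) {
        List<String> items = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            items.add("item" + i);
        }
        return items;
    }

    public static void main(String[] args) {
        for (int n = 0; n <= 4; n++) {
            List<String> items = sample(n);
            System.out.println(n + ": trusting=" + lengthTrustingEstimate(items)
                    + ", counting=" + lengthByCounting(items));
        }
    }
}
```

Under these assumptions, the estimate-trusting variant yields null, null, null,
3, 4 for the five inputs, while the counting variant yields 0, 1, 2, 3, 4. The
point is only that an estimate of 0 does not imply an empty collection.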
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)