[ 
https://issues.apache.org/jira/browse/CRUNCH-601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15425423#comment-15425423
 ] 

Mikael Goldmann commented on CRUNCH-601:
----------------------------------------

Regrettably, this test will still fail because getSize() returns 0.
{code}
  @Test
  public void sizeEstimateZero() throws Exception {
    Pipeline p = new SparkPipeline("local", "foobar");
    try {

      final PCollection<String> collection =
          p.emptyPCollection(Avros.strings()).parallelDo(new DoFn<String, String>() {
            @Override
            public void process(String input, Emitter<String> emitter) {
              emitter.emit(input);
            }

            @Override
            public void cleanup(Emitter<String> emitter) {
              emitter.emit("apelsin");
            }
          }, Avros.strings());

      final PObject<Long> length = collection.length();
      p.run();
      assertThat(length.getValue(), is(1L));
    } finally {
      p.done();
    }
  }
{code}
The reason is that only the empty collection contributes to the size estimate, so the element emitted in cleanup() is not reflected in it.
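
To illustrate the point without depending on Crunch or Spark, here is a minimal plain-Java sketch of the same situation. It is a toy model, not Crunch's actual implementation: the class SizeEstimateSketch and all of its methods are hypothetical names, and the estimate is simply assumed to be derived from the input alone. It shows how a shortcut that trusts a zero estimate disagrees with actually counting the output when cleanup() emits an element.

```java
import java.util.ArrayList;
import java.util.List;

public class SizeEstimateSketch {

    /** Hypothetical stand-in for the DoFn in the test: passes input through, emits once in cleanup. */
    static List<String> runWithCleanupEmit(List<String> input) {
        List<String> out = new ArrayList<>();
        for (String s : input) {
            out.add(s);          // process(): emit each input element unchanged
        }
        out.add("apelsin");      // cleanup(): emits even when the input was empty
        return out;
    }

    /** Toy size estimate (assumption): computed from the input only, scaled by a factor. */
    static long estimatedSize(List<String> input, double scaleFactor) {
        long bytes = 0;
        for (String s : input) {
            bytes += s.length();
        }
        return (long) (bytes * scaleFactor);
    }

    /** Flawed shortcut: treats an estimate of 0 as proof that the output is empty. */
    static long lengthTrustingEstimate(List<String> input) {
        if (estimatedSize(input, 1.0) == 0) {
            return 0L;
        }
        return runWithCleanupEmit(input).size();
    }

    /** Safe variant: always counts the elements that were actually emitted. */
    static long lengthByCounting(List<String> input) {
        return runWithCleanupEmit(input).size();
    }

    public static void main(String[] args) {
        List<String> empty = new ArrayList<>();
        System.out.println("estimate:          " + estimatedSize(empty, 1.0));
        System.out.println("trusting estimate: " + lengthTrustingEstimate(empty));
        System.out.println("by counting:       " + lengthByCounting(empty));
    }
}
```

In this toy model the estimate for an empty input is 0 while the counted length is 1, which is the mismatch the test above runs into. An estimate of 0 can only be used as a hint, not as a guarantee of emptiness.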

> Short PCollections in SparkPipeline get length null.
> ----------------------------------------------------
>
>                 Key: CRUNCH-601
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-601
>             Project: Crunch
>          Issue Type: Bug
>          Components: Spark
>    Affects Versions: 0.13.0
>         Environment: Running in local mode on Mac as well as in a ubuntu 
> 14.04 docker container
>            Reporter: Mikael Goldmann
>            Priority: Minor
>         Attachments: CRUNCH-601.patch, SmallCollectionLengthTest.java
>
>
> I'll attach a file with a test that I would expect to pass but which fails.
> It creates five PCollection<String> of lengths 0, 1, 2, 3, 4, gets the 
> lengths, runs the pipeline and prints the lengths. Finally it asserts that 
> all lengths are non-null.
> I would expect it to print lengths 0, 1, 2, 3, 4 and pass.
> What it does is print lengths null, null, null, 3, 4 and fail.
> I think the underlying reason is the use of getSize() on an unmaterialized 
> object and assuming that when the estimate that getSize() returns is 0, then 
> the PCollection is guaranteed to be empty, which is false in some cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
