[ 
https://issues.apache.org/jira/browse/CRUNCH-601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15423764#comment-15423764
 ] 

Micah Whitacre commented on CRUNCH-601:
---------------------------------------

So I think I figured out the reason but not really sure of the fix just yet.  
When doing a .length()[1] call it then calls Aggregate.length(this) and does 
this[2].

The issue is that the "count" PCollection in that method it has 4 different 
parents.  It does calculate the expected size of the PCollection to proactively 
materialize or not by using the scaleFactor * parent.  So for a collection of 
size 1 it is essentially calculating (.99f * (long)(.99f * (long)(.99f * 
(long)(.99f *1)))).  And casting to long causes it to round down.  so the first 
.99f * 1 = 0 when cast to a long.  So the smaller values are invalid because 
the scale factor makes their size go to zero which then makes the materialize 
call return the empty PCollection.

[1] - 
https://github.com/apache/crunch/blob/0b19717d105b58e58c1947eda6b673a387e330d0/crunch-core/src/main/java/org/apache/crunch/impl/dist/collect/PCollectionImpl.java#L284
[2] - 
https://github.com/apache/crunch/blob/0b19717d105b58e58c1947eda6b673a387e330d0/crunch-core/src/main/java/org/apache/crunch/lib/Aggregate.java#L94

> Short PCollections in SparkPipeline get length null.
> ----------------------------------------------------
>
>                 Key: CRUNCH-601
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-601
>             Project: Crunch
>          Issue Type: Bug
>          Components: Spark
>    Affects Versions: 0.13.0
>         Environment: Running in local mode on Mac as well as in a ubuntu 
> 14.04 docker container
>            Reporter: Mikael Goldmann
>            Priority: Minor
>         Attachments: SmallCollectionLengthTest.java
>
>
> I'll attach a file with a test that I would expect to pass but which fails.
> It creates five PCollection<String> of lengths 0, 1, 2, 3, 4 gets the 
> lengths, runs the pipeline and prints the lengths. Finally it asserts that 
> all lengths are non-null.
> I would expect it to print lengths 0, 1, 2, 3, 4 and pass.
> What it does is print lengths null, null, null, 3, 4 and fail.
> I think the underlying reason is the use of getSize() on an unmaterialized 
> object and assuming that when the estimate that getSize() returns is 0, then 
> the PCollection is guaranteed to be empty, which is false in some cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to