[ 
https://issues.apache.org/jira/browse/CRUNCH-57?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451248#comment-13451248
 ] 

Rahul Sharma commented on CRUNCH-57:
------------------------------------

Looking at sort, we put the data onto MR as a Writable key with a Void value. 
Any data written as a key on MR has to be WritableComparable, which means the 
converted format is comparable. The point is that the Writable type is 
comparable, not the type T. If we did it the other way, i.e. did not make the 
Writable comparable and relied on T being comparable instead, I do not think 
we could tap the power of Hadoop: the data could not be written as a key in 
Hadoop and so could not be sorted by it. That approach could turn out to be 
highly inefficient. 
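
A minimal sketch of that point in plain Java, with no Hadoop dependency. All 
names here (Point, PointKey, toKey, fromKey) are hypothetical and not part of 
Crunch; PointKey only stands in for a WritableComparable, and the in-memory 
sort stands in for the MR shuffle. The value type is deliberately not 
Comparable; only its converted key form is.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class WritableKeySortSketch {

    // Hypothetical value type T. Note that it does NOT implement Comparable.
    static final class Point {
        final int x;
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // Stand-in for the WritableComparable key form: the ordering lives here.
    static final class PointKey implements Comparable<PointKey> {
        final int x;
        final int y;
        PointKey(int x, int y) { this.x = x; this.y = y; }

        @Override
        public int compareTo(PointKey other) {
            int byX = Integer.compare(x, other.x);
            return byX != 0 ? byX : Integer.compare(y, other.y);
        }
    }

    // Conversion between T and its key form (assumed, for illustration only).
    static PointKey toKey(Point p) { return new PointKey(p.x, p.y); }
    static Point fromKey(PointKey k) { return new Point(k.x, k.y); }

    // Sorts values of T by sorting their comparable key form.
    static List<Point> sort(List<Point> input) {
        List<PointKey> keys = new ArrayList<>();
        for (Point p : input) {
            keys.add(toKey(p));
        }
        // Stand-in for the shuffle: keys are ordered by PointKey.compareTo.
        Collections.sort(keys);
        List<Point> sorted = new ArrayList<>();
        for (PointKey k : keys) {
            sorted.add(fromKey(k));
        }
        return sorted;
    }
}
```

Calling WritableKeySortSketch.sort on a list of Point values returns them 
ordered by x and then y, even though Point itself has no ordering.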

In the min/max aggregate functions we currently put the data in a 
Boolean-key/Writable-value format, and all the key values are false. We could 
instead implement the min/max APIs in a way that puts Hadoop's sorting to some 
use. I am not sure, but I believe this could be a bit more efficient than the 
current implementation. This would mean all the types would need to have a 
comparable representation. 
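
To make the two shapes concrete, here is a rough sketch in plain Java, again 
without Hadoop. The class and method names are hypothetical. The first method 
mimics the current style, where every value sits under one constant key and 
the reducer scans all of them; the second mimics the alternative, where the 
values are sorted keys and min/max are simply the first and last key. A 
TreeSet stands in for the sorted shuffle output. This only illustrates the 
idea and says nothing about actual performance on a cluster.

```java
import java.util.List;
import java.util.TreeSet;

final class MinMaxSketch {

    // Current style: all values share one key, so one reducer scans them all.
    static long minBySingleKeyScan(List<Long> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("no values");
        }
        long min = Long.MAX_VALUE;
        for (long v : values) {
            min = Math.min(min, v);
        }
        return min;
    }

    // Alternative style: values are written as keys and arrive sorted,
    // so min is the first key and max is the last key.
    static long[] minMaxBySortedKeys(List<Long> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("no values");
        }
        TreeSet<Long> sortedKeys = new TreeSet<>(values);
        return new long[] { sortedKeys.first(), sortedKeys.last() };
    }
}
```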
                
> Add a length function to PCollection
> ------------------------------------
>
>                 Key: CRUNCH-57
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-57
>             Project: Crunch
>          Issue Type: New Feature
>          Components: Core
>    Affects Versions: 0.3.0
>            Reporter: Kiyan Ahmadizadeh
>            Assignee: Josh Wills
>         Attachments: CRUNCH-57.patch
>
>
> Sometimes it's useful and interesting to compute the number of elements in a 
> PCollection.
>  
> For example, suppose there was an initial PCollection that was then filtered 
> into another.  If I'm interested in how many elements of the original 
> PCollection matched the filter, I'll have to write extra code to compute this.
> PCollections should have a length method that, when called, computes the 
> number of elements in the PCollection and returns the result. 

