[ 
https://issues.apache.org/jira/browse/CRUNCH-211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672592#comment-13672592
 ] 

Gabriel Reid commented on CRUNCH-211:
-------------------------------------

Yes, exactly, this is for situations where one side of the join is too big to 
fit in memory. 

I'm actually a bit torn as to whether or not this is a common-enough situation 
to warrant adding it to Crunch, but it's definitely something I need. 
                
> Add one-to-many join functionality
> ----------------------------------
>
>                 Key: CRUNCH-211
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-211
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Gabriel Reid
>         Attachments: CRUNCH-211.patch
>
>
> A common pattern is a join between two tables where the left-side table 
> contains a single value per key, and the right-side table contains multiple 
> values per key. An example of such a join would be a join between users and 
> web click entries:
>     PTable<Long,User> usersById = ...;
>     PTable<Long,WebClick> webClicksByUserId = ...;
> In such cases it is often desirable to process a User together with an
> Iterable of all of that user's WebClicks. The current join functionality
> replicates the User for each related WebClick, so each WebClick has to be
> handled in a separate call.
> Currently, the only way to get an Iterable of WebClicks together with a
> single User in a single method call is to materialize all WebClicks per
> user in memory using something like PTable#collectValues, and this approach
> breaks down when a user has a large number of WebClicks.
> The intention of this ticket is to add functionality whereby the User and 
> Iterable of WebClicks are available in a single method call, without the 
> Iterable of WebClicks being materialized in memory (i.e. a feasible approach 
> for millions or more WebClicks).
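
To make the idea in the description concrete, here is a minimal plain-Java
sketch of the pattern. It is not the code in CRUNCH-211.patch and does not use
the Crunch API. It assumes (my assumption, not stated in the ticket) that the
records for one key arrive as a single sorted stream in which the one
left-side value comes first and the right-side values follow. The names
OneToManyJoinSketch, Tagged and joinOneKey are hypothetical.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.BiFunction;

public class OneToManyJoinSketch {

    /** A value tagged with the side of the join it came from (hypothetical type). */
    public static final class Tagged<L, R> {
        final L left;   // set only for the single left-side record
        final R right;  // set only for right-side records

        private Tagged(L left, R right) {
            this.left = left;
            this.right = right;
        }

        public static <L, R> Tagged<L, R> left(L value) {
            return new Tagged<>(value, null);
        }

        public static <L, R> Tagged<L, R> right(R value) {
            return new Tagged<>(null, value);
        }
    }

    /**
     * Joins the records of one key. Assumes the stream is sorted so that the
     * left value, if present, is the first record. Returns null when the key
     * has no left value (inner-join behaviour, chosen for this sketch).
     */
    public static <L, R, T> T joinOneKey(Iterator<Tagged<L, R>> values,
                                         BiFunction<L, Iterable<R>, T> fn) {
        if (!values.hasNext()) {
            return null;
        }
        Tagged<L, R> first = values.next();
        if (first.left == null) {
            return null; // no left-side record for this key
        }
        // Lazy, single-pass view over the remaining records: right values are
        // pulled from the underlying stream one at a time, never collected.
        Iterable<R> rights = () -> new Iterator<R>() {
            @Override
            public boolean hasNext() {
                return values.hasNext();
            }

            @Override
            public R next() {
                return values.next().right;
            }
        };
        return fn.apply(first.left, rights);
    }

    public static void main(String[] args) {
        // Stand-ins for one User followed by that user's WebClicks.
        List<Tagged<String, String>> recordsForOneUser = Arrays.asList(
                Tagged.<String, String>left("alice"),
                Tagged.<String, String>right("/home"),
                Tagged.<String, String>right("/search"),
                Tagged.<String, String>right("/checkout"));

        String summary = joinOneKey(recordsForOneUser.iterator(), (user, clicks) -> {
            int count = 0;
            for (String click : clicks) {
                count++;
            }
            return user + " made " + count + " clicks";
        });
        System.out.println(summary); // prints "alice made 3 clicks"
    }
}
```

The trade-off in this sketch is that the Iterable can only be traversed once,
since it wraps the underlying stream instead of holding the values in memory.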

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
