[ 
https://issues.apache.org/jira/browse/CRUNCH-351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13909543#comment-13909543
 ] 

Gabriel Reid commented on CRUNCH-351:
-------------------------------------

Looks to me like this will indeed work a lot better when there are a lot of 
unique elements with a costly compare. I guess it will also be slower if the 
input PCollection contains a large number of duplicate items with a low-cost 
compare, as the shuffle would then become extremely lightweight in terms of IO 
in the current version, while this patch would mean that all the duplicate 
elements would still go through the full shuffle.

I'm guessing that Shard is going to be much more commonly used for 
heavier-weight and mostly-unique values, so it makes sense to optimize for that 
use case, so I'd say this one sounds like a good idea to me.

About the use of the random long as the key, I was thinking it might be even 
better to use a limited-range int as the generated key, limiting the range of 
the generated keys to be just large enough that we're sure that they'll be 
decently distributed over the partitions. That way the shuffle will (I think) 
become even lighter weight because there would be fewer unique keys to sort. 
Does that sound right?

Also on the subject of the random key generation, is there any drawback to 
using a constant value as the seed for the Random? If not, it might be better 
to just use a constant to avoid the dependency on the MapReduce framework.

> Improve performance of Shard#shard on large records
> ---------------------------------------------------
>
>                 Key: CRUNCH-351
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-351
>             Project: Crunch
>          Issue Type: Improvement
>            Reporter: Chao Shi
>            Assignee: Chao Shi
>         Attachments: crunch-351.patch
>
>
>     This avoids sorting on the input data, which may be long and make
>     shuffle phase slow. The improvement is to sort on pseudo-random numbers.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

Reply via email to