[ https://issues.apache.org/jira/browse/CRUNCH-527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004621#comment-15004621 ]

Preston Koprivica commented on CRUNCH-527:
------------------------------------------

Please tell me if I'm doing something strange here, but I think I'm seeing a 
slight change in behavior since the introduction of this change (we recently 
upgraded to 0.11.0-cdh5.4.8 from 0.11.0-cdh5.4.3).  It appears that if a job 
is configured by a MapReduceTarget (typically via 
#configureForMapReduce(Job job, ...)) and also has a groupBy() operation, the 
grouping options take precedence.  Prior to this change there was no default 
partitioner, so the target configuration always applied.  However, since this 
change was applied, the new default partitioner always takes precedence, 
regardless of the target configuration.

Was this an unintentional side effect, or was it always known behavior?  
Should we not be relying on the Target to configure the Job?
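
For reference, here is roughly the shape of the pipeline I'm describing. This 
is only a sketch: ParseEventFn, PartitionerSettingTarget and the partitioner it 
installs are hypothetical stand-ins for our own classes, not part of Crunch.

{code}
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.fn.Aggregators;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class Example {
  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(Example.class, new Configuration());
    PTable<String, Long> counts = pipeline.readTextFile("/input/events")
        .parallelDo(new ParseEventFn(), // hypothetical DoFn emitting (key, 1L) pairs
            Writables.tableOf(Writables.strings(), Writables.longs()))
        .groupByKey()                   // grouping stage: now installs the
                                        // default hash-based partitioner
        .combineValues(Aggregators.SUM_LONGS());

    // Our target's configureForMapReduce(Job, ...) sets its own partitioner on
    // the Job, but the partitioner coming from the grouping now wins.
    pipeline.write(counts, new PartitionerSettingTarget(new Path("/output/events")));
    pipeline.done();
  }
}
{code}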


> Improve distribution of keys when using default (hash-based) partitioning
> -------------------------------------------------------------------------
>
>                 Key: CRUNCH-527
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-527
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Gabriel Reid
>            Assignee: Gabriel Reid
>             Fix For: 0.13.0
>
>         Attachments: CRUNCH-527.patch
>
>
> The default partitioner used for MR-based pipelines bases itself on the hash 
> code of keys modulo the number of partitions, along the lines of 
> {code}int partition = key.hashCode() % numPartitions{code}
> This approach depends on the _lower bits_ of the hash code being uniformly 
> distributed. If the lower bits of the key hash codes are not uniformly 
> distributed, the key space will not be uniformly distributed over the 
> partitions.
> It can be surprisingly easy to get a very poor distribution. For example, if 
> the keys are integer values and are all divisible by 2, then only half of the 
> partitions will receive data (as the hash code of an integer is the integer 
> value itself).
> This can even be a problem in situations where you would really not expect 
> it. For example, taking the byte-array representation of longs for each 
> timestamp of each second over a period of 24 hours (at millisecond 
> granularity) and partitioning it over 50 partitions results in 34 of the 50 
> partitions not getting any data at all.
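> A quick way to check this kind of skew (only a rough sketch: it assumes an 
> Arrays.hashCode-style hash over the 8-byte big-endian timestamp, so the exact 
> count may differ from a real job):
> {code}
> import java.nio.ByteBuffer;
> import java.util.Arrays;
> import java.util.HashSet;
> import java.util.Set;
>
> public class PartitionSkew {
>   public static void main(String[] args) {
>     int numPartitions = 50;
>     Set<Integer> used = new HashSet<Integer>();
>     long start = 1420070400000L; // an arbitrary midnight, in milliseconds
>     // One key per second over 24 hours, encoded as an 8-byte timestamp.
>     for (long t = start; t < start + 24L * 3600 * 1000; t += 1000) {
>       byte[] key = ByteBuffer.allocate(8).putLong(t).array();
>       used.add((Arrays.hashCode(key) & Integer.MAX_VALUE) % numPartitions);
>     }
>     System.out.println("partitions hit: " + used.size() + " of " + numPartitions);
>   }
> }
> {code}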
> The easiest way to resolve this is to have a custom HashPartitioner that 
> applies a supplementary hash function to the return value of the key's 
> hashCode method. This same approach is taken in java.util.HashMap for the 
> same reason.
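> A minimal sketch of such a partitioner (the spreading function below follows 
> the pre-Java-8 java.util.HashMap supplemental hash; the attached patch may use 
> a different function):
> {code}
> import org.apache.hadoop.mapreduce.Partitioner;
>
> public class SmearedHashPartitioner<K, V> extends Partitioner<K, V> {
>   // Supplemental hash (as in pre-Java-8 java.util.HashMap) to spread
>   // poor-quality hash codes into the lower bits.
>   private static int smear(int h) {
>     h ^= (h >>> 20) ^ (h >>> 12);
>     return h ^ (h >>> 7) ^ (h >>> 4);
>   }
>
>   @Override
>   public int getPartition(K key, V value, int numPartitions) {
>     return (smear(key.hashCode()) & Integer.MAX_VALUE) % numPartitions;
>   }
> }
> {code}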
> Note that this same approach was proposed in MAPREDUCE-4827, but wasn't 
> committed (mostly) because of backwards compatibility issues (some people may 
> have counted on certain records showing up in a given output file). Seeing as 
> Crunch is a higher abstraction above MR, I assume that we don't need to worry 
> about the backwards compatibility issue as much, but there may be other 
> opinions on this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
