There is a suggestion to set the number of reducers to a prime number
closest to the number of nodes and number of mappers a prime number
closest to several times the number of nodes in the cluster. But there
is also saying that "There is no need for the number of reduces to be
prime. The only thing it helps is if you are using the HashPartitioner
and your key's hash function is too linear. In practice, you usually
want to use 99% of your reduce capacity of the cluster."
Could anyone explain what is the theory behind the prime number and the
hash function here?
Shi
- Prime number of reduces vs. linear hash function Shi Yu
-