Hi Vadim,

Thank you for your response.

I would like to know how the partitioner chooses the key. Looking at my
example, the following questions arise:
1. In the case of rdd1, hash partitioning should calculate the hashcode of
the key (i.e. *"aa"* in this case), so *all records should go to a single
partition* instead of being uniformly distributed? (See the sketch after
this list.)
 2. In the case of rdd2, there is no key-value pair, so how is hash
partitioning going to work, i.e. *what is the key* used to calculate the
hashcode?
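
A minimal sketch of what I mean, assuming Spark's HashPartitioner with 8
partitions as in my rdd1 example below (run in spark-shell):

    import org.apache.spark.HashPartitioner

    val partitioner = new HashPartitioner(8)  // same partition count as rdd1
    "aa".hashCode                    // identical hash code for every "aa" key
    partitioner.getPartition("aa")   // so every ("aa", _) pair should land in the same partition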



Best Regards,


Vikash Pareek
Team Lead  *InfoObjects Inc.*
Big Data Analytics

m: +91 8800206898 a: E5, Jhalana Institutional Area, Jaipur, Rajasthan 302004
w: www.linkedin.com/in/pvikash e: vikash.par...@infoobjects.com




On Fri, Jun 23, 2017 at 10:38 PM, Vadim Semenov <vadim.seme...@datadoghq.com
> wrote:

> This is the code that chooses the partition for a key:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L85-L88
>
> it's basically `math.abs(key.hashCode % numberOfPartitions)`
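>
> For example, a rough check in spark-shell using the same formula (the
> numbers below are just from this example):
>
>     val numberOfPartitions = 8
>     val key = "aa"
>     math.abs(key.hashCode % numberOfPartitions)  // "aa".hashCode is 3104, so 3104 % 8 = 0
>     // every record with the key "aa" therefore maps to the same partition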
>
> On Fri, Jun 23, 2017 at 3:42 AM, Vikash Pareek <
> vikash.par...@infoobjects.com> wrote:
>
>> I am trying to understand how Spark partitioning works.
>>
>> To understand this, I have the following piece of code on Spark 1.6:
>>
>>     def countByPartition1(rdd: RDD[(String, Int)]) = {
>>         rdd.mapPartitions(iter => Iterator(iter.length))
>>     }
>>     def countByPartition2(rdd: RDD[String]) = {
>>         rdd.mapPartitions(iter => Iterator(iter.length))
>>     }
>>
>> //RDDs Creation
>>     val rdd1 = sc.parallelize(Array(("aa", 1), ("aa", 1), ("aa", 1), ("aa", 1)), 8)
>>     countByPartition1(rdd1).collect()
>> >> Array[Int] = Array(0, 1, 0, 1, 0, 1, 0, 1)
>>
>>     val rdd2 = sc.parallelize(Array("aa", "aa", "aa", "aa"), 8)
>>     countByPartition2(rdd2).collect()
>> >> Array[Int] = Array(0, 1, 0, 1, 0, 1, 0, 1)
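>>
>> A quick way to see which partitioner these RDDs actually carry (sketch,
>> run in spark-shell):
>>
>>     rdd1.partitioner  // Option[Partitioner]; None for RDDs created via sc.parallelize
>>     rdd2.partitioner  // None here as well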
>>
>> In both cases, the data is distributed uniformly.
>> I have the following questions based on this observation:
>>
>>  1. In the case of rdd1, hash partitioning should calculate the hashcode
>> of the key (i.e. "aa" in this case), so all records should go to a single
>> partition instead of being uniformly distributed?
>>  2. In the case of rdd2, there is no key-value pair, so how is hash
>> partitioning going to work, i.e. what is the key used to calculate the
>> hashcode?
>>
>> I have gone through @zero323's answer below, but it does not answer these
>> questions.
>> https://stackoverflow.com/questions/31424396/how-does-hashpartitioner-work
>>
>>
>>
>>
>> -----
>>
>> __Vikash Pareek
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-does-HashPartitioner-distribute-data-in-Spark-tp28785.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
