I am trying to understand how spark partitoing works.

To understand this I have following piece of code on spark 1.6

    def countByPartition1(rdd: RDD[(String, Int)]) = {
        rdd.mapPartitions(iter => Iterator(iter.length))
    }
    def countByPartition2(rdd: RDD[String]) = {
        rdd.mapPartitions(iter => Iterator(iter.length))
    }

//RDDs Creation
    val rdd1 = sc.parallelize(Array(("aa", 1), ("aa", 1), ("aa", 1), ("aa",
1)), 8)
    countByPartition(rdd1).collect()
>> Array[Int] = Array(0, 1, 0, 1, 0, 1, 0, 1)

    val rdd2 = sc.parallelize(Array("aa", "aa", "aa", "aa"), 8)
    countByPartition(rdd2).collect()
>> Array[Int] = Array(0, 1, 0, 1, 0, 1, 0, 1)
 
In both the cases data is distributed uniformaly.
I do have following questions on the basis of above observation:

 1. In case of rdd1, hash partitioning should calculate hashcode of key
(i.e. "aa" in this case), so all records should go to single partition
instead of uniform distribution?
 2. In case of rdd2, there is no key value pair so how hash partitoning
going to work i.e. what is the key to calculate hashcode?  

I have followed @zero323 answer but not getting answer of these.
https://stackoverflow.com/questions/31424396/how-does-hashpartitioner-work




-----

__Vikash Pareek
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-does-HashPartitioner-distribute-data-in-Spark-tp28785.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Reply via email to