Hello all, I have a list of log entries that I want to partition by a key. I map the list to an RDD of (String, String) pairs. I then count the unique keys and use that count as the number of partitions, calling RDD.partitionBy(new HashPartitioner(# of partitions)). When I look at the results, many of the partitions are empty, while others contain keys that I expected to end up in a different partition. Any idea why this is? I have also tried it with the RangePartitioner, with the same result. Some of the partitions will be very small: some will have only 5 entries while others have millions (in case this helps). I have tried running the same program with Cloudera's MapReduce and Hive, and it seems to work on that platform. Is there something I'm missing?
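To make the symptom concrete: as far as I understand, a hash partitioner places a key in partition hash(key) mod numPartitions, so setting the number of partitions to the number of unique keys does not by itself guarantee one key per partition. Below is a minimal plain-Python sketch (no Spark involved) of that behaviour. The keys are made up, and the helper names `java_string_hash`, `hash_partition`, `assign` and `exact_partition_index` are hypothetical, introduced only for this illustration. The hash mimics Java's String.hashCode for simple ASCII keys.

```python
def java_string_hash(s):
    """Mimic Java's String.hashCode as a signed 32-bit int (ASCII keys assumed)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def hash_partition(key, num_partitions):
    """Assumed rule: non-negative hash(key) mod num_partitions."""
    # Python's % is already non-negative for a positive modulus.
    return java_string_hash(key) % num_partitions


def assign(keys, num_partitions):
    """Group keys by the partition index the hash rule gives them."""
    partitions = {}
    for key in keys:
        partitions.setdefault(hash_partition(key, num_partitions), set()).add(key)
    return partitions


def exact_partition_index(keys):
    """Alternative: an explicit key -> index table, one partition per key."""
    return {key: i for i, key in enumerate(sorted(set(keys)))}


keys = ["a", "c", "e", "g"]      # 4 made-up unique keys
print(assign(keys, len(keys)))   # partitions 0 and 2 get nothing; 1 and 3 get two keys each
print(exact_partition_index(keys))
```

With these four keys and four partitions, two partitions stay empty and the other two each hold two keys, which looks like what I am seeing. The explicit lookup table in `exact_partition_index` is the kind of mapping I assume a custom Partitioner would need in order to get exactly one key per partition.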
Thanks, Erik
