Hello all,

I have a list of log entries that I want to partition by a key. I map
the list to an RDD of (String, String) pairs. I then count the number
of unique keys and use that as the number of partitions. I call
rdd.partitionBy(new HashPartitioner(number of partitions)). When I look
at the results, many of the partitions are empty, while others contain
keys that should not be in them. Any idea why this is? I have also
tried it with the RangePartitioner, with the same result. Some of the
partitions should be very small: some will have only 5 entries while
others have millions (if this helps). I have tried running the same
program with Cloudera's MapReduce and Hive, and it seems to work on
that platform. Is there something I'm missing?
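
For illustration, here is a minimal sketch of how a hash partitioner
assigns keys when the number of partitions equals the number of unique
keys. It is plain Java rather than the actual Spark job: the class name,
method names, and sample keys are hypothetical, and it assumes the
partition index is the key's hashCode modulo the partition count, shifted
to be non-negative.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative stand-in for hash partitioning; this is not Spark code.
final class HashPartitionSketch {

    // Assumed rule: partition index = key.hashCode() mod numPartitions,
    // mapped into the range [0, numPartitions).
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Counts how many partitions receive at least one of the given keys.
    static int nonEmptyPartitions(List<String> keys, int numPartitions) {
        Set<Integer> used = new HashSet<>();
        for (String key : keys) {
            used.add(partitionFor(key, numPartitions));
        }
        return used.size();
    }

    public static void main(String[] args) {
        // Hypothetical keys: three distinct keys, so three partitions.
        List<String> keys = Arrays.asList("a", "d", "g");
        int numPartitions = keys.size();
        for (String key : keys) {
            System.out.println(key + " -> partition "
                    + partitionFor(key, numPartitions));
        }
        System.out.println("non-empty partitions: "
                + nonEmptyPartitions(keys, numPartitions)
                + " of " + numPartitions);
    }
}
```

With these sample keys all three hash to the same index, so one
partition holds every key and the other two stay empty. Under this
assumption, having as many partitions as unique keys does not by itself
give one key per partition; that would seem to need a custom Partitioner
that maps each distinct key to its own index.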


Thanks,

Erik
