If you want to access the keys in an RDD that is partitioned by key, you can
use RDD.mapPartitions(), which gives you access to the whole partition as an
Iterator<Tuple2<K, V>>. You have the option of maintaining the partitioning
information or not by setting the preservesPartitioning flag in
mapPartitions (see docs). But use it at your own risk. If you modify the
keys and yet preserve partitioning, the partitioning no longer makes sense,
as the hashes of the keys have changed.
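To make this concrete, here is a minimal plain-Java sketch (no Spark dependency, so names like PartitionTransform and doublePartitionValues are illustrative, not from Spark) of per-partition logic that sees every (key, value) pair but rewrites only the values. Because the keys are untouched, their hash partitioning stays valid, so it would be safe to keep preservesPartitioning = true; in the Java API you would wrap such logic in a PairFlatMapFunction and pass it to JavaPairRDD.mapPartitionsToPair(fn, true).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class PartitionTransform {

    // Per-partition logic: receives an iterator over every (key, value)
    // pair in one partition. It changes only the VALUES, never the keys,
    // so the existing hash partitioning of the keys remains correct and
    // preservesPartitioning = true would be legitimate in Spark.
    static List<Map.Entry<String, Integer>> doublePartitionValues(
            Iterator<Map.Entry<String, Integer>> partition) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        while (partition.hasNext()) {
            Map.Entry<String, Integer> kv = partition.next();
            out.add(new SimpleEntry<>(kv.getKey(), kv.getValue() * 2));
        }
        return out;
    }
    // In Spark this body would go inside a PairFlatMapFunction and be
    // passed to JavaPairRDD.mapPartitionsToPair(fn, /* preservesPartitioning = */ true).
    // If it instead emitted pairs with MODIFIED keys while still claiming
    // preservesPartitioning = true, downstream by-key operations could
    // silently look up keys in the wrong partition.
}
```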

TD



On Fri, Mar 13, 2015 at 2:26 PM, Mohit Anchlia <mohitanch...@gmail.com>
wrote:

> I am trying to look for a documentation on partitioning, which I can't
> seem to find. I am looking at spark streaming and was wondering how does it
> partition RDD in a multi node environment. Where are the keys defined that
> is used for partitioning? For instance in below example keys seem to be
> implicit:
>
> Which one is key and which one is value? Or is it called a flatMap because
> there are no keys?
>
> // Split each line into words
> JavaDStream<String> words = lines.flatMap(
>   new FlatMapFunction<String, String>() {
>     @Override public Iterable<String> call(String x) {
>       return Arrays.asList(x.split(" "));
>     }
>   });
>
>
> And are Keys available inside of Function2 in case it's required for a
> given use case ?
>
>
> JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey(
>   new Function2<Integer, Integer, Integer>() {
>     @Override public Integer call(Integer i1, Integer i2) throws Exception {
>       return i1 + i2;
>     }
>   });
>
