If you want to learn how Spark partitions data based on keys, here is a recent
talk on the topic:
http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs?related=1
Of course, you can also read the original Spark paper:
https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

On Fri, Mar 13, 2015 at 3:52 PM, Gerard Maas <gerard.m...@gmail.com> wrote:

> In spark-streaming, the consumers will fetch data and put it into
> 'blocks'. Each block becomes a partition of the RDD generated during that
> batch interval. The size of each block is controlled by the conf
> 'spark.streaming.blockInterval': that is, the amount of data the consumer
> can collect in that time.
>
> The number of RDD partitions in a streaming interval will then be:
> batch interval / spark.streaming.blockInterval * # of consumers.
>
> -kr, Gerard
>
> On Mar 13, 2015 11:18 PM, "Mohit Anchlia" <mohitanch...@gmail.com> wrote:
>
>> I still don't follow how Spark partitions data in a multi-node
>> environment. Is there a document on how Spark does partitioning of data?
>> For example, in the word-count example, how does Spark distribute words
>> to multiple nodes?
>>
>> On Fri, Mar 13, 2015 at 3:01 PM, Tathagata Das <t...@databricks.com>
>> wrote:
>>
>>> If you want to access the keys in an RDD that is partitioned by key,
>>> you can use RDD.mapPartitions(), which gives you access to the whole
>>> partition as an iterator<key, value>. You have the option of
>>> maintaining the partitioning information or not by setting the
>>> preservesPartitioning flag in mapPartitions (see docs). But use it at
>>> your own risk: if you modify the keys and yet preserve partitioning,
>>> the partitioning no longer makes sense, as the hashes of the keys have
>>> changed.
>>>
>>> TD
>>>
>>> On Fri, Mar 13, 2015 at 2:26 PM, Mohit Anchlia <mohitanch...@gmail.com>
>>> wrote:
>>>
>>>> I am trying to find documentation on partitioning, which I can't seem
>>>> to locate. I am looking at Spark Streaming and was wondering how it
>>>> partitions an RDD in a multi-node environment. Where are the keys
>>>> defined that are used for partitioning?
>>>>
>>>> For instance, in the example below the keys seem to be implicit.
>>>> Which one is the key and which one is the value? Or is it called a
>>>> flatMap because there are no keys?
>>>>
>>>> // Split each line into words
>>>> JavaDStream<String> words = lines.flatMap(
>>>>     new FlatMapFunction<String, String>() {
>>>>       @Override public Iterable<String> call(String x) {
>>>>         return Arrays.asList(x.split(" "));
>>>>       }
>>>>     });
>>>>
>>>> And are keys available inside Function2, in case they are required
>>>> for a given use case?
>>>>
>>>> JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey(
>>>>     new Function2<Integer, Integer, Integer>() {
>>>>       @Override public Integer call(Integer i1, Integer i2)
>>>>           throws Exception {
>>>>         return i1 + i2;
>>>>       }
>>>>     });
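To make Gerard's formula concrete, here is a small worked example. The class and method names are made up for illustration; the 200 ms block interval is Spark's documented default for spark.streaming.blockInterval, and the 2-second batch interval is an assumed figure.

```java
// Worked example of the formula from the thread:
//   partitions per batch = batch interval / spark.streaming.blockInterval
//                          * number of receivers (consumers)
// Class and method names are hypothetical, not Spark API.
public class StreamingPartitionCount {

    static long partitionsPerBatch(long batchIntervalMs,
                                   long blockIntervalMs,
                                   int numReceivers) {
        return (batchIntervalMs / blockIntervalMs) * numReceivers;
    }

    public static void main(String[] args) {
        // A 2 s batch with the default 200 ms block interval and a single
        // receiver produces 10 blocks, hence 10 RDD partitions per batch.
        System.out.println(partitionsPerBatch(2000, 200, 1)); // prints 10

        // Three receivers each produce their own blocks in parallel.
        System.out.println(partitionsPerBatch(2000, 200, 3)); // prints 30
    }
}
```

This also shows why shrinking the block interval increases parallelism: more blocks per batch means more partitions for the cluster to process concurrently.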
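On Mohit's question of where the keys come from: in the word-count example the key only appears once you build a pair (word, 1), and Spark's default HashPartitioner then routes each key to a partition using a non-negative modulo of its hashCode. The sketch below is a simplified re-implementation of that routing logic in plain Java, not Spark's actual class.

```java
// Simplified sketch of how a hash partitioner maps a key to a partition
// index. Spark's real HashPartitioner does essentially this with the
// key's hashCode; the class here is illustrative only.
public class HashPartitionSketch {

    // A plain (x % mod) can be negative in Java when hashCode() is
    // negative, so the remainder is shifted into [0, mod).
    static int nonNegativeMod(int x, int mod) {
        int r = x % mod;
        return r < 0 ? r + mod : r;
    }

    static int partitionFor(Object key, int numPartitions) {
        return nonNegativeMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        for (String word : new String[]{"spark", "streaming", "shuffle"}) {
            System.out.println(word + " -> partition "
                + partitionFor(word, numPartitions));
        }
    }
}
```

Because the partition index is derived from hashCode, this is also why TD warns against changing keys while setting preservesPartitioning: the old partition assignment no longer matches the new keys' hashes.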