Is this what you are looking for? In Shark, the default number of reducers is 1 and is controlled by the property mapred.reduce.tasks. Spark SQL deprecates this property in favor of spark.sql.shuffle.partitions, whose default value is 200. Users may customize this property via SET:

SET spark.sql.shuffle.partitions=10;
SELECT page, count(*) c
FROM logs_last_month_cached
GROUP BY page
ORDER BY c DESC
LIMIT 10;
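As a sketch, the same property can also be supplied when the application is launched, instead of via a SQL SET statement (the jar and class names below are placeholders, not from this thread):

```shell
# Pass spark.sql.shuffle.partitions per application via spark-submit;
# this avoids hard-coding the value in the SQL itself.
spark-submit \
  --conf spark.sql.shuffle.partitions=10 \
  --class com.example.MyApp \
  myapp.jar
```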
Spark SQL Programming Guide - Spark 1.1.0 Documentation (spark.apache.org)

________________________________
From: shahab <shahab.mok...@gmail.com>
To: user@spark.apache.org
Sent: Tuesday, October 28, 2014 3:20 PM
Subject: How can number of partitions be set in "spark-env.sh"?

I am running a standalone Spark cluster with 2 workers, each with 2 cores. Apparently I am loading and processing a relatively large chunk of data, so I receive a task failure " ". From some posts and discussions on the mailing list, the failures could be related to the large amount of data processed per partition; if I have understood correctly, I should have smaller (but more numerous) partitions?! Is there any way to set the number of partitions dynamically in "spark-env.sh" or in the submitted Spark application?

best,
/Shahab
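[Editor's note] For plain RDD jobs (as opposed to Spark SQL queries), the default partition count for shuffles is governed by spark.default.parallelism. A sketch of setting it per application rather than in spark-env.sh (the jar and class names are placeholders):

```shell
# spark.default.parallelism sets the default number of partitions used by
# shuffle operations such as reduceByKey and join on plain RDDs; passing it
# with --conf scopes the setting to this one submitted application.
spark-submit \
  --conf spark.default.parallelism=16 \
  --class com.example.MyApp \
  myapp.jar
```

Individual RDDs can also be resized at runtime with rdd.repartition(n), which gives finer control than a cluster-wide default.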