Re: Splitting RDD to exact number of partitions

2016-05-31 Thread Ovidiu-Cristian MARCU
Hi Ted, any chance you could expand on the SQLConf parameters, i.e. give more explanation of what changing these settings actually does? Not all of them are made clear in the descriptions. Thanks! Best, Ovidiu > On 31 May 2016, at 16:30, Ted Yu wrote: > > Maciej: > You can refer

Re: Splitting RDD to exact number of partitions

2016-05-31 Thread Takeshi Yamamuro
If you don't mind using the newest version, you can try v2.0-preview. http://spark.apache.org/news/spark-2.0.0-preview.html There, you can control the number of input partitions without shuffles via the two parameters below: spark.sql.files.maxPartitionBytes spark.sql.files.openCostInBytes ( Not
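A minimal sketch of how those two settings might be applied on the 2.0-preview, assuming the SparkSession builder API; the path and byte values here are illustrative, not taken from the thread:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("input-partition-control")
      // Cap how many bytes of input go into a single partition (~32 MB here).
      .config("spark.sql.files.maxPartitionBytes", (32L * 1024 * 1024).toString)
      // Estimated cost, in bytes, of opening a file; larger values pack
      // more small files into one partition.
      .config("spark.sql.files.openCostInBytes", (4L * 1024 * 1024).toString)
      .getOrCreate()

    // File-based Dataset reads derive their partition count from these
    // settings, with no shuffle involved.
    val lines = spark.read.textFile("hdfs:///path/to/perf_test1.csv")
    println(lines.rdd.getNumPartitions)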

Re: Splitting RDD to exact number of partitions

2016-05-31 Thread Ted Yu
Maciej: You can refer to the doc in sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala for these parameters. On Tue, May 31, 2016 at 7:27 AM, Takeshi Yamamuro wrote: > If you don't mind using the newest version, you can try v2.0-preview. >
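If reading the source is inconvenient, the descriptions can also be listed at runtime: the SQL "SET -v" command prints each configuration key alongside its doc string. A hedged sketch, assuming a 2.0-style `spark` session (on 1.6 it would be `sqlContext.sql(...)`; exact output columns vary by release):

    // List SQLConf keys, current values, and their descriptions at runtime.
    spark.sql("SET -v").show(200, truncate = false)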

Re: Splitting RDD to exact number of partitions

2016-05-31 Thread Maciej Sokołowski
Thanks. Under what conditions can the number of partitions be higher than minPartitions when reading a textFile? Should this be considered an infrequent situation? To sum up - is there a more efficient way to ensure an exact number of partitions than the following: rdd = sc.textFile("perf_test1.csv",
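For what it's worth, if the split-derived count must be forced to an exact number, a shuffle is hard to avoid; repartition(n) is simply shorthand for coalesce(n, shuffle = true), so the two spellings below are equivalent. A small sketch using the file name from the thread:

    val rdd = sc.textFile("perf_test1.csv", minPartitions = 128)

    // repartition(n) is defined as coalesce(n, shuffle = true), so these
    // two calls are equivalent and both incur a full shuffle.
    val exact1 = rdd.coalesce(128, shuffle = true)
    val exact2 = rdd.repartition(128)

    println(exact1.getNumPartitions) // 128
    println(exact2.getNumPartitions) // 128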

Re: Splitting RDD to exact number of partitions

2016-05-31 Thread Maciej Sokołowski
After setting shuffle to true I get the expected 128 partitions, but I'm worried about the performance of such a solution - in particular, I can see that some shuffling is done because the sizes of the partitions change: scala> sc.textFile("hdfs:///proj/dFAB_test/testdata/perf_test1.csv", minPartitions=128).coalesce(128,
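One way to quantify how evenly the shuffle redistributed the data is to count records per partition; a minimal sketch, assuming the RDD from the snippet above is bound to rdd:

    // Count records in each partition to inspect the post-shuffle balance.
    val counts = rdd
      .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
      .collect()
    counts.foreach { case (idx, n) => println(s"partition $idx: $n records") }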

Re: Splitting RDD to exact number of partitions

2016-05-31 Thread Ted Yu
The value for shuffle is false by default. Have you tried setting it to true? Which Spark release are you using? On Tue, May 31, 2016 at 6:13 AM, Maciej Sokołowski wrote: > Hello Spark users and developers. > > I read a file and want to ensure that it has an exact number of
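To illustrate the default: with shuffle = false, coalesce can only merge partitions and silently keeps the current count when asked to grow it. A small sketch with synthetic data:

    // coalesce(numPartitions, shuffle = false) never increases the count.
    val r = sc.parallelize(1 to 1000, 100)
    println(r.coalesce(128).getNumPartitions)                 // still 100
    println(r.coalesce(128, shuffle = true).getNumPartitions) // 128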

Splitting RDD to exact number of partitions

2016-05-31 Thread Maciej Sokołowski
Hello Spark users and developers. I read a file and want to ensure that it has an exact number of partitions, for example 128. In the documentation I found: def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] But the argument here is the minimal number of partitions, so I use
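The question in miniature, as a runnable sketch: minPartitions is only a lower bound, and the actual count is derived from the Hadoop input splits, so it can come out higher than requested:

    val rdd = sc.textFile("perf_test1.csv", minPartitions = 128)
    // The real count follows the input splits (roughly one per HDFS block)
    // and may exceed 128 for a large file.
    println(rdd.getNumPartitions)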
