Re: textFile partitions

2015-02-09 Thread Kostas Sakellis
The partitions parameter to textFile is actually minPartitions, so there will
be at least that level of parallelism. Spark delegates to Hadoop to compute
the splits for that file (yes, even for a text file on local disk and not HDFS).
You can take a look at the code in FileInputFormat - but briefly, it computes
a goal split size from the requested number of partitions and creates at least
that many splits. It can create more splits than you asked for, which is why
you can end up with 102 rather than exactly 100.
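The split math can be sketched roughly like this (a simplified rendering of Hadoop's FileInputFormat.getSplits for a single file; SPLIT_SLOP matches Hadoop's constant, but the 3,669-byte file size below is a made-up example, not the actual size of README.md):

```scala
// Simplified sketch of Hadoop's FileInputFormat split sizing (the real
// getSplits also handles multiple files, a configurable minSize, and
// block locations). Everything here is an approximation for illustration.
object SplitSketch {
  val SPLIT_SLOP = 1.1 // the last chunk may be up to 10% larger than splitSize

  // Number of splits for one file of `fileSize` bytes when `numSplits`
  // (what Spark passes through as minPartitions) is requested.
  def countSplits(fileSize: Long, numSplits: Int,
                  blockSize: Long = 32L * 1024 * 1024): Int = {
    val goalSize  = fileSize / math.max(numSplits, 1)
    val splitSize = math.max(1L, math.min(goalSize, blockSize))
    var remaining = fileSize
    var splits = 0
    while (remaining.toDouble / splitSize > SPLIT_SLOP) {
      remaining -= splitSize
      splits += 1
    }
    if (remaining != 0) splits += 1 // leftover bytes become one final split
    splits
  }

  def main(args: Array[String]): Unit = {
    // A hypothetical 3,669-byte file asked for 100 splits: goalSize is
    // 36 bytes (integer division), and the 33 leftover bytes plus rounding
    // push the count past 100.
    println(SplitSketch.countSplits(3669L, 100)) // prints 102
    println(SplitSketch.countSplits(3600L, 100)) // prints 100
  }
}
```

So whenever the file size is not an exact multiple of the computed goal size, the trailing remainder can add an extra split or two beyond the minimum you requested.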

Hope this helps,
Kostas

On Mon, Feb 9, 2015 at 8:00 PM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:

 Hi folks, puzzled by something pretty simple:

 I have a standalone cluster with default parallelism of 2, spark-shell
 running with 2 cores

 sc.textFile("README.md").partitions.size returns 2 (this makes sense)
 sc.textFile("README.md").coalesce(100, true).partitions.size returns 100,
 also makes sense

 but

 sc.textFile("README.md", 100).partitions.size
  gives 102 -- I was expecting this to be equivalent to the last statement
 (i.e. result in 100 partitions)

 I'd appreciate it if someone could enlighten me as to why I end up with 102.
 This is on Spark 1.2.

 thanks


