The partitions parameter to textFile is minPartitions, so you get at least
that level of parallelism. Spark delegates to Hadoop to compute the splits
for the file (yes, even for a text file on local disk rather than HDFS).
You can take a look at the code in FileInputFormat - but briefly, it
computes a split size from the requested minimum and the block size, and
creates at least the number of splits passed into it. It is free to create
more splits than that minimum, which is why you can end up with slightly
more partitions than you asked for.
Hope this helps,
Kostas
On Mon, Feb 9, 2015 at 8:00 PM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
Hi folks, puzzled by something pretty simple:
I have a standalone cluster with default parallelism of 2, and spark-shell
running with 2 cores.
sc.textFile("README.md").partitions.size returns 2 (this makes sense)
sc.textFile("README.md").coalesce(100, true).partitions.size returns 100,
which also makes sense
but
sc.textFile("README.md", 100).partitions.size
gives 102 -- I was expecting this to be equivalent to the last statement
(i.e. to result in 100 partitions).
I'd appreciate it if someone can enlighten me as to why I end up with 102.
This is on Spark 1.2
thanks