Dear Spark developers,
I have 100 binary files in the local file system that I want to load into a
Spark RDD. I need the data from each file to be in a separate partition.
However, I cannot make this happen:
scala> sc.binaryFiles("/data/subset").partitions.size
res5: Int = 66
The "minPartitions" parameter does not seem to help:
scala> sc.binaryFiles("/data/subset", minPartitions = 100).partitions.size
res8: Int = 66
At the same time, Spark produces the required number of partitions with
sc.textFile (though I cannot use it because my files are binary):
scala> sc.textFile("/data/subset").partitions.size
res9: Int = 100
Could you suggest how to force Spark to load each binary file into a separate
partition?
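
For what it's worth, the only workaround I can think of is to list the files
myself and give each path its own partition, roughly like this (a minimal
sketch, assuming /data/subset is readable from every executor node):

import java.nio.file.{Files, Paths}

// List the files on the driver, then put each path in its own slice.
val paths = new java.io.File("/data/subset").listFiles.map(_.getPath).toSeq
// One slice per file, so each file's bytes land in a separate partition.
val rdd = sc.parallelize(paths, numSlices = paths.size)
  .map(p => (p, Files.readAllBytes(Paths.get(p))))

But I would much rather use binaryFiles directly if there is a way.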
Best regards, Alexander