I'm using Spark Streaming to create a large volume of small nano-batch input files (~4k per file), so there are thousands of 'part-xxxxx' files. When I read the nano-batch files and run a cooccurrence calculation, all tasks run only on the machine where the driver was launched. I'm launching in "yarn-client" mode. The RDD is created with sc.textFile("a list of thousands of files").
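For reference, the read looks roughly like the sketch below; the app name, paths, and file count are made up for illustration, not my actual job:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ReadNanoBatches {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cooccurrence-read"))

    // Illustrative listing of the streaming output; in the real job the
    // 'part-xxxxx' paths come from the nano-batch output directories.
    val partFilePaths: Seq[String] =
      (0 until 1000).map(i => f"hdfs:///streaming-out/part-$i%05d")

    // sc.textFile accepts a comma-separated list of paths; each small file
    // usually maps to at least one partition.
    val lines = sc.textFile(partFilePaths.mkString(","))
    println(s"partitions after read: ${lines.partitions.length}")
  }
}
```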
The driver calls sc.textFile, then creates several intermediate RDDs and finally a DrmRdd[Int], which is passed into the cooccurrence calculation. From the read onward, every task runs only on the machine where the driver was launched. What would cause the read to execute only on the driver's machine? I've seen this with and without YARN. Do I need to do something to the RDD after reading it? Is some partitioning factor being inherited by all the derived RDDs?
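If an explicit redistribution after the read is in fact what's needed, this is roughly what I'd try; the helper name and the partition count here are just my guesses, not something I've confirmed fixes the locality problem:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper: redistribute the freshly read lines before building
// the intermediate RDDs and the DrmRdd[Int], so downstream work isn't stuck
// with thousands of tiny, driver-local partitions.
def spreadAcrossCluster(lines: RDD[String], numPartitions: Int): RDD[String] =
  lines.repartition(numPartitions) // forces a shuffle, so records should land on all executors
```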
