I'm using Spark Streaming to create a large volume of small nano-batch input files, ~4k per file, i.e. thousands of ‘part-xxxxx’ files. When reading the nano-batch files and doing a cooccurrence calculation, my tasks run only on the machine where the job was launched. I'm launching in “yarn-client” mode. The RDD is created with sc.textFile("list of thousand files").
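For reference, the read looks roughly like this (a minimal sketch, not the actual job; the paths and variable names here are made up):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("cooccurrence-read")
val sc = new SparkContext(conf)

// sc.textFile accepts a comma-separated list of paths (or a glob); the
// optional second argument is only a *minimum* partition hint.
val nanoBatchPaths: Seq[String] = Seq(
  "hdfs:///streaming/out/part-00000",
  "hdfs:///streaming/out/part-00001" /* , ... thousands more */)

val lines = sc.textFile(nanoBatchPaths.mkString(","), sc.defaultParallelism)
println(s"partitions after read: ${lines.partitions.length}")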

The driver runs the sc.textFile, then creates several intermediate RDDs and finally a DrmRdd[Int], which goes into the cooccurrence calculation. From the read onward, all tasks run only on the machine where the driver was launched.
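One way to confirm this (a hypothetical diagnostic, reusing the lines RDD from the sketch above) is to tag each partition with the hostname of the executor that processes it:

import java.net.InetAddress

// Record, per partition, which host processed it and how many records it held.
val hostCounts = lines.mapPartitions { iter =>
  Iterator((InetAddress.getLocalHost.getHostName, iter.size))
}.collect()

hostCounts.groupBy(_._1).foreach { case (host, parts) =>
  println(s"$host -> ${parts.length} partitions, ${parts.map(_._2).sum} records")
}

If everything really is on one node, this should print a single hostname.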

What would cause the read to occur only on the machine that launched the driver? I've seen this both with and without YARN.

Do I need to do something to the RDD after reading, such as repartitioning it (sketched below)? Is some partitioning from the read being inherited by all of the derived RDDs?
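For example, this is what I would try, assuming (and I'm not sure this is the actual cause) that the problem is partition placement rather than executor allocation; names carry over from the sketches above:

// Force a shuffle so the data is redistributed across the cluster before
// the intermediate RDDs and the DrmRdd[Int] are built.
val spread = lines
  .repartition(sc.defaultParallelism * 4)
  .cache()

spread.count()   // materialize once so downstream stages reuse the spread data

(defaultParallelism * 4 is an arbitrary guess, not a recommendation.)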
