That’s a good point. We’re starting with small data and scaling up, so I think of it as large, but right now it isn’t; I overstated it when I said thousands, it’s only hundreds of files at the moment.
I bet that’s it.

On Apr 23, 2015, at 10:26 AM, Andrew Musselman <[email protected]> wrote:

Not sure about your specific situation but it reminds me of wondering why a job only has one mapper assigned to it; is the total dataset big enough to require partitioning?

On Thursday, April 23, 2015, Pat Ferrel <[email protected]> wrote:

> Using Spark streaming to create a large volume of small nano-batch input
> files, ~4k per file, thousands of ‘part-xxxxx’ files. When reading the
> nano-batch files and doing a cooccurrence calculation my tasks run only on
> the machine where it was launched. I’m launching in “yarn-client” mode. The
> rdd is created using sc.textFile(“list of thousand files”)
>
> The driver launches the sc.textFile then creates several intermediate rdds
> and finally a DrmRdd[Int]. This goes into cooccurrence. From the read
> onward, all tasks run only on the machine where the driver was launched.
>
> What would cause the read to occur only on the machine that launched the
> driver? I’ve seen this with and without Yarn.
>
> Do I need to do something to the RDD after reading? Has some partition
> factor been applied to all derived rdds?
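For what it's worth, here is a minimal sketch of the kind of fix being discussed: giving sc.textFile a minPartitions hint and/or explicitly repartitioning after the read so that derived RDDs (and the downstream cooccurrence stages) are spread across the cluster. The input path and partition counts are placeholders, not anything from the actual job; with only hundreds of small files the defaults may well collapse to very few partitions, which is consistent with everything running on one node.

    import org.apache.spark.{SparkConf, SparkContext}

    object RepartitionSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("nano-batch-read")
        val sc = new SparkContext(conf)

        // Placeholder glob for the streaming output part files.
        val inputs = "hdfs:///streams/batch-*/part-*"

        // textFile takes an optional minPartitions hint; with many tiny
        // files the default can still produce very few partitions.
        val raw = sc.textFile(inputs, minPartitions = 16)

        // Explicitly repartition so later stages run cluster-wide rather
        // than staying where the driver launched. The multiplier is a
        // guess; 2-4x the total executor cores is a common rule of thumb.
        val spread = raw.repartition(sc.defaultParallelism * 2)

        println(s"partitions after repartition: ${spread.partitions.length}")
        sc.stop()
      }
    }

Note the repartition forces a shuffle, so derived RDDs built from `spread` inherit its partitioning instead of whatever the read produced.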
